Mother's Day Sentiment Analysis
In this notebook I explore a competition dataset of tweets reacting to Mother's Day.
The data comes from the HackerEarth Machine Learning challenge for Mother's Day. The following is the problem description.
You work in an event management company. On Mother's Day, your company has
organized an event where they want to cast positive Mother's Day related tweets
in a presentation. Data engineers have already collected the data related to
Mother's Day that must be categorized into positive, negative, and neutral
tweets.
You are appointed as a Machine Learning Engineer for this project. Your task is
to build a model that helps the company classify these sentiments of the tweets
into positive, negative, and neutral.
import requests
zip_file = requests.get('https://he-s3.s3.amazonaws.com/media/hackathon/hackerearth-test-draft-1-102/predicting-tweet-sentiments-231101b4/fa62f5d69a9f11ea.zip?Signature=v92IcNfljnopA9xQoCPCftwg1g0%3D&Expires=1590318817&AWSAccessKeyId=AKIA6I2ISGOYH7WWS3G5')
with open('data.zip', 'wb') as f:
    f.write(zip_file.content)
!unzip data.zip
%load_ext google.colab.data_table
from pathlib import Path
import pandas as pd
DATA_PATH = Path('dataset/')
train_data = pd.read_csv(DATA_PATH/'train.csv', index_col=0)
train_data.head(100)
train_data.info()
So there are four columns:
1. The tweet text
2. Language of the tweet
3. Number of retweets
4. Sentiment class (positive, negative, neutral)
There are also some missing values in the lang and retweet_count columns.
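The per-column missing counts can be checked directly with pandas. Here is a minimal sketch on a toy frame mimicking the dataset's columns (the column names follow the source; the values are made up):

```python
import numpy as np
import pandas as pd

# Toy frame with the same columns as the competition data
df = pd.DataFrame({
    'original_text': ['Happy Mothers Day!', 'Best mom ever', 'Miss you mom'],
    'lang': ['en', None, 'en'],
    'retweet_count': [3.0, np.nan, 1.0],
    'sentiment_class': [1, 1, -1],
})

# Count missing entries per column
missing = df.isna().sum()
print(missing)
```

On the real data, `train_data.isna().sum()` gives the same breakdown that `train_data.info()` hints at.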
Let's look at the number of languages of tweets this dataset has
train_data['lang'].value_counts()
I see that most tweets are in English, but it seems that some entries in the lang column are not actually indicating any language.
print('Total other-language tweets: ', train_data.shape[0] - 2994)
Let's see if they are actually in different languages.
train_data.loc[train_data['lang'] != 'en', :]
I think all the tweets are actually in English. You can see that the en value representing English is misplaced into other columns such as retweet_count and original_author, which are in turn filled with random float numbers or fragments of other tweet text; perhaps the data was not scraped properly.
However, from what I can tell, we can ignore all columns other than original_text, which holds the tweet text and is the most important input for sentiment analysis. You can also see that almost all of the texts have some link embedded in them, and those links are unlikely to help in determining the tweeter's sentiment.
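The misalignment described above can be isolated by filtering for rows where the language tag landed in the wrong column. A small sketch (column names follow the dataset; the rows here are fabricated to illustrate the shift):

```python
import pandas as pd

# Toy rows illustrating the misalignment: 'en' spilling into retweet_count
df = pd.DataFrame({
    'original_text': ['tweet a', 'tweet b'],
    'lang': ['en', '0.5'],
    'retweet_count': [2, 'en'],
})

# Rows where the language tag ended up in the wrong column
shifted = df.loc[df['retweet_count'] == 'en']
print(shifted)
```

On the real data, `train_data.loc[train_data['retweet_count'] == 'en']` would surface the same kind of shifted rows.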
The pattern here is that most of the links are either images, which start with pic.twitter.*, or links to other sites such as http://www.instagram.*. We should be able to capture this pattern with a regular expression. Let's test that assumption.
import re
sample_tweet = """
Happy Mothers Day Mothers are very valuable to the society because they build families that make up the general population of every nation.
They also contribute immensely to nation building and capacity building as caregivers.....
https://www. facebook.com/36426361058377 0/posts/1130869587256498/ … #happymothersday2020 pic.twitter.com/ZCZOF1xb6K
wo"""
print(re.sub(r'(https?:\/\/.*|pic.*)[\r\n]*', '', sample_tweet))
This removes most of the links, but it also removes any text that sits between the links on the same line; in the case above, the #happymothersday2020 hashtag is removed along with them.
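If preserving those hashtags mattered, one option would be to match only the non-whitespace run after each link prefix instead of the rest of the line. A sketch of that alternative (not the regex used in the rest of this notebook):

```python
import re

sample = ("Happy Mothers Day "
          "https://www.facebook.com/364/posts/113/ #happymothersday2020 "
          "pic.twitter.com/ZCZOF1xb6K")

# \S+ stops at whitespace, so hashtags between links survive
cleaned = re.sub(r'(https?://\S+|pic\.twitter\.com/\S+)', '', sample)
print(cleaned)
```

For a word cloud the lost hashtags matter little, so the simpler line-eating regex is kept below.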
Let's apply the regex to the texts
(train_data['original_text'].replace({r'(https?:\/\/.*|pic.*)[\r\n]*': ''}, regex=True).to_frame()).head()
import numpy as np
from io import BytesIO
from PIL import Image
from PIL.ImageOps import invert
img_file = requests.get('http://www.agirlandagluegun.com/wp-content/uploads/blogger/-ox_bazyTgmQ/TcNwpMfduLI/AAAAAAAAOX4/hcxXcz0A8-A/s1600/scan0001.jpg')
img = BytesIO(img_file.content)
img_mask = np.array(Image.open(img))
# Check if the image was downloaded properly
img_mask.shape
# From https://www.datacamp.com/community/tutorials/wordcloud-python
def transform_format(val):
    # Invert pixels: any channel at 255 (white) maps to 0, everything else to 255
    if any(v == 255 for v in val):
        return 0
    else:
        return 255
# Transform the mask into a new one that will work with the function:
transformed_mask = np.ndarray((img_mask.shape[0], img_mask.shape[1]), np.int32)
for i in range(len(img_mask)):
    transformed_mask[i] = list(map(transform_format, img_mask[i]))
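The same per-pixel inversion can be done without the Python loop using NumPy's vectorized operations, which is much faster on a full-size image. A minimal sketch on a tiny stand-in array:

```python
import numpy as np

# Tiny stand-in for an RGB mask: one white pixel, one dark pixel
img_mask = np.array([[[255, 255, 255], [10, 20, 30]]], dtype=np.uint8)

# Vectorized equivalent of transform_format: any channel at 255 -> 0, else 255
transformed = np.where((img_mask == 255).any(axis=-1), 0, 255).astype(np.int32)
print(transformed)
```

`(img_mask == 255).any(axis=-1)` collapses the channel axis, mirroring the `any(v == 255 for v in val)` test in the loop above.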
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
wc = WordCloud(background_color="white", max_words=2000, mask=transformed_mask,
stopwords=set(STOPWORDS), contour_width=1, contour_color='steelblue')
# generate word cloud
wc.generate('\n'.join(train_data['original_text'].values.tolist()))
%matplotlib inline
plt.figure(figsize=(18,18))
plt.axis("off")
plt.imshow(wc)
Now we can try to model this text with any method we like, and that is coming up next.
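As a first sketch of what that modeling step could look like, here is a simple TF-IDF plus logistic regression baseline on a few made-up labeled tweets (the labels 1/0/-1 standing in for positive/neutral/negative are an assumption; the actual dataset's label encoding should be checked):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny labeled sample standing in for the cleaned tweets
texts = ['Happy Mothers Day, love you mom',
         'I miss my mother so much, so sad today',
         'Stores are open on Mothers Day',
         'What a wonderful mom, so happy']
labels = [1, -1, 0, 1]  # assumed encoding: 1 positive, 0 neutral, -1 negative

# TF-IDF features feeding a logistic regression classifier
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)
pred = clf.predict(['so happy, love you mom'])[0]
print(pred)
```

On the real data this would be fit on the cleaned `original_text` column against the sentiment class, with a proper train/validation split.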