This notebook explores the data from the HackerEarth Machine Learning challenge for Mother's Day. The following is the problem description.

You work in an event management company. On Mother's Day, your company has
organized an event where they want to cast positive Mother's Day related tweets
in a presentation. Data engineers have already collected the data related to
Mother's Day that must be categorized into positive, negative, and neutral
tweets.

You are appointed as a Machine Learning Engineer for this project. Your task is
to build a model that helps the company classify these sentiments of the tweets
into positive, negative, and neutral.

Download the data

import requests

zip_file = requests.get('https://he-s3.s3.amazonaws.com/media/hackathon/hackerearth-test-draft-1-102/predicting-tweet-sentiments-231101b4/fa62f5d69a9f11ea.zip?Signature=v92IcNfljnopA9xQoCPCftwg1g0%3D&Expires=1590318817&AWSAccessKeyId=AKIA6I2ISGOYH7WWS3G5')

with open('data.zip', 'wb') as f:
    f.write(zip_file.content)
!unzip data.zip
Archive:  data.zip
   creating: dataset/
  inflating: dataset/train.csv       
  inflating: dataset/test.csv        
%load_ext google.colab.data_table
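
The download URL above carries an Expires parameter, so the request may fail once the link is no longer valid, and the code above would then write an error page into data.zip. A minimal sketch of a more defensive download follows. It assumes the standard library's urllib.request in place of requests (so it has no third-party dependency) and Python's zipfile in place of the !unzip shell command. The helper names fetch_zip_bytes and extract_zip_bytes are hypothetical and not part of the original notebook.

```python
import io
import urllib.request
import zipfile


def fetch_zip_bytes(url, timeout=30):
    """Download a URL and return the raw bytes.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses,
    so an expired link fails loudly instead of saving an error page.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def extract_zip_bytes(content, target_dir='.'):
    """Extract an in-memory zip archive and return the member names."""
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        archive.extractall(target_dir)
        return archive.namelist()
```

With these helpers, the download step could be written as `extract_zip_bytes(fetch_zip_bytes(url))`, where `url` is the link used above.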

Peek at the data

from pathlib import Path
import pandas as pd

DATA_PATH = Path('dataset/')

train_data = pd.read_csv(DATA_PATH/'train.csv', index_col=0)
train_data.head(100)
original_text lang retweet_count original_author sentiment_class
id
1.245025e+18 Happy #MothersDay to all you amazing mothers o... en 0 BeenXXPired 0
1.245759e+18 Happy Mothers Day Mum - I'm sorry I can't be t... en 1 FestiveFeeling 0
1.246087e+18 Happy mothers day To all This doing a mothers ... en 0 KrisAllenSak -1
1.244803e+18 Happy mothers day to this beautiful woman...ro... en 0 Queenuchee 0
1.244876e+18 Remembering the 3 most amazing ladies who made... en 0 brittan17446794 -1
... ... ... ... ... ...
1.244092e+18 Happy Mothers Day from all of us at Kellyzola ... en 1 design_pia 0
1.246529e+18 Happy Mothers Day katiebrooks_88 . I do my lev... en 1 missny99 0
1.244747e+18 RESPECT, GRATITUDE and ADORATION to all MOTHER... en 0 EgbertsTreasure 0
1.245141e+18 It takes someone really brave to be a mother, ... en 0 momaferd -1
1.245903e+18 On Mother's Day, I'm sharing this video again ... en 0 GotMommyBrain -1

100 rows × 5 columns

train_data.info()
<class 'pandas.core.frame.DataFrame'>
Float64Index: 3235 entries, 1.24502457848689e+18 to 1.24540908968687e+18
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   original_text    3235 non-null   object
 1   lang             3231 non-null   object
 2   retweet_count    3231 non-null   object
 3   original_author  3235 non-null   object
 4   sentiment_class  3235 non-null   int64 
dtypes: int64(1), object(4)
memory usage: 151.6+ KB

So there are five columns:

1. The tweet text
2. Language of the tweet
3. Number of retweets
4. Author of the tweet
5. Sentiment class (positive, negative, or neutral)

There are also some missing values in the lang and retweet_count columns (3231 non-null entries out of 3235 in each).

Let's look at which languages the tweets in this dataset are tagged with.

train_data['lang'].value_counts()
en                            2994
 pink Peruvian opal! via         4
 Find More                       2
&gt                              2
WORLDS OKAYEST MOTHER! &lt       2
                              ... 
0.4754834129                     1
-0.0064143617                    1
-0.3850425633                    1
0.7885519508                     1
-0.2758448854                    1
Name: lang, Length: 232, dtype: int64

I see that most tweets are in English, but some entries in the lang column do not indicate a language at all.

print('Total Other language tweets: ', (train_data['lang'] != 'en').sum())
Total Other language tweets:  241
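
The value counts above already hint that many of the non-en entries are numbers or text fragments rather than language codes. A minimal sketch for bucketing the raw lang values is shown below. The helper classify_lang and the sample_values list are hypothetical, added only for illustration. In the notebook the same helper could be applied to train_data['lang'].

```python
import math
from collections import Counter


def classify_lang(value):
    """Bucket a raw `lang` entry (hypothetical helper, not from the original notebook)."""
    # pandas represents missing cells as float NaN
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return 'missing'
    text = str(value).strip()
    if text == 'en':
        return 'en'
    try:
        float(text)  # entries such as '-0.0138325017' parse as numbers
        return 'numeric'
    except ValueError:
        return 'other text'


# Illustrative values modeled on the value_counts output above
sample_values = ['en', '-0.0138325017', ' Find More', '&gt', float('nan'), '0.5309553602']
counts = Counter(classify_lang(v) for v in sample_values)
print(counts)
```

Running `Counter(classify_lang(v) for v in train_data['lang'])` would give a quick breakdown of how many rows hold a number or stray text where a language code is expected.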

Let's see if they are actually in different languages.

train_data.loc[train_data['lang'] != 'en', :]
original_text lang retweet_count original_author sentiment_class
id
1.244590e+18 Happy mothersday to all those celebrating toda... -0.0138325017 en 11 0
1.244823e+18 Exactly what my late mum aka hype mama would d... -0.9677309496 en 0 0
1.246515e+18 It's the world's most difficult job No sick le... -0.3876905537 en 1 0
1.244226e+18 Happy Mother’s Day! To all the amazing Mums ou... 0.5309553602 en 0 0
1.244419e+18 Happy Mothers Day , Mummy! Nearly 90 and still... -0.045423609 en 2 0
... ... ... ... ... ...
1.246356e+18 Happy Mothers Day All My Nigerian Massive Fami... 0.2117897904 en 0 0
1.245821e+18 HAPPY MOTHERS DAY ! HAPPY MOTHERS DAY !!Now th... -0.8739088126 en 0 0
1.246719e+18 Isan Elba celebrates Happy Mothers’ Day with h... 0.4945825935 en 0 0
1.245565e+18 Still miss my mom she passed 18th of March 201... 0.6927740873 0 0 0
1.245871e+18 I’m so thankful for my 5, healthy happy joy br... 0.2522315249 en 1 0

241 rows × 5 columns

I think all the tweets are in English. In these rows the en value that marks English has been shifted out of lang and into the retweet_count or original_author column, while lang itself holds either a seemingly random float or a fragment of some other tweet text. Maybe the data was not scraped properly?

However, from what I know, we can ignore every column other than original_text, which holds the tweet text and matters most for sentiment analysis. Almost all of the texts have a link embedded in them, and these links are unlikely to help in working out the sentiment of the tweeter.

The pattern here is that most of the links are either images, which start with pic.twitter.*, or links to other sites such as http://www.instagram.*. We should be able to match this pattern with a regular expression. Let's test that assumption.

Cleaning links

import re

sample_tweet = """
Happy Mothers Day Mothers are very valuable to the society because they build families that make up the general population of every nation.
They also contribute immensely to nation building and capacity building as caregivers.....
https://www. facebook.com/36426361058377 0/posts/1130869587256498/ … #happymothersday2020 pic.twitter.com/ZCZOF1xb6K
"""
print(re.sub(r'(https?://.*|pic.*)[\r\n]*', '', sample_tweet))
Happy Mothers Day Mothers are very valuable to the society because they build families that make up the general population of every nation.
They also contribute immensely to nation building and capacity building as caregivers.....

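This removes most of the links, but because .* runs to the end of the line, it also removes any text that follows the first link on that line. In the case above, the #happymothersday2020 hashtag sitting between the two links is removed as well. The pic.* branch is also loose: it matches any occurrence of pic, including ordinary words such as picture.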
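
If keeping hashtags matters, a narrower pattern can remove one whitespace-delimited link token at a time instead of the rest of the line. The sketch below is one possible variant, not the pattern used in this notebook. The names LINK_TOKEN and strip_link_tokens and the example string are hypothetical. It assumes each link is a single token, so links that were split by spaces during scraping (as in the Facebook URL above) would leave fragments behind.

```python
import re

# Hypothetical narrower pattern: match a single URL-like token, not the rest of the line
LINK_TOKEN = re.compile(r'(?:https?://\S+|pic\.twitter\.com/\S+)')


def strip_link_tokens(text):
    """Remove link tokens but keep surrounding words and hashtags."""
    without_links = LINK_TOKEN.sub(' ', text)
    # Collapse the runs of spaces or tabs left behind by the removal
    return re.sub(r'[ \t]+', ' ', without_links).strip()


# Illustrative tweet with an unbroken link, a hashtag, and an image link
example = ('Happy Mothers Day https://example.com/post '
           '#happymothersday2020 pic.twitter.com/ZCZOF1xb6K')
cleaned = strip_link_tokens(example)
print(cleaned)
```

The trade-off is that this version is stricter about what counts as a link, so it keeps more text but cleans up malformed links less thoroughly than the greedy pattern.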

Let's apply the regex to the texts

(train_data['original_text'].replace({r'(https?://.*|pic.*)[\r\n]*': ''}, regex=True).to_frame()).head()
original_text
id
1.245025e+18 Happy #MothersDay to all you amazing mothers o...
1.245759e+18 Happy Mothers Day Mum - I'm sorry I can't be t...
1.246087e+18 Happy mothers day To all This doing a mothers ...
1.244803e+18 Happy mothers day to this beautiful woman...ro...
1.244876e+18 Remembering the 3 most amazing ladies who made...

Visualize the words

Prepare the mask

import numpy as np
from io import BytesIO
from PIL import Image
from PIL.ImageOps import invert

img_file = requests.get('http://www.agirlandagluegun.com/wp-content/uploads/blogger/-ox_bazyTgmQ/TcNwpMfduLI/AAAAAAAAOX4/hcxXcz0A8-A/s1600/scan0001.jpg')

img = BytesIO(img_file.content)

img_mask = np.array(Image.open(img))
# Check if the image was downloaded properly
img_mask.shape
(718, 1600, 3)
# From https://www.datacamp.com/community/tutorials/wordcloud-python
def transform_format(val):
    # Just trying to invert pixels
    if any(v == 255 for v in val):
        return 0
    else:
        return 255

# Transform the mask into a new one that will work with the function:
transformed_mask = np.ndarray((img_mask.shape[0], img_mask.shape[1]), np.int32)

for i in range(len(img_mask)):
    transformed_mask[i] = list(map(transform_format, img_mask[i]))
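
The per-pixel loop above can also be written as a single vectorized NumPy expression, which is usually much faster on an image of this size. The sketch below assumes the same rule as transform_format (0 where any channel equals 255, otherwise 255). The function name invert_mask and the tiny demo array are hypothetical, used only to illustrate the mapping.

```python
import numpy as np


def invert_mask(rgb_image):
    """Return a 2-D int32 mask: 0 where any channel equals 255, else 255."""
    # Reduce over the last (channel) axis to get one boolean per pixel
    has_white_channel = (rgb_image == 255).any(axis=-1)
    return np.where(has_white_channel, 0, 255).astype(np.int32)


# Tiny hypothetical 1x3 RGB image: white, black, and a pixel with one channel at 255
demo = np.array([[[255, 255, 255], [0, 0, 0], [255, 10, 10]]], dtype=np.uint8)
demo_mask = invert_mask(demo)
print(demo_mask)
```

In the notebook, `invert_mask(img_mask)` should produce the same array as the loop that fills transformed_mask.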
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

wc = WordCloud(background_color="white", max_words=2000, mask=transformed_mask,
               stopwords=set(STOPWORDS), contour_width=1, contour_color='steelblue')

# generate word cloud
wc.generate('\n'.join(train_data['original_text'].values.tolist()))
<wordcloud.wordcloud.WordCloud at 0x7f0482536358>
%matplotlib inline

plt.figure(figsize=(18,18))
plt.axis("off")
plt.imshow(wc)
<matplotlib.image.AxesImage at 0x7f04825f85f8>

Now we can try to model this text with any method we like, and that is coming up next.