Mother's Day Sentiment Analysis - with spaCy
In this notebook I use a competition dataset of tweets reacting to Mother's Day and classify their sentiment with spaCy.
- Setup paths
- Class distribution
- Deep Learning approach with spaCy
- Text classifier with spaCy
- Prediction on test data
We will use the method from my previous post to clean the text.
from pathlib import Path
import pandas as pd
DATA_PATH = Path('dataset/')
DRIVE_PATH = Path(r"/content/drive/My Drive/Spacy/Pretrained")
train_data = pd.read_csv(DATA_PATH/'train.csv', index_col=0)
train_data.head()
Let's check the average length of the text before cleaning.
print(sum(
train_data['original_text'].apply(len).tolist()
)/train_data.shape[0])
train_data['original_text'].replace(
# Regex is match : the text to replace with
{'(https?:\/\/.*|pic.*)[\r\n]*' : ''},
regex=True, inplace=True)
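To sanity-check the pattern before trusting it on the whole column, here is a quick sketch on a made-up tweet (the sample text is invented, not from the dataset):

```python
import re

# Same pattern as the cleaning step above
pattern = r'(https?:\/\/.*|pic.*)[\r\n]*'
# A made-up tweet with a trailing link
sample = "Happy Mother's Day to all! https://t.co/abc123"
cleaned = re.sub(pattern, '', sample)
# The link and everything after it are stripped
```

Note the pattern is greedy: it removes everything from the first `https://` (or `pic`) to the end of the text, which is fine for trailing links but would also eat any text that follows one.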
Let's check the average length again.
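A small helper for the re-check (assuming `train_data` is still in memory; pass its `original_text` column):

```python
def mean_text_length(series):
    # Average character length of a pandas Series of strings
    return series.str.len().mean()

# e.g. mean_text_length(train_data['original_text'])
```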
The regex did its job, I suppose.
train_data.head()
In my previous exploratory post I have seen the data, and I think the features other than the text may not be required, i.e.
- lang
- retweet_count
- original_author

The sentiment_class column is encoded as:
- 0 must mean Neutral
- 1 means Positive
- -1 means Negative
train_data['sentiment_class'].value_counts().plot(kind='bar')
Let's see some sentences with negative examples; I am interested in why they should be negative on a happy day (Mother's Day).
from pprint import pprint

list_of_neg_sents = train_data.loc[train_data['sentiment_class'] == -1, 'original_text'].tolist()
pprint(list_of_neg_sents[:5])
Well some tweets actually express their feelings for their deceased mothers. This is understandable.
We can use traditional NLP methods or deep learning methods to model the text. We will try the deep learning approach in this notebook.
It's recommended here that language model pretraining is one way to improve the performance of the classifier.
spaCy requires input in .jsonl format for pretraining. We will get the texts from the dataframe and store them in a jsonl file; more about that here.
We can also load the test data to get some more samples for the pretraining. This will not cause data leakage because we are not giving any labels to the model.
test_data = pd.read_csv(DATA_PATH/'test.csv', index_col=0)
test_data.head()
Let's clean the test set of links as well.
test_data['original_text'].replace(
# Regex pattern to match : the text to replace with
{'(https?:\/\/.*|pic.*)[\r\n]*' : ''},
regex=True, inplace=True)
test_data.shape
texts_series = pd.concat([train_data['original_text'], test_data['original_text']], axis='rows')
Let's check the length
texts_series.shape[0], train_data.shape[0]+test_data.shape[0]
So now we can use this texts_series to create the jsonl file.
list_of_texts = [
# Form dictionary with 'text' key
{'text': value}
for _, value in texts_series.items()
]
I will use srsly to write this list of dictionaries to a jsonl file.
import srsly
# saving to my Google drive
srsly.write_jsonl(DRIVE_PATH/'pretrain_texts.jsonl', list_of_texts)
We can see a few lines from the saved file.
from pprint import pprint
with Path(DRIVE_PATH/'pretrain_texts.jsonl').open() as f:
lines = [next(f) for x in range(5)]
pprint(lines)
We should download a pretrained model to use. Here I am using en_core_web_md from spaCy.
This can be confusing: why should I train a pretrained model if I can just download one? The idea is that the downloaded pretrained model was trained on a very different type of dataset, but it already has some knowledge of interpreting words in English sentences.
Here, however, we have a dataset of tweets, which the downloaded model may or may not have seen during its training. So we use our dataset to fine-tune the downloaded model, so that with minimal training it can start understanding the tweets right away.
!python -m spacy download en_core_web_md
%%bash
# Command to pretrain a language model
# Path to jsonl file with data
# Using md model as the base
# saving the model on my Drive folder
# training for 50 iterations with seed set to 0
python -m spacy pretrain /content/drive/My\ Drive/Spacy/Pretrained/pretrain_texts.jsonl \
/usr/local/lib/python3.6/dist-packages/en_core_web_md/en_core_web_md-2.2.5 \
/content/drive/My\ Drive/Spacy/Pretrained/ \
-i 50 -s 0
I have chosen to use the default parameters; however, one might need to change them for their own problem.
We can see from the logs that the loss value in the last iteration is 18639, but since the batch_size was 3000 and there are 4622 texts, our data must have been split into 2 batches. So we should also take the previous log entry into account, which shows a loss of 33658; the average of the two is 26148.5. This number might be intimidating, but the only way to check whether the pretraining actually helps is to train a model on top of it.
If it doesn't help, we can resume the pretraining from the model saved at the last epoch.
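The averaging above, spelled out (assuming exactly two batches per epoch, since 4622 texts at a batch size of 3000 gives two batches):

```python
# Last two batch losses of the final pretraining epoch, taken from the logs above
last_two_losses = [33658, 18639]
avg_epoch_loss = sum(last_two_losses) / len(last_two_losses)
# 26148.5
```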
We keep only the last model from the pretraining.
Let's train a text classifier with spaCy
Now that we have a pretrained model, we need to prepare data for training the text classifier. Let's have a look at the data format that spaCy expects.
{
"entities": [(0, 4, "ORG")],
"heads": [1, 1, 1, 5, 5, 2, 7, 5],
"deps": ["nsubj", "ROOT", "prt", "quantmod", "compound", "pobj", "det", "npadvmod"],
"tags": ["PROPN", "VERB", "ADP", "SYM", "NUM", "NUM", "DET", "NOUN"],
"cats": {"BUSINESS": 1.0},
}
This format works for training via code, as given in the examples above; there is also another format mentioned here.
cats is the only part we need to worry about; this must be where spaCy looks for the categories/classes.
We have three classes in our dataset:
- 0 for Neutral
- 1 for Positive
- -1 for Negative

and they are mutually exclusive (there can be only one label per sentence).
We also need to split the training data we have to training and evaluation sets so that we can see how well our model has learnt the problem.
Let's try to programmatically generate the training data from the pandas dataframe.
label_map = {1:'POSITIVE', -1:'NEGATIVE', 0:'NEUTRAL'}
We need a list of tuples of the text and its annotation details in a dictionary, as mentioned above.
train_json = [
# Get the text from dataframe row
(tweet.original_text,
{'cats':{
label_map[tweet.sentiment_class]:1.0
}
})
for tweet in train_data[['original_text', 'sentiment_class']].itertuples(index=False, name='Tweet')
]
train_json[0]
Now we will split the training data
from sklearn.model_selection import train_test_split
# Stratified split with labels
train_split, eval_split = train_test_split(train_json, test_size=0.2,
stratify=train_data['sentiment_class'])
len(train_split), len(eval_split)
We should save them as json files to give as input to the command line train utility in spaCy.
import json
with Path(DRIVE_PATH/'train_clas.json').open('w') as f:
json.dump(train_split, f)
with Path(DRIVE_PATH/'eval_clas.json').open('w') as f:
json.dump(eval_split, f)
Now we should check whether our data is suitable for training the model with spaCy's train CLI command; for that I will use spaCy's debug-data CLI command.
!python -m spacy debug-data -h
%%bash
(python -m spacy debug-data en \
/content/drive/My\ Drive/Spacy/Pretrained/train_clas.json \
/content/drive/My\ Drive/Spacy/Pretrained/eval_clas.json \
-p 'textcat' \
)
There must be something I missed. I asked a question on Stack Overflow regarding this; it turns out we need the .jsonl format (again) and then use the script provided in the repo to convert it to the required json format for training. So I need to change the data generation a little bit.
train_jsonl = [
# Get the text from dataframe row
{'text': tweet.original_text,
'cats': {v: 1.0 if tweet.sentiment_class == k else 0.0
for k, v in label_map.items()},
'meta':{"id": str(tweet.Index)}
}
for tweet in train_data[['original_text', 'sentiment_class']].itertuples(index=True, name='Tweet')
]
train_jsonl[0]
So instead of a list of tuples, I now have a list of dictionaries. We need to split again to have an evaluation set.
train_split, eval_split = train_test_split(train_jsonl, test_size=0.2,
stratify=train_data['sentiment_class'])
len(train_split), len(eval_split)
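Before the conversion script can run, the splits have to land on disk as jsonl. A minimal stdlib sketch of the writer (`srsly.write_jsonl` from earlier would do the same job); the filenames `train_texts.jsonl` / `eval_texts.jsonl` match the paths used in the conversion commands:

```python
import json
from pathlib import Path

def write_jsonl(path, records):
    # One JSON object per line - the shape the conversion script expects
    with Path(path).open('w') as f:
        for rec in records:
            f.write(json.dumps(rec) + '\n')

# e.g. write_jsonl(DRIVE_PATH.parent / 'train_texts.jsonl', train_split)
#      write_jsonl(DRIVE_PATH.parent / 'eval_texts.jsonl', eval_split)
```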
We still need to convert the jsonl to the required json format; for that I will use the script named textcatjsonl_to_trainjson.py in this repo. Let's download the script from the repo.
!wget -O script.py https://raw.githubusercontent.com/explosion/spaCy/master/examples/training/textcat_example_data/textcatjsonl_to_trainjson.py
%%bash
python script.py -m en /content/drive/My\ Drive/Spacy/train_texts.jsonl /content/drive/My\ Drive/Spacy
python script.py -m en /content/drive/My\ Drive/Spacy/eval_texts.jsonl /content/drive/My\ Drive/Spacy
Let's try to debug again
It worked! Thanks to the answerer of this question, we now know that our data format is correct. It turns out there is also another command to convert our files to spaCy's JSON format, which is mentioned here.
The output is pointing out that the evaluation set has some data leakage. I will try to remove that now.
new_eval = [annot for annot in eval_split if all([annot['text'] != t['text'] for t in train_split])]
len(new_eval), len(eval_split)
We thought there were 5 samples leaking into the training data, but it is six here; anyway, let's try to validate the data again.
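As a side note, the `all(...)` comparison above scans every train text for every eval text, which is quadratic. A set-based variant (a sketch giving the same result) scales much better on larger splits:

```python
def drop_leaked(eval_split, train_split):
    # Set membership instead of a quadratic scan: keep eval rows
    # whose text never appears in the training split
    train_texts = {t['text'] for t in train_split}
    return [annot for annot in eval_split if annot['text'] not in train_texts]
```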
%%bash
(python -m spacy debug-data en \
/content/drive/My\ Drive/Spacy/train_texts.json \
/content/drive/My\ Drive/Spacy/eval_texts.json \
-p 'textcat' \
)
We are all set to start training now!
I have put together the command to train in the CLI; please refer to the comments for details, in the order of the arguments given here.
%%bash
## Argument info
# Language of the text the model is going to be trained on
# Path to store the model
# Training data json path
# Evaluation data json path
# Pipeline components that we are going to train
# Number of iterations in total
# Number of iterations to wait for improvement in eval accuracy
# Pretrained weights to start with
# Version
# Augmentation for the data (2 params)
# Model architecture for the text classifier (cnn + bow)
(python -m spacy train \
en \
-b en_core_web_sm \
/content/drive/My\ Drive/Spacy/Classifier \
/content/drive/My\ Drive/Spacy/train_texts.json \
/content/drive/My\ Drive/Spacy/eval_texts.json \
-p "textcat" \
-n 100 \
-ne 10 \
-t2v /content/drive/My\ Drive/Spacy/Pretrained/fifty_iter/model49.bin \
-V 0.1 \
-nl 0.1 \
-ovl 0.1)
I also tried to train without the base model (i.e. en_core_web_sm). The logs for that are below (uncollapse to view); the results are not very different, and the evaluation metrics are through the roof. We need to predict on the test data and submit to the competition for a better picture of the model.
%%bash
## Argument info
# Language of the text the model is going to be trained on
# Path to store the model
# Training data json path
# Evaluation data json path
# Pipeline components that we are going to train
# Number of iterations in total
# Number of iterations to wait for improvement in eval accuracy
# Pretrained weights to start with
# Version
# Augmentation for the data (2 params)
# Model architecture for the text classifier (cnn + bow)
(python -m spacy train \
en \
/content/drive/My\ Drive/Spacy/Classifier_without_using_websm \
/content/drive/My\ Drive/Spacy/train_texts.json \
/content/drive/My\ Drive/Spacy/eval_texts.json \
-p "textcat" \
-n 100 \
-ne 10 \
-t2v /content/drive/My\ Drive/Spacy/Pretrained/fifty_iter/model49.bin \
-V 0.1 \
-nl 0.1 \
-ovl 0.1)
test_data = pd.read_csv(DATA_PATH/'test.csv', index_col=0)
test_data.head()
We will clean the test data of links with regex as well.
test_data['original_text'].replace(
# Regex pattern to match : the text to replace with
{'(https?:\/\/.*|pic.*)[\r\n]*' : ''},
regex=True, inplace=True)
test_data.shape
list_of_test_texts = test_data['original_text'].tolist()
Let's load the Spacy model from our training
import spacy
textcat_mod = spacy.load(DRIVE_PATH.parent/'Classifier/model-best')
I will try to speed up the prediction by using multithreading, as mentioned here.
d = textcat_mod(list_of_test_texts[0])
d.cats
max(d.cats, key=lambda x: d.cats[x])
label_map = {'POSITIVE':1, 'NEGATIVE':-1, 'NEUTRAL':0}
preds = []
for doc in textcat_mod.pipe(list_of_test_texts, n_threads=4, batch_size=100):
pred_cls = max(doc.cats, key=lambda x: doc.cats[x])
preds.append(label_map[pred_cls])
len(preds), len(list_of_test_texts)
Let's form the submission
sub_df = pd.DataFrame(
preds,
index=test_data.index,
columns=['sentiment_class']
)
sub_df.shape
sub_df.head()
sub_df.to_csv(DRIVE_PATH.parent/'submission.csv')
The submitted predictions scored a mere 39/100 in weighted F1-score; that's disappointing. -_-
Let's analyze the predictions
sub_df['sentiment_class'].value_counts().plot(kind='bar')
sub_df['sentiment_class'].value_counts()
This looks very similar to the distribution in the train data.
train_data['sentiment_class'].value_counts()
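Raw counts are hard to compare across differently sized sets; normalized shares put the train labels and the predictions on the same scale. A sketch with made-up labels (in the notebook, pass train_data['sentiment_class'] and sub_df['sentiment_class'] instead):

```python
import pandas as pd

def class_share(series):
    # Fraction of samples per class, indexed by class label
    return series.value_counts(normalize=True).sort_index()

# Toy stand-ins for the real label Series
train_labels = pd.Series([0, 0, 0, 1, 1, -1])
pred_labels = pd.Series([0, 0, 0, 1, 1, -1])
comparison = pd.DataFrame({'train': class_share(train_labels),
                           'pred': class_share(pred_labels)})
```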
What could have gone wrong? I guess what I can do is try another, traditional method. Coming up in another post.