Setup paths

We will use the method from my previous post to clean the text.

from pathlib import Path
import pandas as pd

DATA_PATH = Path('dataset/')
DRIVE_PATH = Path(r"/content/drive/My Drive/Spacy/Pretrained")
train_data = pd.read_csv(DATA_PATH/'train.csv', index_col=0)
train_data.head()
original_text lang retweet_count original_author sentiment_class
id
1.245025e+18 Happy #MothersDay to all you amazing mothers o... en 0 BeenXXPired 0
1.245759e+18 Happy Mothers Day Mum - I'm sorry I can't be t... en 1 FestiveFeeling 0
1.246087e+18 Happy mothers day To all This doing a mothers ... en 0 KrisAllenSak -1
1.244803e+18 Happy mothers day to this beautiful woman...ro... en 0 Queenuchee 0
1.244876e+18 Remembering the 3 most amazing ladies who made... en 0 brittan17446794 -1

Let's check the average length of the text before cleaning.

print(sum(
    train_data['original_text'].apply(len).tolist()
)/train_data.shape[0])
227.42102009273572
train_data['original_text'].replace(
    # Regex pattern to match : the replacement text (strip URLs and pic links)
    {r'(https?:\/\/.*|pic.*)[\r\n]*' : ''},
    regex=True, inplace=True)

Let's check the average length again.
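This is the same computation as before; the exact number isn't shown here, but it should come out noticeably smaller now that the URLs and pic links are stripped.

# Average tweet length after removing links
print(sum(
    train_data['original_text'].apply(len).tolist()
)/train_data.shape[0])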

The regex did its job, I suppose.

train_data.head()
original_text lang retweet_count original_author sentiment_class
id
1.245025e+18 Happy #MothersDay to all you amazing mothers o... en 0 BeenXXPired 0
1.245759e+18 Happy Mothers Day Mum - I'm sorry I can't be t... en 1 FestiveFeeling 0
1.246087e+18 Happy mothers day To all This doing a mothers ... en 0 KrisAllenSak -1
1.244803e+18 Happy mothers day to this beautiful woman...ro... en 0 Queenuchee 0
1.244876e+18 Remembering the 3 most amazing ladies who made... en 0 brittan17446794 -1

In my previous exploratory post, I looked at the data, and I think the features other than the text may not be required (a quick sketch of dropping them follows the list), i.e.

  • lang
  • retweet_count
  • original_author
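If we wanted to drop these unused columns explicitly, a minimal sketch would be the following; the rest of the notebook simply selects the columns it needs, so this step is optional.

# Optional: drop the columns we will not use for modelling
train_data = train_data.drop(columns=['lang', 'retweet_count', 'original_author'])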

Class distribution

  • 0 must mean Neutral
  • 1 means Positive
  • -1 means Negative
train_data['sentiment_class'].value_counts().plot(kind='bar')
[Bar plot of the sentiment_class counts: class 0 is by far the largest, with 1 and -1 roughly equal.]

Let's see some of the negative examples; I am curious why tweets should be negative on a happy day (Mother's Day).

from pprint import pprint

list_of_neg_sents = train_data.loc[train_data['sentiment_class'] == -1, 'original_text'].tolist()

pprint(list_of_neg_sents[:5])
['Happy mothers day To all This doing a mothers days work. Today been quiet '
 'but Had time to reflect. Dog walk, finish a jigsaw do the garden, learn few '
 'more guitar chords, drunk some strawberry gin and tonic and watch Lee evens '
 'on DVD. My favourite place to visit. #isolate ',
 'Remembering the 3 most amazing ladies who made me who I am! My late '
 'grandmother iris, mum carol and great grandmother Ethel. Missed but never '
 'forgotten! Happy mothers day to all those great mums out there! Love sent to '
 'all xxxx ',
 'Happy Mothers Day to everyone tuning in. This is the 4th Round game between '
 'me and @CastigersJ Live coverage on @Twitter , maybe one day @SkySportsRL or '
 'on the OurLeague app',
 "Happy Mothers Day ! We hope your mums aren't planning to do any work around "
 'the house today! Surely it can wait until next week? #plumbers '
 '#heatingspecialists #mothersday #mothersday ',
 "Happy mothers day to all those mums whos children can't be with them today. "
 'My son Dylan lives in heaven I wish I could see him for one more hug. I wish '
 'I could tell him how much I love and miss him. Huge happy mothers day to '
 'your mum too.']

Well, some tweets actually express feelings for the writers' deceased mothers. This is understandable.

We can use traditional NLP methods or deep learning methods to model the text. We will try the deep learning approach in this notebook.

Deep Learning approach with Spacy

It's recommended here that language model pretraining is one way to improve the performance of the classifier.

Spacy requires the input texts in .jsonl format for pretraining.

We get the texts from the dataframe and store them in jsonl format; more about that format here.

We can also load the test data to get some more samples for the pretraining. This will not cause data leakage, because we are not giving any labels to the model.

test_data = pd.read_csv(DATA_PATH/'test.csv', index_col=0)
test_data.head()
original_text lang retweet_count original_author
id
1.246628e+18 3. Yeah, I once cooked potatoes when I was 3 y... en 0 LToddWood
1.245898e+18 Happy Mother's Day to all the mums, step-mums,... en 0 iiarushii
1.244717e+18 I love the people from the UK, however, when I... en 0 andreaanderegg
1.245730e+18 Happy 81st Birthday Happy Mother’s Day to my m... en 1 TheBookTweeters
1.244636e+18 Happy Mothers day to all those wonderful mothe... en 0 andreaanderegg

Let's clean the test set for links as well

test_data['original_text'].replace(
    # Regex pattern to match : the text to replace with
    {r'(https?:\/\/.*|pic.*)[\r\n]*' : ''},
    regex=True, inplace=True)
test_data.shape
(1387, 4)
texts_series = pd.concat([train_data['original_text'], test_data['original_text']], axis='rows')

Let's check the length

texts_series.shape[0], train_data.shape[0]+test_data.shape[0]
(4622, 4622)

So now we can use this texts_series to create the jsonl file.

list_of_texts = [
                 # Form dictionary with 'text' key
                 {'text': value}
                 for _, value in texts_series.items()
            ]

I will use srsly to write this list of dictionaries to a jsonl file

import srsly
# saving to my Google drive
srsly.write_jsonl(DRIVE_PATH/'pretrain_texts.jsonl', list_of_texts)

We can see a few lines from the saved file.

from pprint import pprint

with Path(DRIVE_PATH/'pretrain_texts.jsonl').open() as f:
    lines = [next(f) for x in range(5)]

pprint(lines)
['{"text":"Happy #MothersDay to all you amazing mothers out there! I know '
 "it's hard not being able to see your mothers today but it's on all of us to "
 'do what we can to protect the most vulnerable members of our society. '
 '#BeatCoronaVirus "}\n',
 '{"text":"Happy Mothers Day Mum - I\'m sorry I can\'t be there to bring you '
 "Mothers day flowers & a cwtch - honestly at this point I'd walk on hot coals "
 "to be able to. But I'll be there with bells on as soon as I can be. Love you "
 'lots xxx (p.s we need more photos!) "}\n',
 '{"text":"Happy mothers day To all This doing a mothers days work. Today been '
 'quiet but Had time to reflect. Dog walk, finish a jigsaw do the garden, '
 'learn few more guitar chords, drunk some strawberry gin and tonic and watch '
 'Lee evens on DVD. My favourite place to visit. #isolate "}\n',
 '{"text":"Happy mothers day to this beautiful woman...royalty soothes you '
 'mummy jeremy and emerald and more #PrayForRoksie #UltimateLoveNG "}\n',
 '{"text":"Remembering the 3 most amazing ladies who made me who I am! My late '
 'grandmother iris, mum carol and great grandmother Ethel. Missed but never '
 'forgotten! Happy mothers day to all those great mums out there! Love sent to '
 'all xxxx "}\n']

Start Pretraining

We should download a pretrained model to use. Here I am using en_core_web_md from Spacy.

This can be confusing: why should I train a pretrained model if I can just download one? The idea is that the downloaded pretrained model was trained on a very different kind of dataset, but it already has some knowledge of how to interpret words in English sentences.

Here, however, we have a dataset of tweets, which the downloaded pretrained model may or may not have seen during its training. So we use our dataset to fine-tune the downloaded model, so that with minimal training it can start understanding the tweets right away.

!python -m spacy download en_core_web_md

Training results

%%bash
# Command to pretrain a language model
# Path to jsonl file with data
# Using md model as the base
# saving the model on my Drive folder
# training for 50 iterations with seed set to 0
python -m spacy pretrain /content/drive/My\ Drive/Spacy/Pretrained/pretrain_texts.jsonl \
    /usr/local/lib/python3.6/dist-packages/en_core_web_md/en_core_web_md-2.2.5 \
    /content/drive/My\ Drive/Spacy/Pretrained/ \
    -i 50 -s 0
ℹ Not using GPU
⚠ Output directory is not empty
It is better to use an empty directory or refer to a new output path, then the
new directory will be created for you.
✔ Saved settings to config.json
✔ Loaded input texts
✔ Loaded model
'/usr/local/lib/python3.6/dist-packages/en_core_web_md/en_core_web_md-2.2.5'

============== Pre-training tok2vec layer - starting at epoch 0 ==============
  #      # Words   Total Loss     Loss    w/s
  0       115177   114619.719   114619   7308
  0       177090   174673.695    60053   8123
  1       291933   282095.656   107421   8353
  1       354180   337893.113    55797   8156
  2       468951   432705.457    94812   8398
  2       531270   479373.527    46668   8271
  3       646206   557380.137    78006   8368
  3       708360   595962.348    38582   8151
  4       823108   662773.332    66810   8349
  4       885450   696823.672    34050   8125
  5      1000591   756743.684    59920   8254
  5      1062540   787816.266    31072   7943
  6      1177552   844198.828    56382   8380
  6      1239630   874128.219    29929   7996
  7      1354814   928725.262    54597   8291
  7      1416720   957604.215    28878   8081
  8      1531685   1010607.96    53003   8310
  8      1593810   1039022.08    28414   8006
  9      1708981   1091185.60    52163   8248
  9      1770900   1118857.03    27671   8032
 10      1885776   1169906.66    51049   8240
 10      1947990   1197361.49    27454   8015
 11      2062486   1247384.96    50023   8344
 11      2125080   1274605.26    27220   8153
 12      2239188   1323843.16    49237   8376
 12      2302170   1350702.81    26859   7941
 13      2416942   1399298.78    48595   8213
 13      2479260   1425267.50    25968   7968
 14      2594827   1473118.10    47850   8357
 14      2656350   1498307.73    25189   7879
 15      2770909   1544661.96    46354   7572
 15      2833440   1569846.84    25184   7980
 16      2948496   1615382.64    45535   8137
 16      3010530   1639715.23    24332   7848
 17      3125700   1684541.43    44826   8316
 17      3187620   1708564.64    24023   7941
 18      3302357   1753053.58    44488   8039
 18      3364710   1776999.23    23945   7234
 19      3480485   1821211.88    44212   8191
 19      3541800   1844436.19    23224   7859
 20      3657480   1887923.85    43487   8170
 20      3718890   1911031.40    23107   7902
 21      3833871   1954109.82    43078   8253
 21      3895980   1977017.92    22908   7840
 22      4011686   2019201.90    42183   8096
 22      4073070   2041539.05    22337   7906
 23      4188348   2083159.02    41619   8148
 23      4250160   2105332.99    22173   7882
 24      4364945   2146261.39    40928   8127
 24      4427250   2168323.78    22062   8017
 25      4542920   2209202.75    40878   8059
 25      4604340   2230725.28    21522   7823
 26      4719450   2271131.96    40406   8272
 26      4781430   2292648.11    21516   7986
 27      4896609   2332466.73    39818   8243
 27      4958520   2353768.18    21301   8030
 28      5073022   2392817.96    39049   8217
 28      5135610   2414267.76    21449   8032
 29      5250013   2452981.53    38713   8250
 29      5312700   2474212.04    21230   8098
 30      5427418   2512822.70    38610   8251
 30      5489790   2533517.18    20694   8083
 31      5604738   2572000.05    38482   8291
 31      5666880   2592500.56    20500   8071
 32      5781993   2630529.82    38029   8278
 32      5843970   2651013.32    20483   8080
 33      5959305   2689165.69    38152   8269
 33      6021060   2709293.04    20127   8086
 34      6136501   2746749.93    37456   8278
 34      6198150   2766661.32    19911   8105
 35      6313772   2803964.79    37303   8262
 35      6375240   2823781.55    19816   8095
 36      6490354   2860641.99    36860   8235
 36      6552330   2880476.80    19834   8056
 37      6667255   2917062.19    36585   8231
 37      6729420   2936732.01    19669   8108
 38      6844248   2972992.59    36260   8276
 38      6906510   2992718.12    19725   8074
 39      7021001   3028854.93    36136   8275
 39      7083600   3048416.73    19561   8092
 40      7199191   3084584.77    36168   8304
 40      7260690   3103795.33    19210   8085
 41      7375380   3139675.02    35879   8276
 41      7437780   3158848.65    19173   8136
 42      7552544   3194214.46    35365   8276
 42      7614870   3213532.30    19317   8072
 43      7730130   3248964.60    35432   7724
 43      7791960   3267930.77    18966   8096
 44      7906035   3302653.15    34722   8263
 44      7969050   3321900.34    19247   8103
 45      8083488   3356590.95    34690   8251
 45      8146140   3375484.65    18893   8084
 46      8261612   3410429.05    34944   8228
 46      8323230   3429043.30    18614   8121
 47      8437925   3463440.15    34396   8268
 47      8500320   3482173.21    18733   7931
 48      8615136   3516342.36    34169   8294
 48      8677410   3534982.44    18640   8098
 49      8791889   3568641.36    33658   8265
 49      8854500   3587280.92    18639   8102
⚠ Skipped 250 empty values
✔ Successfully finished pretrain

I have chosen to use the default parameters; however, one might need to change them for their own problem.

We can see from the logs that the loss value in the last iteration is 18639, but since the batch size was 3000, our data must have been split into 2 batches (there are 4622 texts), so we should also take the previous log entry into account, which shows a loss of 33658. The average of the two is 26148.5. This number might be intimidating, but the only way to check whether the pretraining actually helps is to train a model with it.

If it doesn't, we can resume the training from the model saved at the last epoch.

We keep only the last model from the pretraining.
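If we ever need to resume, spacy pretrain can be pointed at the saved weights. A minimal sketch, assuming the installed spaCy 2.x exposes the --init-tok2vec and --epoch-start options for pretrain (check python -m spacy pretrain --help before relying on this):

%%bash
# Hypothetical resume of pretraining from the last saved weights (model49.bin);
# flags assumed, verify them against `spacy pretrain --help`
python -m spacy pretrain /content/drive/My\ Drive/Spacy/Pretrained/pretrain_texts.jsonl \
    /usr/local/lib/python3.6/dist-packages/en_core_web_md/en_core_web_md-2.2.5 \
    /content/drive/My\ Drive/Spacy/Pretrained/ \
    -i 50 -s 0 \
    --init-tok2vec /content/drive/My\ Drive/Spacy/Pretrained/model49.bin \
    --epoch-start 50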

Let's train a text classifier with Spacy

Text classifier with Spacy

Now that we have a pretrained model, we need to prepare the data for training the text classifier. Let's have a look at the format that Spacy expects the data to be in.

Data Generation

{
   "entities": [(0, 4, "ORG")],
   "heads": [1, 1, 1, 5, 5, 2, 7, 5],
   "deps": ["nsubj", "ROOT", "prt", "quantmod", "compound", "pobj", "det", "npadvmod"],
   "tags": ["PROPN", "VERB", "ADP", "SYM", "NUM", "NUM", "DET", "NOUN"],
   "cats": {"BUSINESS": 1.0},
}

This format works for training via code, as given in the examples above. There is also another format, mentioned here.

cats is the only part we need to worry about; this must be where Spacy looks for the categories/classes.

We have three classes in our dataset

  • 0 for Neutral
  • 1 for Positive
  • -1 for Negative

and they are mutually-exclusive (There can be only one label for a sentence)

We also need to split the training data we have to training and evaluation sets so that we can see how well our model has learnt the problem.

Let's try to programmatically generate the training data from the pandas dataframe.

label_map = {1:'POSITIVE', -1:'NEGATIVE', 0:'NEUTRAL'}

We need a list of tuples of the text and the annotation details in a dictionary, as mentioned above.

train_json = [
    # Get the text from dataframe row
    (tweet.original_text,
     {'cats':{
            
            label_map[tweet.sentiment_class]:1.0
            }
        })
    for tweet in train_data[['original_text', 'sentiment_class']].itertuples(index=False, name='Tweet')
]
train_json[0]
("Happy #MothersDay to all you amazing mothers out there! I know it's hard not being able to see your mothers today but it's on all of us to do what we can to protect the most vulnerable members of our society. #BeatCoronaVirus ",
 {'cats': {'NEUTRAL': 1.0}})

Now we will split the training data

from sklearn.model_selection import train_test_split

# Stratified split with labels
train_split, eval_split = train_test_split(train_json, test_size=0.2,
                                           stratify=train_data['sentiment_class'])
len(train_split), len(eval_split)
(2588, 647)

We should save them as json files to give them as input to the command line train utility in spacy.

import json

with Path(DRIVE_PATH/'train_clas.json').open('w') as f:
    json.dump(train_split, f)

with Path(DRIVE_PATH/'eval_clas.json').open('w') as f:
    json.dump(eval_split, f)

Validate data input for spacy

Now we should check whether we have enough data to train the model with the spacy train CLI command; for that I will use Spacy's debug-data CLI command.

!python -m spacy debug-data -h
usage: spacy debug-data [-h] [-tm None] [-b None] [-p tagger,parser,ner] [-IW]
                        [-V] [-NF]
                        lang train_path dev_path

    Analyze, debug and validate your training and development data, get useful
    stats, and find problems like invalid entity annotations, cyclic
    dependencies, low data labels and more.
    

positional arguments:
  lang                  model language
  train_path            location of JSON-formatted training data
  dev_path              location of JSON-formatted development data

optional arguments:
  -h, --help            show this help message and exit
  -tm None, --tag-map-path None
                        Location of JSON-formatted tag map
  -b None, --base-model None
                        name of model to update (optional)
  -p tagger,parser,ner, --pipeline tagger,parser,ner
                        Comma-separated names of pipeline components to train
  -IW, --ignore-warnings
                        Ignore warnings, only show stats and errors
  -V, --verbose         Print additional information and explanations
  -NF, --no-format      Don't pretty-print the results
%%bash
(python -m spacy debug-data en \
    /content/drive/My\ Drive/Spacy/Pretrained/train_clas.json \
    /content/drive/My\ Drive/Spacy/Pretrained/eval_clas.json \
    -p 'textcat' \
)

=========================== Data format validation ===========================
✔ Corpus is loadable

=============================== Training stats ===============================
Training pipeline: textcat
Starting with blank model 'en'
0 training docs
0 evaluation docs
✘ No evaluation docs
✔ No overlap between training and evaluation data
✘ Low number of examples to train from a blank model (0)

============================== Vocab & Vectors ==============================
ℹ 0 total words in the data (0 unique)
ℹ No word vectors present in the model

============================ Text Classification ============================
ℹ Text Classification: 0 new label(s), 0 existing label(s)
ℹ The train data contains only instances with mutually-exclusive
classes.

================================== Summary ==================================
✔ 2 checks passed
✘ 2 errors

Data Generation (again)

There must be something I missed. I asked a question on Stack Overflow regarding this; it turns out we need to produce the .jsonl format (again) and use the script provided in the spaCy repo to convert it to the required json format for training. So I need to change the data generation a little bit.

train_jsonl = [
    # Get the text from dataframe row
    {'text': tweet.original_text,
     'cats': {v: 1.0 if tweet.sentiment_class == k else 0.0
              for k, v in label_map.items()},

     'meta':{"id": str(tweet.Index)}
    }
    for tweet in train_data[['original_text', 'sentiment_class']].itertuples(index=True, name='Tweet')
]
train_jsonl[0]
{'cats': {'NEGATIVE': 0.0, 'NEUTRAL': 1.0, 'POSITIVE': 0.0},
 'meta': {'id': '1.24502457848689e+18'},
 'text': "Happy #MothersDay to all you amazing mothers out there! I know it's hard not being able to see your mothers today but it's on all of us to do what we can to protect the most vulnerable members of our society. #BeatCoronaVirus "}

So instead of a list of tuples, I now have a list of dictionaries. We need to split again to get an evaluation set.

train_split, eval_split = train_test_split(train_jsonl, test_size=0.2,
                                           stratify=train_data['sentiment_class'])
len(train_split), len(eval_split)
(2588, 647)

We still need to convert the jsonl files to the required json format; for that I will use the script named textcatjsonl_to_trainjson.py from this repo. Let's download the script from the repo.

!wget -O script.py https://raw.githubusercontent.com/explosion/spaCy/master/examples/training/textcat_example_data/textcatjsonl_to_trainjson.py
--2020-05-30 07:44:50--  https://raw.githubusercontent.com/explosion/spaCy/master/examples/training/textcat_example_data/textcatjsonl_to_trainjson.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1542 (1.5K) [text/plain]
Saving to: ‘script.py’

script.py           100%[===================>]   1.51K  --.-KB/s    in 0s      

2020-05-30 07:44:51 (17.2 MB/s) - ‘script.py’ saved [1542/1542]

%%bash
python script.py -m en /content/drive/My\ Drive/Spacy/train_texts.jsonl /content/drive/My\ Drive/Spacy
python script.py -m en /content/drive/My\ Drive/Spacy/eval_texts.jsonl /content/drive/My\ Drive/Spacy

Let's try to debug again

Validate (again)
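
The debug-data invocation is the same as before, only pointed at the json files produced by the conversion script:

%%bash
(python -m spacy debug-data en \
    /content/drive/My\ Drive/Spacy/train_texts.json \
    /content/drive/My\ Drive/Spacy/eval_texts.json \
    -p 'textcat' \
)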


=========================== Data format validation ===========================
✔ Corpus is loadable

=============================== Training stats ===============================
Training pipeline: textcat
Starting with blank model 'en'
2584 training docs
647 evaluation docs
⚠ 5 training examples also in evaluation data

============================== Vocab & Vectors ==============================
ℹ 98859 total words in the data (10688 unique)
ℹ No word vectors present in the model

============================ Text Classification ============================
ℹ Text Classification: 3 new label(s), 0 existing label(s)
ℹ The train data contains only instances with mutually-exclusive
classes.

================================== Summary ==================================
✔ 1 check passed
⚠ 1 warning

It worked! Thanks to the answerer of this question, we now know that our data format is correct. It turns out there is also another command to convert our files to spacy's JSON format, which is mentioned here.

The output is pointing out that the evaluation set has some data leakage. I will try to remove that now.

new_eval = [annot for annot in eval_split if all([annot['text'] != t['text'] for t in train_split])]
len(new_eval), len(eval_split)
(641, 647)

The warning said 5 training examples were also in the evaluation data, but we removed six here; anyway, after re-exporting the cleaned evaluation set (sketched below), let's validate the data again.
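A sketch of that re-export, assuming the cleaned evaluation set was written out and converted the same way as before:

# Assumed step: overwrite the evaluation jsonl with the de-duplicated set
srsly.write_jsonl(DRIVE_PATH.parent/'eval_texts.jsonl', new_eval)

and then re-run the conversion script on it:

%%bash
python script.py -m en /content/drive/My\ Drive/Spacy/eval_texts.jsonl /content/drive/My\ Drive/Spacy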

%%bash
(python -m spacy debug-data en \
    /content/drive/My\ Drive/Spacy/train_texts.json \
    /content/drive/My\ Drive/Spacy/eval_texts.json \
    -p 'textcat' \
)

=========================== Data format validation ===========================
✔ Corpus is loadable

=============================== Training stats ===============================
Training pipeline: textcat
Starting with blank model 'en'
2584 training docs
641 evaluation docs
✔ No overlap between training and evaluation data

============================== Vocab & Vectors ==============================
ℹ 98859 total words in the data (10688 unique)
ℹ No word vectors present in the model

============================ Text Classification ============================
ℹ Text Classification: 3 new label(s), 0 existing label(s)
ℹ The train data contains only instances with mutually-exclusive
classes.

================================== Summary ==================================
✔ 2 checks passed

We are all set to start training now!

Classifier Training

I have put together the training command for the CLI. Please refer to the comments for details, in the order of the arguments given here.

%%bash
## Argument info
# Language of text in which the Model is going to be trained
# Path to store model
# Training data json path
# Evaluation data json path
# Pipeline components that we are going to train
# Number of iterations in total
# Number of iterations to wait for improvement in eval accuracy (early stopping)
# Pretrained model to start with
# version
# Augmentation for data(2 params)
# Model Architecture for text classifier (cnn + bow)
(python -m spacy train \
    en \
    -b en_core_web_sm \
    /content/drive/My\ Drive/Spacy/Classifier \
    /content/drive/My\ Drive/Spacy/train_texts.json \
    /content/drive/My\ Drive/Spacy/train_texts.json \
    -p "textcat" \
    -n 100 \
    -ne 10 \
    -t2v /content/drive/My\ Drive/Spacy/Pretrained/fifty_iter/model49.bin \
    -V 0.1 \
    -nl 0.1 \
    -ovl 0.1)
Training pipeline: ['textcat']
Starting with base model 'en_core_web_sm'
Adding component to base model 'textcat'
Counting training words (limit=0)
Loaded pretrained tok2vec for: []
Textcat evaluation score: F1-score macro-averaged across the labels 'POSITIVE,
NEGATIVE, NEUTRAL'

Itn  Textcat Loss  Textcat  Token %  CPU WPS
---  ------------  -------  -------  -------
  1        26.738   39.853  100.000   177034
  2         5.179   65.120  100.000   157933
  3         1.483   76.615  100.000   178008
  4         0.686   83.266  100.000   177567
  5         0.288   86.236  100.000   169033
  6         0.151   88.381  100.000   176679
  7         0.090   90.099  100.000   166485
  8         0.057   91.000  100.000   171279
  9         0.135   92.472  100.000   175907
 10         0.028   93.237  100.000   171838
 11         0.023   94.147  100.000   175174
 12         0.022   94.729  100.000   155840
 13         0.021   95.248  100.000   161975
 14         0.021   95.485  100.000   168029
 15         0.019   95.980  100.000   161440
 16         0.018   96.226  100.000   167550
 17         0.019   96.713  100.000   172607
 18         0.017   96.849  100.000   169682
 19         0.017   97.026  100.000   167330
 20         0.015   97.299  100.000   173145
 21         0.016   97.405  100.000   173020
 22         0.015   97.526  100.000   165310
 23         0.014   97.704  100.000   165994
 24         0.013   97.865  100.000   176089
 25         0.013   98.106  100.000   172153
 26         0.013   98.201  100.000   172878
 27         0.013   98.241  100.000   175909
 28         0.012   98.320  100.000   170099
 29         0.013   98.400  100.000   175274
 30         0.012   98.481  100.000   170135
 31         0.012   98.521  100.000   164726
 32         0.011   98.536  100.000   171204
 33         0.011   98.536  100.000   163467
 34         0.011   98.576  100.000   150728
 35         0.011   98.696  100.000   172780
 36         0.010   98.735  100.000   163459
 37         0.010   98.695  100.000   162075
 38         0.010   98.750  100.000   165827
 39         0.010   98.790  100.000   165852
 40         0.010   98.830  100.000   174490
 41         0.009   98.870  100.000   165485
 42         0.010   98.990  100.000   164896
 43         0.009   99.045  100.000   172563
 44         0.008   99.045  100.000   169908
 45         0.009   99.005  100.000   152600
 46         0.008   99.084  100.000   166329
 47         0.009   99.084  100.000   173841
 48         0.008   99.164  100.000   163433
 49         0.008   99.203  100.000   162648
 50         0.008   99.258  100.000   177108
 51         0.009   99.298  100.000   173468
 52         0.008   99.298  100.000   169904
 53         0.008   99.298  100.000   171979
 54         0.008   99.298  100.000   166437
 55         0.008   99.298  100.000   170520
 56         0.007   99.337  100.000   172712
 57         0.007   99.337  100.000   174966
 58         0.007   99.392  100.000   173173
 59         0.008   99.392  100.000   173910
 60         0.007   99.392  100.000   169447
 61         0.007   99.431  100.000   161931
 62         0.007   99.471  100.000   106123
 63         0.007   99.471  100.000   177625
 64         0.007   99.511  100.000   172946
 65         0.007   99.511  100.000   173579
 66         0.007   99.511  100.000   172204
 67         0.007   99.550  100.000   172994
 68         0.006   99.550  100.000   174403
 69         0.007   99.590  100.000   173900
 70         0.006   99.630  100.000   169824
 71         0.007   99.630  100.000   171172
 72         0.006   99.669  100.000   172633
 73         0.006   99.669  100.000   159052
 74         0.007   99.669  100.000   174377
 75         0.007   99.669  100.000   163376
 76         0.006   99.669  100.000   174366
 77         0.007   99.669  100.000   175517
 78         0.007   99.669  100.000   175583
 79         0.006   99.669  100.000   174024
 80         0.006   99.669  100.000   174381
 81         0.006   99.669  100.000   177120
 82         0.006   99.708  100.000   175032
 83         0.006   99.708  100.000   173298
 84         0.007   99.708  100.000   171622
 85         0.006   99.709  100.000   163705
 86         0.006   99.709  100.000   175330
 87         0.006   99.709  100.000   178355
 88         0.006   99.709  100.000   170868
 89         0.006   99.709  100.000   164401
 90         0.005   99.709  100.000   173884
 91         0.006   99.709  100.000   159754
 92         0.006   99.709  100.000   177335
 93         0.006   99.709  100.000   169868
 94         0.006   99.709  100.000   168164
 95         0.005   99.709  100.000   151894
 96         0.006   99.709  100.000   171580
 97         0.005   99.709  100.000   169471
 98         0.005   99.724  100.000   156458
 99         0.005   99.724  100.000   168167
100         0.006   99.724  100.000   172201
✔ Saved model to output directory
/content/drive/My Drive/Spacy/Classifier/model-final
✔ Created best model
/content/drive/My Drive/Spacy/Classifier/model-best
                                                        

I also tried to train without the base model en_core_web_sm; the logs for that run are below. The results are not very different, and the evaluation metrics are through the roof. We need to predict on the test data and submit to the competition for a better picture of the model.

%%bash
## Argument info
# Language of text in which the Model is going to be trained
# Path to store model
# Training data json path
# Evaluation data json path
# Pipeline components that we are going to train
# Number of iterations in total
# Number of iterations to wait for improvement in eval accuracy (early stopping)
# Pretrained model to start with
# version
# Augmentation for data(2 params)
# Model Architecture for text classifier (cnn + bow)
(python -m spacy train \
    en \
    /content/drive/My\ Drive/Spacy/Classifier_without_using_websm \
    /content/drive/My\ Drive/Spacy/train_texts.json \
    /content/drive/My\ Drive/Spacy/train_texts.json \
    -p "textcat" \
    -n 100 \
    -ne 10 \
    -t2v /content/drive/My\ Drive/Spacy/Pretrained/fifty_iter/model49.bin \
    -V 0.1 \
    -nl 0.1 \
    -ovl 0.1)
✔ Created output directory: /content/drive/My
Drive/Spacy/Classifier_without_using_websm
Training pipeline: ['textcat']
Starting with blank model 'en'
Counting training words (limit=0)
Loaded pretrained tok2vec for: []
Textcat evaluation score: F1-score macro-averaged across the labels 'POSITIVE,
NEGATIVE, NEUTRAL'

Itn  Textcat Loss  Textcat  Token %  CPU WPS
---  ------------  -------  -------  -------
  1        26.755   40.980  100.000   166278
  2         5.293   65.846  100.000   172083
  3         1.506   76.992  100.000   175595
  4         0.695   83.314  100.000   173543
  5         0.293   86.284  100.000   172609
  6         0.156   88.784  100.000   171486
  7         0.091   90.136  100.000   161118
  8         0.056   91.761  100.000   156752
  9         0.112   92.442  100.000   167948
 10         0.028   93.329  100.000   162446
 11         0.024   94.144  100.000   165753
 12         0.022   95.206  100.000   168336
 13         0.021   95.769  100.000   161408
 14         0.020   96.150  100.000   162562
 15         0.019   96.474  100.000   163309
 16         0.018   96.775  100.000   168399
 17         0.018   97.140  100.000   169412
 18         0.017   97.357  100.000   171364
 19         0.017   97.503  100.000   172552
 20         0.016   97.584  100.000   167923
 21         0.016   97.678  100.000   168228
 22         0.015   97.934  100.000   158830
 23         0.014   98.055  100.000   170587
 24         0.013   98.216  100.000   161772
 25         0.014   98.256  100.000   160948
 26         0.013   98.296  100.000   163401
 27         0.013   98.391  100.000   168392
 28         0.012   98.351  100.000   162147
 29         0.012   98.391  100.000   171460
 30         0.012   98.511  100.000   171279
 31         0.012   98.511  100.000   161304
 32         0.011   98.511  100.000   171576
 33         0.011   98.511  100.000   171248
 34         0.011   98.591  100.000   166902
 35         0.010   98.710  100.000   164750
 36         0.011   98.830  100.000   164097
 37         0.011   98.790  100.000   170317
 38         0.010   98.790  100.000   163521
 39         0.010   98.830  100.000   162378
 40         0.009   98.964  100.000   164281
 41         0.009   98.964  100.000   173645
 42         0.011   99.004  100.000   165681
 43         0.009   99.044  100.000   165916
 44         0.009   99.044  100.000   168503
 45         0.008   99.044  100.000   166608
 46         0.008   99.123  100.000   170394
 47         0.009   99.084  100.000   171932
 48         0.008   99.124  100.000   172888
 49         0.009   99.084  100.000   169469
 50         0.009   99.084  100.000   167170
 51         0.008   99.084  100.000   169762
 52         0.008   99.124  100.000   166178
 53         0.008   99.124  100.000   161415
 54         0.008   99.164  100.000   164241
 55         0.008   99.164  100.000   172629
 56         0.008   99.164  100.000   164923
 57         0.008   99.243  100.000   160153
 58         0.007   99.243  100.000   171699
 59         0.007   99.283  100.000   165604
 60         0.008   99.323  100.000   161672
 61         0.007   99.362  100.000   157016
 62         0.007   99.417  100.000   171005
 63         0.007   99.417  100.000   168709
 64         0.007   99.417  100.000   170886
 65         0.007   99.417  100.000   164144
 66         0.007   99.417  100.000   154789
 67         0.007   99.417  100.000   162214
 68         0.006   99.457  100.000   164467
 69         0.006   99.457  100.000   169052
 70         0.006   99.496  100.000   168125
 71         0.007   99.496  100.000   164085
 72         0.006   99.575  100.000   163078
 73         0.006   99.575  100.000   162955
 74         0.006   99.575  100.000   166206
 75         0.007   99.575  100.000   164477
 76         0.006   99.575  100.000   169814
 77         0.006   99.575  100.000   162547
 78         0.006   99.575  100.000   168980
 79         0.007   99.575  100.000   172534
 80         0.006   99.575  100.000   161797
 81         0.007   99.575  100.000   162510
 82         0.006   99.575  100.000   172787
 83         0.005   99.535  100.000   159187
 84         0.006   99.535  100.000   168200
 85         0.005   99.614  100.000   167757
 86         0.006   99.614  100.000   158842
 87         0.006   99.654  100.000   166849
 88         0.005   99.654  100.000   162507
 89         0.006   99.654  100.000   167156
 90         0.005   99.654  100.000    97872
 91         0.006   99.654  100.000   162397
 92         0.006   99.708  100.000   168693
 93         0.005   99.708  100.000   167645
 94         0.005   99.708  100.000   163485
 95         0.006   99.708  100.000   171732
 96         0.005   99.708  100.000   165686
 97         0.005   99.708  100.000   167604
 98         0.005   99.708  100.000   166435
 99         0.005   99.708  100.000   161645
100         0.005   99.708  100.000   171467
✔ Saved model to output directory
/content/drive/My Drive/Spacy/Classifier_without_using_websm/model-final
✔ Created best model
/content/drive/My Drive/Spacy/Classifier_without_using_websm/model-best
                                                        

Prediction on test data

test_data = pd.read_csv(DATA_PATH/'test.csv', index_col=0)
test_data.head()
original_text lang retweet_count original_author
id
1.246628e+18 3. Yeah, I once cooked potatoes when I was 3 y... en 0 LToddWood
1.245898e+18 Happy Mother's Day to all the mums, step-mums,... en 0 iiarushii
1.244717e+18 I love the people from the UK, however, when I... en 0 andreaanderegg
1.245730e+18 Happy 81st Birthday Happy Mother’s Day to my m... en 1 TheBookTweeters
1.244636e+18 Happy Mothers day to all those wonderful mothe... en 0 andreaanderegg

Clean test data

We will clean the test data of links with regex as well.

test_data['original_text'].replace(
    # Regex pattern to match : the text to replace with
    {r'(https?:\/\/.*|pic.*)[\r\n]*' : ''},
    regex=True, inplace=True)
test_data.shape
(1387, 4)
list_of_test_texts = test_data['original_text'].tolist()

Let's load the Spacy model from our training

import spacy
textcat_mod = spacy.load(DRIVE_PATH.parent/'Classifier/model-best')

I will try to speed up the prediction by using multithreading, as mentioned here.

d = textcat_mod(list_of_test_texts[0])
d.cats
{'NEGATIVE': 0.020245620980858803,
 'NEUTRAL': 0.9727445840835571,
 'POSITIVE': 0.007009787950664759}
max(d.cats, key=lambda x: d.cats[x])
'NEUTRAL'
label_map = {'POSITIVE':1, 'NEGATIVE':-1, 'NEUTRAL':0}
preds = []

for doc in textcat_mod.pipe(list_of_test_texts, n_threads=4, batch_size=100):
    pred_cls = max(doc.cats, key=lambda x: doc.cats[x])
    preds.append(label_map[pred_cls])
len(preds), len(list_of_test_texts)
(1387, 1387)

Let's form the submission

sub_df = pd.DataFrame(
     preds,
     index=test_data.index,
     columns=['sentiment_class']
)
sub_df.shape
(1387, 1)
sub_df.head()
sentiment_class
id
1.246628e+18 0
1.245898e+18 0
1.244717e+18 -1
1.245730e+18 -1
1.244636e+18 0
sub_df.to_csv(DRIVE_PATH.parent/'submission.csv')

The submitted predictions scored a mere 39/100 in weighted f1-score, that's disappointing. -_-

Let's analyze the predictions

Prediction distribution

sub_df['sentiment_class'].value_counts().plot(kind='bar')
[Bar plot of the predicted sentiment_class counts, dominated by class 0.]
sub_df['sentiment_class'].value_counts()
 0    847
 1    277
-1    263
Name: sentiment_class, dtype: int64

This looks very similar to the distribution in the train data.

train_data['sentiment_class'].value_counts()
 0    1701
-1     769
 1     765
Name: sentiment_class, dtype: int64

What could have gone wrong? I guess what I can do is try another (traditional) method. Coming up in another post.