Building a தமிழ் language model
In this notebook I try to build a language model for தமிழ் (Tamil) to help with some basic NLP tasks. I have already prepared the data for the language model in the Kaggle notebook here. A language model is useful for many tasks such as text classification, information retrieval, etc.
From what I know, a language model is a machine's way of understanding a language; technically, it is defined as a probability distribution over a sequence of words[^1]. By helping the machine understand a language, we can build text-based classifiers, chatbots, and other NLP applications in that language.
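To make that definition concrete, the probability of a whole sequence factors into next-word predictions via the chain rule (a standard formulation, not specific to this notebook):

$$P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})$$

The model we train below learns exactly these conditional next-word probabilities.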
I recall the times when I was going to school: I was given language lessons for English and தமிழ். The languages differ in grammar, dialects, sounds, etc., but after some initial lessons about the words and letters of each language, both were taught to everyone in the same manner.
We would have lessons in textbooks, most of which were stories, biographies, and history. Most of the exercises at the end of each lesson were:
- Fill in the blanks, like "The ____ rises in the east." or "சூரியன் உதிக்கும் திசை _____" ("The direction in which the sun rises is _____").
- Write short descriptive answers for questions based on the lesson.
- Maybe longer essays on general what-if scenarios from the lesson.
We know that answering them requires a decent understanding of the language's grammar, which, in turn, is taught to children by making them read and write the questions and answers.
Now, how can we teach a machine to understand and learn a language? We had textbooks to read and learn from, so the machine also needs data, like our textbooks (but not necessarily the same ones we learn with), to learn the language.
So how can we ensure that the machine is learning properly? We test it with fill-in-the-blanks kinds of exercises and let it guess the next possible word for the sequence of words we give it. It will not be easy for the machine, but even a modest amount of learning is enough to use the model for other purposes.
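To illustrate this "guess the next word" idea, here is a minimal sketch of a counting-based bigram predictor on a toy corpus. This is a hand-rolled illustration for intuition only; the model we actually train below is a neural network, not a count-based one:

```python
from collections import Counter, defaultdict

# Toy corpus; a real model would learn from the Wikipedia articles loaded below
corpus = "the sun rises in the east . the sun sets in the west .".split()

# Count how often each word follows each other word (bigram counts)
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def guess_next(word):
    """Fill in the blank: return the most likely next word after `word`."""
    counts = following[word]
    return counts.most_common(1)[0][0] if counts else None

print(guess_next("the"))  # 'sun', the most frequent continuation in the toy corpus
```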
We cannot give the raw text data directly to the model; we need to convert it into sequences of tokens so that it can learn the flow of the words.
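As a rough picture of that conversion (again a hand-rolled sketch; the actual pipeline below uses a SentencePiece subword tokenizer via SPProcessor), text becomes a sequence of numeric token IDs like this:

```python
# Build a tiny vocabulary and numericalize a sentence (illustration only)
sentence = "the sun rises in the east".split()

# Assign each unique token an integer ID
vocab = {tok: i for i, tok in enumerate(sorted(set(sentence)))}
ids = [vocab[tok] for tok in sentence]

print(vocab)  # {'east': 0, 'in': 1, 'rises': 2, 'sun': 3, 'the': 4}
print(ids)    # [4, 3, 2, 1, 4, 0], the numeric sequence the model actually sees
```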
import pandas as pd
from fastai.text import *  # fastai v1; DATA_PATH is assumed defined in an earlier cell
LANG_DATA = pd.read_csv(DATA_PATH/'filtered_data.csv.tar.gz', index_col=[0])
LANG_DATA.head()
We have the url, article_id, and title columns as additional information about the text. Let's check the average length of the article text.
We have 131,162 articles in total.
# Drop rows with missing text, then inspect what remains
LANG_DATA.dropna(axis='rows', inplace=True)
LANG_DATA.info()
The average length of each article is 1370 words. We had one empty article, I suppose, which the dropna above removed.
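For reference, that average can be computed directly from the dataframe; a quick sketch, assuming the article body lives in the 'text' column as used below:

```python
# Average article length in (whitespace-separated) words
avg_len = LANG_DATA['text'].str.split().str.len().mean()
print(f'{len(LANG_DATA)} articles, {avg_len:.0f} words on average')
```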
# SentencePiece processor using the pretrained Tamil tokenizer model and vocab
processor = SPProcessor(lang='ta', sp_model=DATA_PATH/'tamil_tok.model', sp_vocab=DATA_PATH/'tamil_tok.vocab')
Set the batch size.
bs = 16
data_lm = (TextList.from_df(LANG_DATA, path=DATA_PATH, cols='text', processor=processor)
           # Hold out 10% of the articles at random for validation
           .split_by_rand_pct(0.1)
           # Label for a language model: this is also where the raw text is
           # tokenized and numericalized into sequences by the processor
           .label_for_lm()
           .databunch(bs=bs))
data_lm.sanity_check()
Let's save the language model data to skip the processing above next time.
data_lm.save(DATA_PATH/'data_lm.pkl')
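Next time, the saved databunch can be reloaded instead of re-running the processing; with the fastai v1 API that would look like:

```python
# Reload the processed language-model data from disk
data_lm = load_data(DATA_PATH, 'data_lm.pkl', bs=bs)
```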
Let's have a look at the tokenized data from the SentencePiece tokenizer.
data_lm.show_batch()
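If you want to inspect the subword pieces outside fastai, the underlying model can also be loaded with the sentencepiece library directly (a hypothetical check, reusing the tokenizer files from above):

```python
import sentencepiece as spm

# Load the same SentencePiece model used by SPProcessor
sp = spm.SentencePieceProcessor()
sp.load(str(DATA_PATH/'tamil_tok.model'))

# Show how a Tamil word is split into subword pieces
print(sp.encode_as_pieces('தமிழ்'))
```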
Initialize the model
config = awd_lstm_lm_config.copy()
config['qrnn'] = True     # swap the LSTM layers for faster QRNN layers
config['n_hid'] = 1550    # hidden units per layer
config['n_layers'] = 4    # number of recurrent layers
# Perplexity is the exponential of the cross-entropy loss;
# lower values mean the model narrows down the choice of the
# next word from its vocabulary more effectively.
perplexity = Perplexity()
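For reference, over a validation sequence of $N$ tokens the standard definition, consistent with the comment above, is:

$$\text{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log P(w_i \mid w_1, \ldots, w_{i-1})\right)$$

A perplexity of $k$ roughly means the model is, on average, as uncertain as if it were choosing uniformly among $k$ words at each step.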
# Train from scratch (no pretrained Tamil weights) in mixed precision
learn = language_model_learner(data_lm, arch=AWD_LSTM, config=config,
                               pretrained=False,
                               metrics=[accuracy, perplexity]).to_fp16()
# Clip gradients to stabilize training
learn.clip = 0.1
learn.model_dir = DATA_PATH
# Run the learning-rate finder and plot loss against learning rate
learn.lr_find()
learn.recorder.plot(suggestion=True)
Get the suggested learning rate.
min_grad_lr = learn.recorder.min_grad_lr
min_grad_lr
learn.fit_one_cycle(10, min_grad_lr,
                    # One-cycle schedule tweaks; these momentums are just an experiment
                    div_factor=10.0, pct_start=0.8, moms=(0.75, 0.65),
                    callbacks=[SaveModelCallback(learn, every='improvement', monitor='perplexity', name='best_st1'),
                               CSVLogger(learn, filename=DATA_PATH/'history', append=True)])
# Restore the best checkpoint, then save the model and its encoder
learn.load('best_st1');
learn.save('ta-wiki-stage1')
learn.save_encoder('ta-wiki-enc-stage1')
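The saved encoder is what you would reuse later when fine-tuning a downstream classifier. Before wrapping up, the model can be sanity-checked by letting it continue a prompt; a quick sketch with the fastai v1 LanguageLearner API, where the prompt and sampling temperature are arbitrary choices:

```python
# Generate a 20-word continuation of a Tamil prompt
print(learn.predict('தமிழ் மொழி', n_words=20, temperature=0.8))
```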
You can chop and change the parameters to get a better model. You can find the latest run of the notebook on Kaggle; please upvote it there if you liked this.