Building a தமிழ் language model
In this notebook I try to build a language model for தமிழ் (Tamil) to help with some basic NLP tasks. I have already prepared the data for the language model in the Kaggle notebook here. A language model is useful for many tasks such as text classification, information retrieval, etc.
From what I know, a language model is a machine's way of understanding a language; technically, it is defined as a probability distribution over a sequence of words[^1]. By helping the machine understand a language, we can build text-based classifiers, chatbots, and other NLP applications in that language.
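To make that definition concrete, the probability of a whole sequence factors into next-word predictions via the chain rule (a standard formulation, not specific to this notebook):

$$P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})$$

The model we train below learns exactly these conditional next-word probabilities.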
I recall the times when I was going to school: I was given language lessons for English and தமிழ். The languages differ in grammar, dialects, sounds, etc., but after some initial lessons about the words and letters of each language, both were taught to everyone in the same manner.
We would have lessons in textbooks, most of which were stories, biographies, and history. Most of the exercises at the end of each lesson were:
- Fill in the blanks, like "The ____ rises in the east." or "சூரியன் உதிக்கும் திசை _____" ("The direction in which the sun rises is _____").
- Write short descriptive answers for questions based on the lesson.
- Maybe longer essays on general what-if scenarios from the lesson.
We know that answering them requires a decent understanding of the language's grammar, which, in turn, is taught to children by making them read and write the questions and answers.
Now, how can we teach a machine to understand and learn a language? We had textbooks to read and learn from, so the machine also needs data, like our textbooks (but not necessarily the same ones we learn with), to learn the language.
So how can we ensure that the machine is learning properly? We test it with fill-in-the-blanks kinds of exercises and let it guess the next possible word for the sequence of words we give it. It will not be easy for the machine, but even a modest amount of learning is enough to use the model for other purposes.
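To illustrate this "guess the next word" idea, here is a minimal sketch of a counting-based bigram predictor on a toy corpus. This is a hand-rolled illustration for intuition only; the model we actually train below is a neural network, not a count-based one:

```python
from collections import Counter, defaultdict

# Toy corpus; a real model would learn from the Wikipedia articles loaded below
corpus = "the sun rises in the east . the sun sets in the west .".split()

# Count how often each word follows each other word (bigram counts)
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def guess_next(word):
    """Fill in the blank: return the most likely next word after `word`."""
    counts = following[word]
    return counts.most_common(1)[0][0] if counts else None

print(guess_next("the"))  # 'sun', the most frequent continuation in the toy corpus
```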
We cannot give the raw text data directly to the model; we need to convert it into sequences of tokens so that it can learn the flow of the words.
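As a rough picture of that conversion (again a hand-rolled sketch; the actual pipeline below uses a SentencePiece subword tokenizer via SPProcessor), text becomes a sequence of numeric token IDs like this:

```python
# Build a tiny vocabulary and numericalize a sentence (illustration only)
sentence = "the sun rises in the east".split()

# Assign each unique token an integer ID
vocab = {tok: i for i, tok in enumerate(sorted(set(sentence)))}
ids = [vocab[tok] for tok in sentence]

print(vocab)  # {'east': 0, 'in': 1, 'rises': 2, 'sun': 3, 'the': 4}
print(ids)    # [4, 3, 2, 1, 4, 0], the numeric sequence the model actually sees
```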
import pandas as pd
from fastai.text import *  # fastai v1; DATA_PATH is assumed defined in an earlier cell
LANG_DATA = pd.read_csv(DATA_PATH/'filtered_data.csv.tar.gz', index_col=[0])
LANG_DATA.head()
We have the url, article_id, and title columns as additional information about the text. Let's check the average length of the article text.
We have 131,162 articles in total.
# Drop rows with missing text, then inspect what remains
LANG_DATA.dropna(axis='rows', inplace=True)
LANG_DATA.info()
The average length of each article is 1370 words. We had one empty article, I suppose, which the dropna above removed.
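For reference, that average can be computed directly from the dataframe; a quick sketch, assuming the article body lives in the 'text' column as used below:

```python
# Average article length in (whitespace-separated) words
avg_len = LANG_DATA['text'].str.split().str.len().mean()
print(f'{len(LANG_DATA)} articles, {avg_len:.0f} words on average')
```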
# SentencePiece processor using the pretrained Tamil tokenizer model and vocab
processor = SPProcessor(lang='ta', sp_model=DATA_PATH/'tamil_tok.model', sp_vocab=DATA_PATH/'tamil_tok.vocab')
Set the batch size.
bs = 16
data_lm = (TextList.from_df(LANG_DATA, path=DATA_PATH, cols='text', processor=processor)
           # Hold out 10% of the articles at random for validation
           .split_by_rand_pct(0.1)
           # Label for a language model: this is also where the raw text is
           # tokenized and numericalized into sequences by the processor
           .label_for_lm()
           .databunch(bs=bs))
data_lm.sanity_check()
Let's save the language model data to skip the processing above next time.
data_lm.save(DATA_PATH/'data_lm.pkl')
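Next time, the saved databunch can be reloaded instead of re-running the processing; with the fastai v1 API that would look like:

```python
# Reload the processed language-model data from disk
data_lm = load_data(DATA_PATH, 'data_lm.pkl', bs=bs)
```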
Let's have a look at the tokenized data from the SentencePiece tokenizer.
data_lm.show_batch()
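If you want to inspect the subword pieces outside fastai, the underlying model can also be loaded with the sentencepiece library directly (a hypothetical check, reusing the tokenizer files from above):

```python
import sentencepiece as spm

# Load the same SentencePiece model used by SPProcessor
sp = spm.SentencePieceProcessor()
sp.load(str(DATA_PATH/'tamil_tok.model'))

# Show how a Tamil word is split into subword pieces
print(sp.encode_as_pieces('தமிழ்'))
```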
Initialize the model
config = awd_lstm_lm_config.copy()
config['qrnn'] = True     # swap the LSTM layers for faster QRNN layers
config['n_hid'] = 1550    # hidden units per layer
config['n_layers'] = 4    # number of recurrent layers
# Perplexity is the exponential of the cross-entropy loss;
# lower values mean the model narrows down the choice of the
# next word from its vocabulary more effectively.
perplexity = Perplexity()
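For reference, over a validation sequence of $N$ tokens the standard definition, consistent with the comment above, is:

$$\text{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log P(w_i \mid w_1, \ldots, w_{i-1})\right)$$

A perplexity of $k$ roughly means the model is, on average, as uncertain as if it were choosing uniformly among $k$ words at each step.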
# Train from scratch (no pretrained Tamil weights) in mixed precision
learn = language_model_learner(data_lm, arch=AWD_LSTM, config=config,
                               pretrained=False,
                               metrics=[accuracy, perplexity]).to_fp16()
# Clip gradients to stabilize training
learn.clip = 0.1
learn.model_dir = DATA_PATH
# Run the learning-rate finder and plot loss against learning rate
learn.lr_find()
learn.recorder.plot(suggestion=True)
Get the suggested learning rate.
min_grad_lr = learn.recorder.min_grad_lr
min_grad_lr
learn.fit_one_cycle(10, min_grad_lr,
                    # One-cycle schedule tweaks; these momentums are just an experiment
                    div_factor=10.0, pct_start=0.8, moms=(0.75, 0.65),
                    callbacks=[SaveModelCallback(learn, every='improvement', monitor='perplexity', name='best_st1'),
                               CSVLogger(learn, filename=DATA_PATH/'history', append=True)])
# Restore the best checkpoint, then save the model and its encoder
learn.load('best_st1');
learn.save('ta-wiki-stage1')
learn.save_encoder('ta-wiki-enc-stage1')
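The saved encoder is what you would reuse later when fine-tuning a downstream classifier. Before wrapping up, the model can be sanity-checked by letting it continue a prompt; a quick sketch with the fastai v1 LanguageLearner API, where the prompt and sampling temperature are arbitrary choices:

```python
# Generate a 20-word continuation of a Tamil prompt
print(learn.predict('தமிழ் மொழி', n_words=20, temperature=0.8))
```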
You can chop and change the parameters to get a better model. You can find the latest run of the notebook on Kaggle; please upvote it there if you liked this.