Building a தமிழ் (Tamil) language tokenizer
In this notebook I try to build SentencePiece tokenizers for the தமிழ் (Tamil) language, using data extracted from a Wikipedia dump.
from pathlib import Path
import sentencepiece as spm
import pandas as pd
lang_data = pd.read_csv('../input/tamil-wiki-data-extraction/filtered_data.csv.tar.gz', index_col=[0])
lang_data.head()
lang_data.info()
OUTPUT_DIR = Path('/kaggle/working')
TEXTS_DIR = OUTPUT_DIR/'texts'
TOK_DIR = OUTPUT_DIR/'tokenizer'
# Create directories
TOK_DIR.mkdir()
TEXTS_DIR.mkdir()
According to the documentation, we can pass a list of input files as a comma-separated string, so we can store each article in its own text file and pass the file names as one comma-separated string.
for t in lang_data.itertuples():
    file_name = TEXTS_DIR/f'text_{t.Index}.txt'
    file_name.touch()
    with file_name.open('w') as f:
        f.write(t.text)
len([t for t in TEXTS_DIR.iterdir()]), lang_data.shape[0]
All the articles have been written out as text files.
Let's make a comma-separated string of the file names.
files = ','.join([str(t) for t in TEXTS_DIR.iterdir()])
files[:100]
We must find the right vocab_size for the tokenizer, and that can only be done by testing the tokenizer after building one, so let's train models with a few candidate vocabulary sizes.
for v in (8000, 16000, 20000, 30000):
    api_str = f"""--input={files} --vocab_size={v} --model_type=unigram --character_coverage=0.9995 --model_prefix={str(TOK_DIR)}/tok_{v}_size --max_sentence_length=20000"""
    print("Training with vocab set as:", v)
    spm.SentencePieceTrainer.train(api_str)
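Before cleaning up, we can do a quick sanity check right here. This is a minimal sketch, assuming a recent sentencepiece release that accepts the model_file keyword and using the first article as an arbitrary sample: it loads each trained model and prints how it segments a short piece of Tamil text, giving a first feel for how vocab_size changes the segmentation.
# Quick sanity check (sketch): load each trained model and inspect how it
# segments a short sample of Tamil text. The sample choice is arbitrary.
sample = lang_data.text.iloc[0][:200]
for v in (8000, 16000, 20000, 30000):
    sp = spm.SentencePieceProcessor(model_file=f'{TOK_DIR}/tok_{v}_size.model')
    pieces = sp.encode(sample, out_type=str)
    print(f'vocab={v}: {len(pieces)} pieces')
    print(pieces[:20])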
!rm -rf /kaggle/working/texts/
Let's test the models in another notebook; you can find the outputs in this Kaggle notebook.
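As a preview of the kind of comparison that follow-up notebook could run, here is a hedged sketch: the sample size and the average-pieces-per-article metric are my own assumptions, not part of the original workflow. It encodes a random sample of articles with each trained model and reports the mean number of pieces, since a vocabulary that is too small tends to over-fragment words.
# Rough comparison sketch (metric and sample size are assumptions): average
# number of pieces per article for each vocab size.
sample_texts = lang_data.text.sample(min(500, len(lang_data)), random_state=42).tolist()
for v in (8000, 16000, 20000, 30000):
    sp = spm.SentencePieceProcessor(model_file=f'{TOK_DIR}/tok_{v}_size.model')
    total = sum(len(sp.encode(t, out_type=str)) for t in sample_texts)
    print(f'vocab={v}: {total / len(sample_texts):.1f} pieces per article on average')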