Text mining with chemical text from CHEMDNER
This is a take-home project that I did for a company as part of their interview process.
- Prepare data
- Install required libraries
- Import annotations
- Make raw text files for pretraining
- Make json file for training
- Make json file for evaluation and development
- Pretrain spacy with texts
- Debug the training data for spacy NER model
- Train model
- Evaluate model with test data
- Misc
This blog post has the code and details of the take-home project that I did for a company as part of their interview process. It primarily uses the CHEMDNER data to build a named entity recognition (NER) model with spaCy.
The CHEMDNER corpus of chemicals and drugs and its annotation principles
- Krallinger, M. et al. The CHEMDNER corpus of chemicals and drugs and its annotation principles. J Cheminform, 2014
Get the data and create directories to organise the information.
!wget --no-check-certificate https://biocreative.bioinformatics.udel.edu/media/store/files/2014/chemdner_corpus.tar.gz
!mkdir /content/data
!tar -xvf /content/drive/MyDrive/chemdNER\ data/data.tar.gz -C /content/data
Look at the libraries currently installed.
!pip list | grep spacy
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_core_sci_lg-0.3.0.tar.gz
Since we are training on medical text, a model with vectors from a similar domain will be extremely useful, so we use the scispacy model vectors.
!pip list| grep en-core-sci-lg
import pandas as pd
from pathlib import Path
annot_path = Path('/content/data/chemdner_corpus')
export_path = Path('/content/data/spacy_annotation_data')
export_path.mkdir(exist_ok=True)
train_annot = pd.read_csv(annot_path/'training.annotations.txt', sep='\t', names=['PMID', 'Text Type', 'Start', 'End', 'Entity Text', 'Entity'])
train_annot.head()
Since an annotated entity can be in the title or in the abstract, we should be careful about which text an annotation's offsets refer to.
Let's check the annotation for one value.
start, end = train_annot.loc[train_annot['PMID'] == 21826085, ['Start', 'End']].values.tolist()[0]
start, end # start, end indices for haloperidol
train_text = pd.read_csv(annot_path/'training.abstracts.txt', sep='\t', names=['PMID', 'Title', 'Abstract'])
train_text.head()
train_annot['Text Type'].value_counts()
train_text.loc[train_text['PMID'] == 21826085, 'Abstract'].item()[start:end]
We can see that the annotations are separate for the Title and Abstract parts of the text.
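A minimal sketch of why the text type matters. The record, offsets and helper name `annotated_text` below are made up for illustration and are not rows from CHEMDNER: the same offsets pick out different strings depending on whether they are applied to the title or to the abstract.

```python
# Hypothetical record; the title, abstract and offsets are invented for illustration.
record = {
    "Title": "Effects of aspirin on platelet function",
    "Abstract": "We studied haloperidol in a cohort of patients.",
}

# Annotation rows as (text type, start, end, entity text); 'T' = title, 'A' = abstract.
annotations = [
    ("T", 11, 18, "aspirin"),
    ("A", 11, 22, "haloperidol"),
]


def annotated_text(record, text_type, start, end):
    """Slice the field that the annotation's text type points at."""
    field = "Title" if text_type == "T" else "Abstract"
    return record[field][start:end]


for text_type, start, end, expected in annotations:
    found = annotated_text(record, text_type, start, end)
    print(text_type, repr(found), found == expected)
```

Applying the abstract offsets to the title (or the other way round) returns an unrelated slice, which is why the title and abstract annotations are handled separately later on.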
import srsly
dev_text = pd.read_csv(annot_path/'development.abstracts.txt', sep='\t', names=['PMID', 'Title', 'Abstract'])
eval_text = pd.read_csv(annot_path/'evaluation.abstracts.txt', sep='\t', names=['PMID', 'Title', 'Abstract'])
texts_path = export_path/'texts'
texts_path.mkdir(exist_ok=True)
def make_jsonl(df: pd.DataFrame):
    # One record per paper: title and abstract joined by a newline
    jsonl_txts = []
    for data in df.itertuples(index=False):
        jsonl_txts.append({'text': data.Title + '\n' + data.Abstract})
    return jsonl_txts
train_txts = make_jsonl(train_text)
dev_txts = make_jsonl(dev_text)
eval_txts = make_jsonl(eval_text)
all_txts = train_txts + eval_txts + dev_txts
srsly.write_jsonl(str(texts_path/'pretrain_txts.jsonl'), train_txts[:1500])
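As a rough illustration of what ends up in the pretraining file: JSONL is one JSON object per line, here with a single `text` key. The sketch below uses only the standard library and two invented records; `to_jsonl` and `from_jsonl` are hypothetical helper names and not part of srsly.

```python
import json

# Hypothetical records standing in for the title + abstract texts.
records = [
    {"text": "Title one\nAbstract one."},
    {"text": "Title two\nAbstract two."},
]


def to_jsonl(records):
    """Serialise each record as one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


def from_jsonl(payload):
    """Parse a JSONL string back into a list of dicts."""
    return [json.loads(line) for line in payload.splitlines() if line]


payload = to_jsonl(records)
print(payload)
```

The newline between title and abstract is escaped inside the JSON string, so every record still occupies exactly one line of the file.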
# set pmid as the index
train_annot.set_index('PMID', drop=True, inplace=True)
train_text.set_index('PMID', drop=True, inplace=True)
train_data = train_text.join(train_annot)
train_data.head()
Let's check the number of annotations for each label.
train_data['Entity'].value_counts().plot(kind='bar', title='Number of annotations per label')
We have eight entity labels, and TRIVIAL has the largest number of annotations.
Now we can start generating the annotations from the dataframe. To accommodate the training format we want, we split the dataframe into Title and Abstract annotations so that we can run the same logic on both, i.e. group by text and collect all the annotations for that text in one go.
abstract_annots = train_data.loc[train_data['Text Type'] == 'A', :]
title_annots = train_data.loc[train_data['Text Type'] == 'T', :]
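The split-then-group idea can be sketched without pandas. The rows and the helper name `group_annotations` are assumptions made up for this example; the real code further down groups on the text column of the dataframe instead of the PMID.

```python
from collections import defaultdict

# Hypothetical annotation rows: (PMID, text type, start, end, label).
rows = [
    (1, "T", 0, 7, "TRIVIAL"),
    (1, "A", 10, 17, "TRIVIAL"),
    (1, "A", 30, 34, "FORMULA"),
    (2, "A", 5, 12, "FAMILY"),
]


def group_annotations(rows, text_type):
    """Collect all (start, end, label) annotations per PMID for one text type."""
    grouped = defaultdict(list)
    for pmid, kind, start, end, label in rows:
        if kind == text_type:
            grouped[pmid].append((start, end, label))
    return dict(grouped)


abstract_groups = group_annotations(rows, "A")
title_groups = group_annotations(rows, "T")
print(abstract_groups)
print(title_groups)
```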
import en_core_sci_lg
from spacy.tokens import Span
from spacy.gold import docs_to_json
import json
# load the pretrained model and other utility functions
# to generate the training data
mod = en_core_sci_lg.load()
# since we are going to fill this manually
# for annotation, disabling some pipeline
# modules may speedup the process
mod.disable_pipes(['ner'])
def extract_spacy_docs(annot_df: pd.DataFrame, ext_from: str) -> list:
    # Extracting the doc objects from a pretrained model
    # so that the tokenization matches for this domain of
    # text that we have
    annot_docs = []
    # Same texts have multiple annotations so group and take
    # one by one
    for _, annot_data in annot_df.groupby(ext_from):
        # Text for annotation
        txt = annot_data.iloc[0][ext_from]
        doc = mod(txt)
        ent_spans = []
        text_ranges = []
        for annot in annot_data.itertuples():
            start = int(annot.Start)
            end = int(annot.End)
            # Validating overlapping entity annotations
            if start not in text_ranges and end not in text_ranges:
                # Make spans with our annotations
                span = doc.char_span(start, end, label=annot.Entity)
                # It may be None for invalid annotations (wrt spaCy)
                if span is not None:
                    ent_spans.append(span)
                    text_ranges.extend(list(range(start, end + 1)))
                # Just for info
                else:
                    print(txt, txt[start:end])
        doc.ents = ent_spans
        annot_docs.append(doc)
    return annot_docs
abstract_annots = extract_spacy_docs(abstract_annots, 'Abstract')
title_annots = extract_spacy_docs(title_annots, 'Title')
train_annot_docs = abstract_annots + title_annots
train_json = docs_to_json(train_annot_docs)
spacy_annot_path = export_path/'annots'
spacy_annot_path.mkdir(exist_ok=True)
srsly.write_json(spacy_annot_path/'train_data.json', [train_json])
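The overlap check used while building the docs can be looked at in isolation. This is a simplified sketch with invented spans and a hypothetical helper name, `drop_overlaps`; it mirrors only the start/end test and leaves out the spaCy `char_span` step.

```python
def drop_overlaps(spans):
    """Keep spans in input order, skipping any whose start or end is already covered."""
    covered = set()
    kept = []
    for start, end, label in spans:
        if start in covered or end in covered:
            continue
        kept.append((start, end, label))
        # Mark every character position of the kept span, end included
        covered.update(range(start, end + 1))
    return kept


# Hypothetical annotations: the second one starts inside the first.
spans = [(0, 7, "TRIVIAL"), (3, 10, "FAMILY"), (12, 20, "SYSTEMATIC")]
print(drop_overlaps(spans))
```

One limitation of this test is that a span which fully encloses an earlier, shorter span passes, because neither its start nor its end is covered. spaCy rejects overlapping entities when `doc.ents` is assigned, so such a case would surface as an error at that point.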
We must do the same for the development and evaluation data, so let's wrap the process we used for the training data in functions.
dev_annots = pd.read_csv(annot_path/'development.annotations.txt', sep='\t',
names=['PMID', 'Text Type', 'Start', 'End', 'Entity Text', 'Entity'])
eval_annots = pd.read_csv(annot_path/'evaluation.annotations.txt', sep='\t',
names=['PMID', 'Text Type', 'Start', 'End', 'Entity Text', 'Entity'])
def prepare_df_for_spacy_annots(df1: pd.DataFrame, df2: pd.DataFrame) -> 'Abstract and Title dataframes':
    # Emulates the processing we did for training data
    df1.set_index('PMID', drop=True, inplace=True)
    df2.set_index('PMID', drop=True, inplace=True)
    df = df1.join(df2)
    abstract_annots = df.loc[df['Text Type'] == 'A', :]
    title_annots = df.loc[df['Text Type'] == 'T', :]
    return abstract_annots, title_annots
def prepare_spacy_json_from_df(df1: pd.DataFrame, df2: pd.DataFrame):
    # transform the docs to spacy json
    abstract_annots, title_annots = prepare_df_for_spacy_annots(df1, df2)
    abstract_docs = extract_spacy_docs(abstract_annots, 'Abstract')
    title_docs = extract_spacy_docs(title_annots, 'Title')
    annot_docs = abstract_docs + title_docs
    # convert the docs built from df1/df2, not the training docs
    spacy_json = docs_to_json(annot_docs)
    return spacy_json
dev_json = prepare_spacy_json_from_df(dev_text, dev_annots)
eval_json = prepare_spacy_json_from_df(eval_text, eval_annots)
srsly.write_json(spacy_annot_path/'eval_data.json', [eval_json])
srsly.write_json(spacy_annot_path/'dev_data.json', [dev_json])
We now have all the files required to make the model, so let's start by pretraining the model with the spaCy CLI.
We have imported scispacy's model vectors (en_core_sci_lg) to give us a head start.
!python -m spacy pretrain /content/data/spacy_annotation_data/texts/pretrain_txts.jsonl /usr/local/lib/python3.6/dist-packages/en_core_sci_lg/en_core_sci_lg-0.3.0 /content/data/pretrained_model -i 50 -se 10 -s 0 -bs 64
Let's check that the data is suitable for training using spaCy's debug-data command.
%%bash
python -m spacy debug-data en \
/content/drive/MyDrive/chemdNER\ data/annot_data/train_data.json \
/content/drive/MyDrive/chemdNER\ data/annot_data/dev_data.json \
-p 'ner' \
-b /usr/local/lib/python3.6/dist-packages/en_core_sci_lg/en_core_sci_lg-0.3.0
We have few annotations for NO CLASS, but since we do not have more data, we have to work with what we've got.
We only had 4884 pieces of text from the dataset, and the same texts seem to be used in all the splits (train, dev, eval), although the annotations in each differ.
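Whether the splits really share texts can be checked by intersecting their identifiers. The sketch below assumes each split is available as a list of PMIDs; the values and the helper name `split_overlap` are invented for illustration.

```python
def split_overlap(splits):
    """Return the identifiers shared by each pair of splits."""
    names = sorted(splits)
    shared = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared[(first, second)] = set(splits[first]) & set(splits[second])
    return shared


# Hypothetical PMIDs per split, not taken from the corpus.
splits = {"train": [101, 102, 103], "dev": [103, 104], "eval": [105]}
overlap = split_overlap(splits)
print(overlap)
```

An empty set for every pair would mean the splits do not share any documents.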
!python -m spacy link /usr/local/lib/python3.6/dist-packages/en_core_sci_lg/en_core_sci_lg-0.3.0 en_core_sci_lg
I tried to train with the pretrained model that we made from the raw text, but it was not working, and it seems most of the tutorials use pretraining only for text classification problems.
Anyway, I start with the vectors of scispacy's pretrained en_core_sci_lg model for a head start.
%%bash
python -m spacy train \
en \
/content/data/NER\ model \
/content/drive/MyDrive/chemdNER\ data/annot_data/train_data.json \
/content/drive/MyDrive/chemdNER\ data/annot_data/dev_data.json \
-v /usr/local/lib/python3.6/dist-packages/en_core_sci_lg/en_core_sci_lg-0.3.0 \
-p "ner" \
-n 30 \
-g 0 \
-V 0.1 \
-cW 3 -cd 4 -cw 4 -cP 3
I did not try to tune the hyperparameters and relied on the library defaults, but it has given decent results.
Let's evaluate with the evaluation data that we generated earlier
%%bash
python -m spacy evaluate \
/content/data/NER\ model/model-best \
/content/drive/MyDrive/chemdNER\ data/annot_data/eval_data.json \
-g 0 -dp /content/data/eval\ results -R
The rendered results for some of the texts are saved and can be viewed in the archive.
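For reading the NER scores, here is a small sketch of exact-match precision, recall and F1 over entity spans. It is an illustration with made-up gold and predicted spans and a hypothetical helper name, `span_prf`, not spaCy's scorer: a prediction counts as correct only when start, end and label all match.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical spans: the second prediction has the right offsets but the wrong label.
gold = [(0, 7, "TRIVIAL"), (12, 20, "SYSTEMATIC"), (25, 28, "FORMULA")]
predicted = [(0, 7, "TRIVIAL"), (12, 20, "FAMILY")]
precision, recall, f1 = span_prf(gold, predicted)
print(precision, recall, f1)
```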
Here I tried to write code for the simple annotation format, mostly for archival purposes; it can be skipped if need be.
def extract_annot_data(annot_df: pd.DataFrame, ext_from: str) -> list:
    # groups text and generates the list of entities and
    # their spans in the text
    train_data = []
    for _, annot_data in annot_df.groupby(ext_from):
        ttext = annot_data.iloc[0][ext_from]
        entity_list = []
        # print(ttext)
        text_ranges = []
        for annot in annot_data.itertuples():
            start = int(annot.Start)
            end = int(annot.End)
            if start not in text_ranges and end not in text_ranges:
                text_ranges.extend(list(range(start, end + 1)))
                entity_list.append((annot.Start, annot.End, annot.Entity))
        train_data.append((ttext, {'entities': entity_list}))
    return train_data
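# abstract_annots and title_annots were overwritten with spaCy docs earlier,
# so select the annotation rows from train_data again
abst_annots = extract_annot_data(train_data.loc[train_data['Text Type'] == 'A', :], 'Abstract')
title_annots = extract_annot_data(train_data.loc[train_data['Text Type'] == 'T', :], 'Title')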
abst_annots[0]
A sample text from the corpus
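# ttext is local to extract_annot_data, so take the text of the first annotated abstract
ttext = abst_annots[0][0]
ttext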
from spacy.gold import docs_to_json, GoldParse
A sample of the spacy json format for training models
docs_to_json([mod(ttext)])