This blog post compares spaCy and Gensim well, so I have summarized it here.
In short, “Spacy is a natural language processing library for Python designed to have fast performance, and with word embedding models built in. Gensim is a topic modelling library for Python that provides modules for training Word2Vec and other word embedding algorithms, and allows using pre-trained models.”
A distinguishing feature of spaCy is that it provides pretrained models; its website lists multiple languages: Chinese, Danish, Dutch, English, French, German, Greek, Italian, Japanese, Lithuanian, Multi-language, Norwegian Bokmål, Polish, Portuguese, Romanian, Russian, and Spanish. A fuller list is available here.
Gensim does not provide pretrained models, but it does provide pre-trained vectors trained on part of the Google News dataset (about 100 billion words); that model contains 300-dimensional vectors for 3 million words and phrases. Gensim also lets us train our own model, and a custom model is typically more powerful for domain-specific analysis. Note that a custom model trained this way can also be used with spaCy, taking advantage of the tools that spaCy provides.
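For illustration, a minimal sketch of both entry points (the model name, file paths, and toy corpus below are assumptions; the Google News vectors file must be downloaded separately, and gensim 3.x keyword names are used):

import spacy
from gensim.models import KeyedVectors, Word2Vec

# spaCy: load a pretrained pipeline that ships with word vectors
# (assumes `python -m spacy download en_core_web_md` has been run).
nlp = spacy.load('en_core_web_md')
print(nlp('revenue').vector[:5])

# Gensim: load the pre-trained Google News word2vec vectors (assumed local path).
w2v = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
print(w2v.most_similar('revenue', topn=3))

# Gensim: train a custom model on a domain corpus, then reuse its vectors in spaCy.
sentences = [['quarterly', 'revenue', 'grew'], ['net', 'income', 'fell']]  # toy corpus
custom = Word2Vec(sentences, size=300, min_count=1)  # `size=` in gensim 3.x (`vector_size=` in 4.x)
for word in custom.wv.vocab:
    nlp.vocab.set_vector(word, custom.wv[word])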
Our internal ML team has already tried spaCy for entity recognition; the following is the README file from the git repository.
spaCy-NER-IRN
Repository for training and hosting a custom NER model on IRN data using spaCy.
Overview
This repo supports spaCy v2.1’s new spacy pretrain command, which allows developers to produce a pretrained language model similar to BERT, ELMo, XLNet and others. However, spaCy’s models are CNN-based rather than transformer-based, so they run much faster on CPUs and are more compact. The motivation here is therefore to harness the benefits of pretraining on large amounts of inexpensive unlabeled text without compromising inference speed or model footprint.
Model Description
As mentioned previously, the overall NER model is CNN-based, followed by a softmax layer that classifies entity types. The best version of the model uses a contextualized embedding component (similar to BERT, ELMo, etc.) pretrained with spaCy’s “LMAO” objective: instead of predicting the next word or a masked word, which requires an expensive softmax calculation over a large vocabulary, the model tries to predict the “embedding” of the next word (without seeing the next word).
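A rough sketch of this idea (not spaCy’s actual implementation; the vector table, dimensions, and loss below are illustrative assumptions):

import numpy as np

# Toy illustration of the LMAO-style objective: the context encoder outputs a
# predicted vector for the next word, and the loss is the distance to that
# word's static (pretrained) vector -- no softmax over the vocabulary needed.
def approximate_output_loss(predicted_vec, target_word, static_vectors):
    target_vec = static_vectors[target_word]                  # e.g. a GloVe / en_core_web_md vector
    return float(np.sum((predicted_vec - target_vec) ** 2))   # L2 loss against the target embedding

static_vectors = {'revenue': np.random.rand(300)}   # stand-in for a real vector table
predicted = np.random.rand(300)                     # what the CNN context encoder would output
print(approximate_output_loss(predicted, 'revenue', static_vectors))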
Running Pretraining
python -m spacy pretrain [texts_loc] [vectors_model] [output_dir]
- texts_loc: a newline-delimited JSON file with one {"text": "lorem…"} object per text document (a small example of producing such a file follows this list)
- vectors_model: the target embeddings that will be used as objectives when training the model to embed a context
- output_dir: the output directory to store the pretrained model and all its components
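A minimal sketch of how such a texts_loc file could be produced (the output filename and the raw-document source are assumptions):

import json

# Write one {"text": ...} JSON object per line, as `spacy pretrain` expects for texts_loc.
raw_documents = ["First unlabeled news article ...", "Second unlabeled news article ..."]  # toy corpus
with open("pretrain_texts.jsonl", "w") as out:
    for text in raw_documents:
        out.write(json.dumps({"text": text}) + "\n")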
For more info on the optional arguments (additional model hyperparameters), see the spaCy documentation. Adjusting these hyperparameters may improve the expressivity of your pretrained language model and improve downstream supervised performance.
Running training
python train.py --model=blank --data_dir=/path/to/json/entity_data --pretrained=/path/to/pretrained_dir
- model: spaCy model used to initialize training; if not 'blank', the existing NER component will be fine-tuned instead of trained from scratch
- data_dir: directory containing the labeled JSON entity data; files should be named 'entity_offsets_train.json', 'entity_offsets_dev.json', and 'entity_offsets_test.json'
- pretrained: directory containing the pretrained spacy.Tok2Vec (contextual embedder) component. The newest model.bin file in that directory will be used to initialize the spacy.Tok2Vec component.
Data Format
See /data/entity_offsets_train.json for an example of the expected data format. If you have a BIO-formatted file (where each line is a non-punctuation bounded token with a BIO label), you can convert it to the expected JSON data format using spacy_ner.file_utils.bio_to_offsets.
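A hypothetical invocation of that conversion (the BIO file path is a placeholder, and the exact on-disk structure train.py expects should be checked against /data/entity_offsets_train.json):

import json
from pathlib import Path

from spacy_ner.file_utils import bio_to_offsets

# `schema` maps column names to whitespace-separated column indices in the BIO file;
# 'Text' and 'Tag' are required, 'DocId' is optional (see file_utils.py below).
docs = bio_to_offsets(Path("data/raw/train.bio"), schema={"Text": 0, "Tag": 1})

with open("data/entity_offsets_train.json", "w") as out:
    json.dump(docs, out)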
Results
Using the pretraining functionality in spaCy v2.1 significantly improved NER performance in experiments during Anna development. Here are some comparisons of different models on our IRN dataset (for NE-ORG).
| Model | Precision | Recall | F1 |
|---|---|---|---|
| en-core-web-md (blank) | 74.8 | 65.4 | 69.8 |
| en-core-web-lg (fine-tuned) | 77.6 | 74.3 | 75.9 |
| en-core-web-md (pretrained) | 81.0 | 74.8 | 77.8 |
| ELMo | 80.7 | 77.5 | 79.0 |
We did not find additional improvements when using en-core-web-lg for pretraining. The news dataset used for pretraining our spaCy model was around 300 MB, which is quite small. It may be possible to achieve even better performance with a larger unlabeled dataset and a higher-capacity model (which can be specified by adjusting the hyperparameters when calling spacy pretrain).
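For example, a higher-capacity run could look something like the line below; the --width and --depth flag names and all paths are assumptions and should be verified against python -m spacy pretrain --help for your spaCy version.
python -m spacy pretrain pretrain_texts.jsonl en_core_web_md pretrained_output --width 128 --depth 8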
spacy_ner/file_utils.py
from pathlib import Path

import spacy
from spacy.tokens import Doc


def init_directory(path):
    """Create the directory at `path` if it does not already exist."""
    Path(path).mkdir(parents=True, exist_ok=True)


def find_newest_file(dir_path, ext=".bin"):
    """Return the most recently modified file with suffix `ext` in `dir_path`, or None."""
    dir_path = Path(dir_path)
    if not dir_path.exists():
        return None
    files = [p for p in dir_path.iterdir() if p.is_file() and p.suffix == ext]
    if not files:
        return None
    return max(files, key=lambda p: p.stat().st_mtime)
def bio_to_biluo(tags):
    """Convert a sequence of BIO tags into BILUO tags (as used by spaCy)."""
    tags_biluo = []
    prev_tag = 'O'
    for t in tags:
        # An 'O' or a new 'B-' tag means the previous entity (if any) has ended,
        # so retroactively mark its last token as L- (end) or U- (single token).
        if t == 'O' or t.startswith('B-'):
            if prev_tag.startswith('I-'):
                tags_biluo[-1] = 'L-' + prev_tag[2:]
            elif prev_tag.startswith('B-'):
                tags_biluo[-1] = 'U-' + prev_tag[2:]
        tags_biluo.append(t)
        prev_tag = t
    # Close off an entity that runs to the end of the sequence.
    if prev_tag.startswith('B-'):
        tags_biluo[-1] = 'U-' + prev_tag[2:]
    elif prev_tag.startswith('I-'):
        tags_biluo[-1] = 'L-' + prev_tag[2:]
    return tags_biluo
def bio_to_offsets(path, schema, doc_delimiter='-DOCSTART-'):
    """Convert a BIO-formatted file into docs with character-offset entity annotations.

    `schema` maps column names to whitespace-separated column indices in the file;
    'Text' and 'Tag' are required, 'DocId' is optional.
    """
    docs = []
    nlp = spacy.blank('en')

    def flush(curr_doc):
        # Turn the accumulated tokens/tags of one document into entity offsets.
        if not curr_doc['Text']:
            return
        doc_id = curr_doc['DocId'][0] if 'DocId' in schema else len(docs)
        doc = Doc(vocab=nlp.vocab, words=curr_doc['Text'])
        offsets = spacy.gold.offsets_from_biluo_tags(doc, bio_to_biluo(curr_doc['Tag']))
        docs.append({'Text': doc.text, 'Entities': offsets, 'DocId': doc_id})

    curr_doc = {col: [] for col in schema}
    with Path(path).open('r') as bio_file:
        for line in bio_file:
            if line.startswith(doc_delimiter):
                # A delimiter marks the start of a new document: flush the previous one.
                flush(curr_doc)
                curr_doc = {col: [] for col in schema}
                continue
            line = line.strip()
            if not line:
                continue
            line_vals = line.split()
            for column, col_idx in schema.items():
                curr_doc[column].append(line_vals[col_idx])
    flush(curr_doc)  # the final document has no trailing delimiter
    return docs
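As a quick sanity check of bio_to_biluo, a toy example (values are illustrative only):

# A single-token entity becomes U-, a multi-token entity becomes B- ... L-.
tags = ['B-ORG', 'O', 'B-PER', 'I-PER', 'I-PER']
print(bio_to_biluo(tags))
# ['U-ORG', 'O', 'B-PER', 'I-PER', 'L-PER']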