This blog post compares spaCy and Gensim well, so I have summarized it here.
In short, “Spacy is a natural language processing library for Python designed to have fast performance, and with word embedding models built in. Gensim is a topic modelling library for Python that provides modules for training Word2Vec and other word embedding algorithms, and allows using pre-trained models.”
A distinguishing feature of spaCy is that it provides pretrained models; its website lists multiple languages: Chinese, Danish, Dutch, English, French, German, Greek, Italian, Japanese, Lithuanian, Multi-language, Norwegian Bokmål, Polish, Portuguese, Romanian, Russian, and Spanish. A fuller list is available here.
Gensim does not provide pretrained models, but it does provide pre-trained vectors trained on part of the Google News dataset (about 100 billion words); that model contains 300-dimensional vectors for 3 million words and phrases. Gensim also lets us train our own model, and a custom model is typically more powerful for domain-specific analysis. Note that a custom model trained this way can also be used with spaCy, taking advantage of the tools that spaCy provides.
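For illustration, a minimal sketch of both entry points (the model name, file paths, and toy corpus below are assumptions; the Google News vectors file must be downloaded separately, and gensim 3.x keyword names are used):

import spacy
from gensim.models import KeyedVectors, Word2Vec

# spaCy: load a pretrained pipeline that ships with word vectors
# (assumes `python -m spacy download en_core_web_md` has been run).
nlp = spacy.load('en_core_web_md')
print(nlp('revenue').vector[:5])

# Gensim: load the pre-trained Google News word2vec vectors (assumed local path).
w2v = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
print(w2v.most_similar('revenue', topn=3))

# Gensim: train a custom model on a domain corpus, then reuse its vectors in spaCy.
sentences = [['quarterly', 'revenue', 'grew'], ['net', 'income', 'fell']]  # toy corpus
custom = Word2Vec(sentences, size=300, min_count=1)  # `size=` in gensim 3.x (`vector_size=` in 4.x)
for word in custom.wv.vocab:
    nlp.vocab.set_vector(word, custom.wv[word])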
Our internal ML team has already tried spaCy for entity recognition; the following is the README file from the git repository.
spaCy-NER-IRN
Repository for training and hosting a custom NER model on IRN data using spaCy.
Overview
This repo supports spaCy v2.1’s new spacy pretrain command, which allows developers to produce a pretrained language model similar to BERT, ELMo, XLNet and others. However, spaCy’s models are CNN-based rather than transformer-based, so they run much faster on CPUs and are more compact. The motivation here is therefore to harness the benefits of pretraining on large amounts of inexpensive unlabeled text without compromising inference speed or model footprint.
Model Description
As mentioned previously, the overall NER model is CNN-based, followed by a softmax layer that classifies entity types. The best version of the model uses a contextualized embedding component (similar to BERT, ELMo, etc.) pretrained with spaCy’s “LMAO” objective: instead of predicting the next word or a masked word, which requires an expensive softmax calculation over a large vocabulary, the model tries to predict the “embedding” of the next word (without seeing the next word).
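A rough sketch of this idea (not spaCy’s actual implementation; the vector table, dimensions, and loss below are illustrative assumptions):

import numpy as np

# Toy illustration of the LMAO-style objective: the context encoder outputs a
# predicted vector for the next word, and the loss is the distance to that
# word's static (pretrained) vector -- no softmax over the vocabulary needed.
def approximate_output_loss(predicted_vec, target_word, static_vectors):
    target_vec = static_vectors[target_word]                  # e.g. a GloVe / en_core_web_md vector
    return float(np.sum((predicted_vec - target_vec) ** 2))   # L2 loss against the target embedding

static_vectors = {'revenue': np.random.rand(300)}   # stand-in for a real vector table
predicted = np.random.rand(300)                     # what the CNN context encoder would output
print(approximate_output_loss(predicted, 'revenue', static_vectors))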
Running Pretraining
python -m spacy pretrain [texts_loc] [vectors_model] [output_dir]
- texts_loc: a newline-delimited JSON file with one {"text": "lorem…"} object per text document (a small example of producing such a file follows this list)
- vectors_model: the target embeddings that will be used as objectives when training the model to embed a context
- output_dir: the output directory to store the pretrained model and all its components
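A minimal sketch of how such a texts_loc file could be produced (the output filename and the raw-document source are assumptions):

import json

# Write one {"text": ...} JSON object per line, as `spacy pretrain` expects for texts_loc.
raw_documents = ["First unlabeled news article ...", "Second unlabeled news article ..."]  # toy corpus
with open("pretrain_texts.jsonl", "w") as out:
    for text in raw_documents:
        out.write(json.dumps({"text": text}) + "\n")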
For more info on the optional arguments (additional model hyperparameters), see the spaCy documentation. Adjusting these hyperparameters may improve the expressivity of your pretrained language model and improve downstream supervised performance.
Running training
python train.py --model=blank --data_dir=/path/to/json/entity_data --pretrained=/path/to/pretrained_dir
- model: spaCy model used to initialize training; if not 'blank', the existing NER component will be fine-tuned instead of trained from scratch
- data_dir: directory containing the labeled JSON entity data; files should be named 'entity_offsets_train.json', 'entity_offsets_dev.json', and 'entity_offsets_test.json'
- pretrained: directory containing the pretrained spacy.Tok2Vec (contextual embedder) component. The newest model.bin file in that directory will be used to initialize the spacy.Tok2Vec component.
Data Format
See /data/entity_offsets_train.json for an example of the expected data format. If you have a BIO-formatted file (where each line is a non-punctuation bounded token with a BIO label), you can convert it to the expected JSON data format using spacy_ner.file_utils.bio_to_offsets.
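A hypothetical invocation of that conversion (the BIO file path is a placeholder, and the exact on-disk structure train.py expects should be checked against /data/entity_offsets_train.json):

import json
from pathlib import Path

from spacy_ner.file_utils import bio_to_offsets

# `schema` maps column names to whitespace-separated column indices in the BIO file;
# 'Text' and 'Tag' are required, 'DocId' is optional (see file_utils.py below).
docs = bio_to_offsets(Path("data/raw/train.bio"), schema={"Text": 0, "Tag": 1})

with open("data/entity_offsets_train.json", "w") as out:
    json.dump(docs, out)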
Results
Using the pretraining functionality in spaCy v2.1 significantly improved NER performance in experiments during Anna development. Here are some comparisons of different models on our IRN dataset (for NE-ORG).
| Model | Precision | Recall | F1 |
|---|---|---|---|
| en-core-web-md (blank) | 74.8 | 65.4 | 69.8 |
| en-core-web-lg (fine-tuned) | 77.6 | 74.3 | 75.9 |
| en-core-web-md (pretrained) | 81.0 | 74.8 | 77.8 |
| ELMo | 80.7 | 77.5 | 79.0 |
We did not find additional improvements when using en-core-web-lg for pretraining. The news dataset used for pretraining our spaCy model was around 300 MB, which is quite small. It may be possible to achieve even better performance with a larger unlabeled dataset and a higher-capacity model (which can be specified by adjusting the hyperparameters when calling spacy pretrain).
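For example, a higher-capacity run could look something like the line below; the --width and --depth flag names and all paths are assumptions and should be verified against python -m spacy pretrain --help for your spaCy version.
python -m spacy pretrain pretrain_texts.jsonl en_core_web_md pretrained_output --width 128 --depth 8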
spacy_ner/file_utils.py
from pathlib import Path

import spacy
from spacy.tokens import Doc


def init_directory(path):
    """Create the directory at `path` if it does not already exist."""
    Path(path).mkdir(parents=True, exist_ok=True)


def find_newest_file(dir_path, ext=".bin"):
    """Return the most recently modified file with suffix `ext` in `dir_path`, or None."""
    dir_path = Path(dir_path)
    if not dir_path.exists():
        return None
    files = [p for p in dir_path.iterdir() if p.is_file() and p.suffix == ext]
    if not files:
        return None
    return max(files, key=lambda p: p.stat().st_mtime)
def bio_to_biluo(tags):
    """Convert a sequence of BIO tags into BILUO tags (as used by spaCy)."""
    tags_biluo = []
    prev_tag = 'O'
    for t in tags:
        # An 'O' or a new 'B-' tag means the previous entity (if any) has ended,
        # so retroactively mark its last token as L- (end) or U- (single token).
        if t == 'O' or t.startswith('B-'):
            if prev_tag.startswith('I-'):
                tags_biluo[-1] = 'L-' + prev_tag[2:]
            elif prev_tag.startswith('B-'):
                tags_biluo[-1] = 'U-' + prev_tag[2:]
        tags_biluo.append(t)
        prev_tag = t
    # Close off an entity that runs to the end of the sequence.
    if prev_tag.startswith('B-'):
        tags_biluo[-1] = 'U-' + prev_tag[2:]
    elif prev_tag.startswith('I-'):
        tags_biluo[-1] = 'L-' + prev_tag[2:]
    return tags_biluo
def bio_to_offsets(path, schema, doc_delimiter='-DOCSTART-'):
    """Convert a BIO-formatted file into docs with character-offset entity annotations.

    `schema` maps column names to whitespace-separated column indices in the file;
    'Text' and 'Tag' are required, 'DocId' is optional.
    """
    docs = []
    nlp = spacy.blank('en')

    def flush(curr_doc):
        # Turn the accumulated tokens/tags of one document into entity offsets.
        if not curr_doc['Text']:
            return
        doc_id = curr_doc['DocId'][0] if 'DocId' in schema else len(docs)
        doc = Doc(vocab=nlp.vocab, words=curr_doc['Text'])
        offsets = spacy.gold.offsets_from_biluo_tags(doc, bio_to_biluo(curr_doc['Tag']))
        docs.append({'Text': doc.text, 'Entities': offsets, 'DocId': doc_id})

    curr_doc = {col: [] for col in schema}
    with Path(path).open('r') as bio_file:
        for line in bio_file:
            if line.startswith(doc_delimiter):
                # A delimiter marks the start of a new document: flush the previous one.
                flush(curr_doc)
                curr_doc = {col: [] for col in schema}
                continue
            line = line.strip()
            if not line:
                continue
            line_vals = line.split()
            for column, col_idx in schema.items():
                curr_doc[column].append(line_vals[col_idx])
    flush(curr_doc)  # the final document has no trailing delimiter
    return docs
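As a quick sanity check of bio_to_biluo, a toy example (values are illustrative only):

# A single-token entity becomes U-, a multi-token entity becomes B- ... L-.
tags = ['B-ORG', 'O', 'B-PER', 'I-PER', 'I-PER']
print(bio_to_biluo(tags))
# ['U-ORG', 'O', 'B-PER', 'I-PER', 'L-PER']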