This project demonstrates how to improve spaCy's pretrained models by augmenting the training data and adapting it to a different domain.
The `project.yml` defines the data assets required by the project, as well as the available commands and workflows. For details, see the Weasel documentation.
The following commands are defined by the project. They can be executed using `weasel run [name]`. Commands are only re-run if their inputs have changed.
| Command | Description |
| --- | --- |
| `download` | Download models |
| `preprocess` | Convert raw inputs into spaCy's binary format |
| `decompress` | Decompress relevant assets that will be used later by our weak supervision model |
| `augment` | Augment an input dataset via weak supervision, then split it into training and evaluation datasets |
| `train` | Train a named entity recognition model |
| `train-with-augmenter` | Train a named entity recognition model with a spaCy built-in augmenter |
| `evaluate` | Evaluate various experiments and export the computed metrics |
| `package` | Package the trained model so it can be installed |
The following workflows are defined by the project. They can be executed using `weasel run [name]` and will run the specified commands in order. Commands are only re-run if their inputs have changed.
| Workflow | Steps |
| --- | --- |
| `all` | `download` → `preprocess` → `decompress` → `augment` → `train` → `train-with-augmenter` → `evaluate` |
| `setup` | `download` → `preprocess` → `decompress` |
| `finetune` | `augment` → `train` → `train-with-augmenter` → `evaluate` |
The following assets are defined by the project. They can be fetched by running `weasel assets` in the project directory.
| File | Source | Description |
| --- | --- | --- |
| `assets/train_raw.jsonl` | Local | The training dataset to augment. Later we'll divide this further to create a dev set |
| `assets/test_annotated.jsonl` | Local | The held-out test dataset to evaluate our output against |
| `assets/btc.tar.gz` | URL | A model trained on the Broad Twitter Corpus to help in annotation (63.7 MB) |
| `assets/crunchbase.json.gz` | URL | A list of Crunchbase entities to aid the gazetteer annotator (8.56 MB) |
| `assets/wikidata_tokenised.json.gz` | URL | A list of Wikipedia entities to aid the gazetteer annotator (21.1 MB) |
| `assets/first_names.json` | URL | A list of first names to help our heuristic annotator |
| `assets/en_orth_variants.json` | URL | Orth variants to use for data augmentation |
Weak supervision is the practice of labelling a dataset using imprecise annotators. Instead of labelling the data manually ourselves (which takes time and effort), we encode business rules or heuristics to do the labelling for us.
Of course, these rules don't capture all the nuances a human annotator can. The idea behind weak supervision is to pool these naive annotators together and come up with a unified annotator that can hopefully capture those nuances. The pooling is done via a Hidden Markov Model (HMM).
Once we have the unified annotator, we re-annotate our entire training dataset and use that for our downstream task, which in this case is fine-tuning spaCy's `en_core_web_lg` model.
In this project, we will be using skweak as our weak supervision framework. It contains primitives to define our own labelling functions, and it's well-integrated with spaCy! These labelling functions can be defined as one of the following:
| Annotator Type | What it does |
| --- | --- |
| Model | Makes use of existing models trained on other datasets. Here we can also use spaCy's pretrained models like `en_core_web_lg`. |
| Gazetteer | Looks up entities from a list. The list will always depend on your use case. |
| Heuristics | Encodes business rules based on one's domain. You can check for tokens that fall under a specific rule (`TokenConstraintAnnotator`) or look for spans of entities in a sentence (`SpanConstraintAnnotator`). |
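To make this concrete, here is a minimal sketch of one labelling function of each type using skweak's primitives (class and module names follow the skweak documentation; the `en_core_web_sm` model, the toy gazetteer, and the adjacent-proper-noun rule are illustrative stand-ins, not this project's actual annotators):

```python
import spacy
from skweak import gazetteers, heuristics
from skweak.spacy import ModelAnnotator

nlp = spacy.load("en_core_web_sm")  # a pipeline with a tagger, for POS tags
doc = nlp("Elon Musk replied to Mark de Castro's tweet.")

# Model annotator: reuse an existing pretrained pipeline as a labelling function
model_lf = ModelAnnotator("core_web_sm", "en_core_web_sm")

# Gazetteer annotator: look up entities from a (toy) list of known names
trie = gazetteers.Trie([("Elon", "Musk"), ("Mark", "Zuckerberg")])
gazetteer_lf = gazetteers.GazetteerAnnotator("tech_people", {"PERSON": trie})

# Heuristic annotator: a generator that yields (start, end, label) spans
def adjacent_propn(doc):
    for tok in doc[:-1]:
        if tok.pos_ == "PROPN" and tok.nbor(1).pos_ == "PROPN":
            yield tok.i, tok.i + 2, "PERSON"

heuristic_lf = heuristics.FunctionAnnotator("adjacent_propn", adjacent_propn)

# Apply each labelling function to the document
for lf in (model_lf, gazetteer_lf, heuristic_lf):
    doc = lf(doc)
```

Each annotator stores its (possibly conflicting) guesses under its own key in `doc.spans`, which is what the aggregation step consumes.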
The unified annotator is then built using a Hidden Markov Model, in this case via the hmmlearn package. Once we have our final model, we re-annotate the dataset and resume training via `spacy train`.
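Continuing the sketch above, and assuming `docs` is a list of `Doc` objects that have already been run through the labelling functions, the aggregation and export step might look like this (the label set and output path are illustrative):

```python
from skweak import aggregation, utils

# Fit the HMM aggregator over the labelling-function outputs, then
# re-annotate every document with the unified ("hmm") predictions
hmm = aggregation.HMM("hmm", ["PERSON"])
docs = hmm.fit_and_aggregate(docs)

# Treat the unified annotations as gold entities and export them in
# spaCy's binary format, ready for `spacy train`
for doc in docs:
    doc.ents = doc.spans["hmm"]
utils.docbin_writer(docs, "corpus/train.spacy")
```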
For our dataset, we have a collection of tweets from the TweetEval benchmark. Because tweets differ in structure and form from the datasets the original spaCy models were trained on, it makes sense to adapt to this domain. Here are some annotators I created:
| Annotator | Annotator Type | Heuristic |
| --- | --- | --- |
| Model-based annotator from `en_core_web_lg` | Model | We can reuse spaCy models to bootstrap our annotation process. By doing this, we don't have to start entirely from scratch. |
| Broad Twitter Corpus model | Model | Because our dataset is made of tweets, it makes sense to bootstrap from a model trained on the same domain. |
| List of Crunchbase personalities | Gazetteer | Obtaining a list of tech personalities is timely, especially given the context of some tweets. |
| Proper names annotator | Heuristics | A simple heuristic that checks if two proper names are joined by an infix (e.g. Vincent van Gogh, Mark de Castro, etc.); see the sketch below. |
| Full names annotator | Heuristics | A simple heuristic for finding full names. It checks if a first name exists in a list of names and is followed by a proper name. |
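As a rough illustration of the proper names annotator referenced above (a hypothetical reconstruction; the project's actual rules live in its scripts, and the infix list here is a guess):

```python
from skweak import heuristics

# Particles that can join two proper nouns into one name; illustrative only
INFIXES = {"van", "de", "von", "der", "dela"}

def proper_names_detector(doc):
    # Assumes the docs carry POS tags from a pipeline with a tagger
    for tok in doc[:-2]:
        if (tok.pos_ == "PROPN"
                and tok.nbor(1).lower_ in INFIXES
                and tok.nbor(2).pos_ == "PROPN"):
            # Span covering PROPN + infix + PROPN, e.g. "Vincent van Gogh"
            yield tok.i, tok.i + 3, "PERSON"

proper_names_lf = heuristics.FunctionAnnotator("proper_names", proper_names_detector)
```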
There are also some annotators that I tried but removed due to their detrimental effect on our evaluation scores:
| Annotator | Annotator Type | Heuristic |
| --- | --- | --- |
| List of personalities from Wikipedia | Gazetteer | Ideally, we want to include all person entities from a bigger database. However, it detects "you" as a `PERSON`, which hurt our precision. |
| Name suffix annotator | Heuristics | I wanted to capture names with "iii", "IV", etc., but it lowered precision because it detected Roman numerals that aren't part of names. |