This project demonstrates how to improve spaCy's pretrained models by augmenting the training data and adapting it to a different domain.
The `project.yml` defines the data assets required by the project, as well as the available commands and workflows. For details, see the Weasel documentation.
The following commands are defined by the project. They can be executed using `weasel run [name]`. Commands are only re-run if their inputs have changed.
| Command | Description |
| --- | --- |
| `download` | Download models |
| `preprocess` | Convert raw inputs into spaCy's binary format |
| `decompress` | Decompress relevant assets that will be used later by our weak supervision model |
| `augment` | Augment an input dataset via weak supervision, then split it into training and evaluation datasets |
| `train` | Train a named entity recognition model |
| `train-with-augmenter` | Train a named entity recognition model with a spaCy built-in augmenter |
| `evaluate` | Evaluate various experiments and export the computed metrics |
| `package` | Package the trained model so it can be installed |
The following workflows are defined by the project. They can be executed using `weasel run [name]` and will run the specified commands in order. Commands are only re-run if their inputs have changed.
| Workflow | Steps |
| --- | --- |
| `all` | `download` → `preprocess` → `decompress` → `augment` → `train` → `train-with-augmenter` → `evaluate` |
| `setup` | `download` → `preprocess` → `decompress` |
| `finetune` | `augment` → `train` → `train-with-augmenter` → `evaluate` |
The following assets are defined by the project. They can be fetched by running `weasel assets` in the project directory.
| File | Source | Description |
| --- | --- | --- |
| `assets/train_raw.jsonl` | Local | The training dataset to augment. Later we'll divide this further to create a dev set |
| `assets/test_annotated.jsonl` | Local | The held-out test dataset to evaluate our output against |
| `assets/btc.tar.gz` | URL | A model trained on the Broad Twitter Corpus to help in annotation (63.7 MB) |
| `assets/crunchbase.json.gz` | URL | A list of Crunchbase entities to aid the gazetteer annotator (8.56 MB) |
| `assets/wikidata_tokenised.json.gz` | URL | A list of Wikipedia entities to aid the gazetteer annotator (21.1 MB) |
| `assets/first_names.json` | URL | A list of first names to help our heuristic annotator |
| `assets/en_orth_variants.json` | URL | Orth variants to use for data augmentation |
Weak supervision is the practice of labelling a dataset using imprecise annotators. Instead of labelling the data manually ourselves (which takes time and effort), we encode business rules or heuristics to do the labelling for us.
Of course, these rules don't capture all the nuances a human annotator can. The idea behind weak supervision is to pool these naive annotators together and come up with a unified annotator that can hopefully capture those nuances. The pooling is done via a Hidden Markov Model (HMM).
Once we have the unified annotator, we re-annotate our entire training dataset and use that for our downstream task, which in this case is fine-tuning spaCy's `en_core_web_lg` model.
In this project, we will be using skweak as our weak supervision framework. It contains primitives to define our own labelling functions, and it's well-integrated with spaCy! These labelling functions can be defined as one of the following:
| Annotator Type | What it does |
| --- | --- |
| Model | Makes use of existing models trained on other datasets. Here we can also use spaCy's pretrained models like `en_core_web_lg`. |
| Gazetteer | Looks up entities from a list. The list will always depend on your use case. |
| Heuristics | Encodes business rules based on one's domain. You can check for tokens that fall under a specific rule (`TokenConstraintAnnotator`) or look for spans of entities in a sentence (`SpanConstraintAnnotator`). |
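To make this concrete, here is a minimal sketch of one labelling function of each type using skweak's primitives (class and module names follow the skweak documentation; the `en_core_web_sm` model, the toy gazetteer, and the adjacent-proper-noun rule are illustrative stand-ins, not this project's actual annotators):

```python
import spacy
from skweak import gazetteers, heuristics
from skweak.spacy import ModelAnnotator

nlp = spacy.load("en_core_web_sm")  # a pipeline with a tagger, for POS tags
doc = nlp("Elon Musk replied to Mark de Castro's tweet.")

# Model annotator: reuse an existing pretrained pipeline as a labelling function
model_lf = ModelAnnotator("core_web_sm", "en_core_web_sm")

# Gazetteer annotator: look up entities from a (toy) list of known names
trie = gazetteers.Trie([("Elon", "Musk"), ("Mark", "Zuckerberg")])
gazetteer_lf = gazetteers.GazetteerAnnotator("tech_people", {"PERSON": trie})

# Heuristic annotator: a generator that yields (start, end, label) spans
def adjacent_propn(doc):
    for tok in doc[:-1]:
        if tok.pos_ == "PROPN" and tok.nbor(1).pos_ == "PROPN":
            yield tok.i, tok.i + 2, "PERSON"

heuristic_lf = heuristics.FunctionAnnotator("adjacent_propn", adjacent_propn)

# Apply each labelling function to the document
for lf in (model_lf, gazetteer_lf, heuristic_lf):
    doc = lf(doc)
```

Each annotator stores its (possibly conflicting) guesses under its own key in `doc.spans`, which is what the aggregation step consumes.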
The unified annotator is then built using a Hidden Markov Model, in this case via the hmmlearn package. Once we have our final model, we re-annotate the dataset and resume training via `spacy train`.
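Continuing the sketch above, and assuming `docs` is a list of `Doc` objects that have already been run through the labelling functions, the aggregation and export step might look like this (the label set and output path are illustrative):

```python
from skweak import aggregation, utils

# Fit the HMM aggregator over the labelling-function outputs, then
# re-annotate every document with the unified ("hmm") predictions
hmm = aggregation.HMM("hmm", ["PERSON"])
docs = hmm.fit_and_aggregate(docs)

# Treat the unified annotations as gold entities and export them in
# spaCy's binary format, ready for `spacy train`
for doc in docs:
    doc.ents = doc.spans["hmm"]
utils.docbin_writer(docs, "corpus/train.spacy")
```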
For our dataset, we have a collection of tweets from the TweetEval benchmark. Because tweets differ in structure and form from the datasets the original spaCy models were trained on, it makes sense to adapt to this domain. Here are some annotators I created:
| Annotator | Annotator Type | Heuristic |
| --- | --- | --- |
| Model-based annotator from `en_core_web_lg` | Model | We can reuse spaCy models to bootstrap our annotation process. By doing this, we don't have to start entirely from scratch. |
| Broad Twitter Corpus model | Model | Because our dataset is made of tweets, it makes sense to bootstrap from a model trained on the same domain. |
| List of Crunchbase personalities | Gazetteer | Obtaining a list of tech personalities is timely, especially given the context of some tweets. |
| Proper names annotator | Heuristics | A simple heuristic that checks if two proper names are joined by an infix (e.g. Vincent van Gogh, Mark de Castro, etc.); see the sketch below. |
| Full names annotator | Heuristics | A simple heuristic for finding full names. It checks if a first name exists in a list of names and is followed by a proper name. |
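As a rough illustration of the proper names annotator referenced above (a hypothetical reconstruction; the project's actual rules live in its scripts, and the infix list here is a guess):

```python
from skweak import heuristics

# Particles that can join two proper nouns into one name; illustrative only
INFIXES = {"van", "de", "von", "der", "dela"}

def proper_names_detector(doc):
    # Assumes the docs carry POS tags from a pipeline with a tagger
    for tok in doc[:-2]:
        if (tok.pos_ == "PROPN"
                and tok.nbor(1).lower_ in INFIXES
                and tok.nbor(2).pos_ == "PROPN"):
            # Span covering PROPN + infix + PROPN, e.g. "Vincent van Gogh"
            yield tok.i, tok.i + 3, "PERSON"

proper_names_lf = heuristics.FunctionAnnotator("proper_names", proper_names_detector)
```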
There are also some annotators that I tried but removed due to their detrimental effect on our evaluation scores:
| Annotator | Annotator Type | Heuristic |
| --- | --- | --- |
| List of personalities from Wikipedia | Gazetteer | Ideally, we want to include all person entities from a bigger database. However, it detects "you" as a `PERSON`, which hurt our precision. |
| Name suffix annotator | Heuristics | I wanted to capture names with "iii", "IV", etc., but it lowered precision because it detected Roman numerals that aren't part of names. |