This project uses Prodigy to bootstrap an NER model to detect drug names in Reddit comments.
The `project.yml` defines the data assets required by the project, as well as the available commands and workflows. For details, see the Weasel documentation.
The following commands are defined by the project. They can be executed using `weasel run [name]`. Commands are only re-run if their inputs have changed.
| Command | Description |
| --- | --- |
| `download` | Download a spaCy model with pretrained vectors |
| `preprocess` | Convert the data to spaCy's binary format |
| `train` | Train a named entity recognition model |
| `evaluate` | Evaluate the model and export metrics |
| `package` | Package the trained model so it can be installed |
| `visualize-model` | Visualize the model's output interactively using Streamlit |
| `visualize-data` | Explore the annotated data in an interactive Streamlit app |
The following workflows are defined by the project. They can be executed using `weasel run [name]` and will run the specified commands in order. Commands are only re-run if their inputs have changed.
| Workflow | Steps |
| --- | --- |
| `all` | `download` → `preprocess` → `train` → `evaluate` |
The following assets are defined by the project. They can be fetched by running `weasel assets` in the project directory.
| File | Source | Description |
| --- | --- | --- |
| `assets/drugs_training.jsonl` | Local | JSONL-formatted training data exported from Prodigy, annotated with `DRUG` entities (1477 examples) |
| `assets/drugs_eval.jsonl` | Local | JSONL-formatted development data exported from Prodigy, annotated with `DRUG` entities (500 examples) |
| `assets/drugs_patterns.jsonl` | Local | Patterns file generated with `terms.teach` and used to pre-highlight during annotation (118 patterns) |
Labelling the data with Prodigy took a few hours and was done manually using the patterns to pre-highlight suggestions. The raw text was sourced from the r/opiates subreddit.
| File | Count | Description |
| --- | --- | --- |
| `drugs_patterns.jsonl` | 118 | Single-token patterns created with `terms.teach` and `terms.to-patterns`. Can be used with spaCy's `EntityRuler` for a rule-based baseline and faster NER annotation. |
| `drugs_training.jsonl` | 1477 | Training data annotated with `DRUG` entities. |
| `drugs_eval.jsonl` | 500 | Evaluation data annotated with `DRUG` entities. |
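
To give a feel for how single-token patterns can act as a rule-based baseline, here is a minimal plain-Python sketch. It is a stand-in for the matching that spaCy's `EntityRuler` would do, not the ruler itself. It assumes each line of the patterns file looks like `{"label": "DRUG", "pattern": [{"lower": "heroin"}]}`, and it uses a naive regex word splitter instead of spaCy's tokenizer. The two sample patterns and the function names are illustrative and not taken from the project.

```python
import json
import re

# Illustrative pattern lines. The real drugs_patterns.jsonl is ASSUMED to use
# this shape: one JSON object per line with a single-token "lower" pattern.
SAMPLE_PATTERNS = """\
{"label": "DRUG", "pattern": [{"lower": "heroin"}]}
{"label": "DRUG", "pattern": [{"lower": "xanax"}]}
"""


def load_single_token_patterns(lines):
    """Map each lowercase term to its label, skipping anything that is not
    a single-token {"lower": ...} pattern."""
    terms = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        pattern = entry["pattern"]
        if isinstance(pattern, list) and len(pattern) == 1 and "lower" in pattern[0]:
            terms[pattern[0]["lower"]] = entry["label"]
    return terms


def match_spans(text, terms):
    """Return spans with character offsets for every word whose lowercase
    form is a known term."""
    spans = []
    for match in re.finditer(r"\w+", text):
        label = terms.get(match.group().lower())
        if label:
            spans.append({"start": match.start(), "end": match.end(), "label": label})
    return spans


terms = load_single_token_patterns(SAMPLE_PATTERNS.splitlines())
spans = match_spans("Idk if that Xanax or ur just an ass hole", terms)
print(spans)
```

The spans use the same `"start"`, `"end"` and `"label"` keys as the annotated data, so the output of a baseline like this could be compared against `drugs_eval.jsonl` directly.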
The `visualize_data.py` script lets you visualize the training and evaluation datasets with displaCy.
python -m spacy project run visualize-data
The `visualize_model.py` script is powered by spacy-streamlit and lets you explore the trained model interactively.
python -m spacy project run visualize-model
The training and evaluation datasets are distributed in Prodigy's simple JSONL (newline-delimited JSON) format. Each entry contains a `"text"` and a list of `"spans"` with the `"start"` and `"end"` character offsets and the `"label"` of the annotated entities. The data also includes the tokenization. Here's a simplified example entry:

    {
      "text": "Idk if that Xanax or ur just an ass hole",
      "tokens": [
        { "text": "Idk", "start": 0, "end": 3, "id": 0 },
        { "text": "if", "start": 4, "end": 6, "id": 1 },
        { "text": "that", "start": 7, "end": 11, "id": 2 },
        { "text": "Xanax", "start": 12, "end": 17, "id": 3 },
        { "text": "or", "start": 18, "end": 20, "id": 4 },
        { "text": "ur", "start": 21, "end": 23, "id": 5 },
        { "text": "just", "start": 24, "end": 28, "id": 6 },
        { "text": "an", "start": 29, "end": 31, "id": 7 },
        { "text": "ass", "start": 32, "end": 35, "id": 8 },
        { "text": "hole", "start": 36, "end": 40, "id": 9 }
      ],
      "spans": [
        {
          "start": 12,
          "end": 17,
          "token_start": 3,
          "token_end": 3,
          "label": "DRUG"
        }
      ],
      "_input_hash": -2128862848,
      "_task_hash": -334208479,
      "answer": "accept"
    }

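
As a rough illustration of how an entry like this can be consumed, the sketch below parses one JSONL line, recovers the entity text from the character offsets, and checks that each span lines up with the token boundaries. The function name `read_entities` is hypothetical, the token list is cut after the annotated token to keep the example short, and the code assumes that a token's `"id"` equals its position in the `"tokens"` list.

```python
import json

# A single JSONL line in the format shown above. The token list is truncated
# after the annotated token, and the hash fields are left out.
line = json.dumps({
    "text": "Idk if that Xanax or ur just an ass hole",
    "tokens": [
        {"text": "Idk", "start": 0, "end": 3, "id": 0},
        {"text": "if", "start": 4, "end": 6, "id": 1},
        {"text": "that", "start": 7, "end": 11, "id": 2},
        {"text": "Xanax", "start": 12, "end": 17, "id": 3},
    ],
    "spans": [
        {"start": 12, "end": 17, "token_start": 3, "token_end": 3, "label": "DRUG"}
    ],
    "answer": "accept",
})


def read_entities(jsonl_line):
    """Return (entity_text, label) pairs for one accepted entry."""
    entry = json.loads(jsonl_line)
    if entry.get("answer") != "accept":
        return []
    text = entry["text"]
    tokens = entry.get("tokens", [])
    entities = []
    for span in entry.get("spans", []):
        # Character offsets are end-exclusive, so slicing recovers the text.
        entity_text = text[span["start"]:span["end"]]
        if tokens:
            # token_start and token_end are inclusive token indices
            # (assumes token "id" matches the list position).
            first = tokens[span["token_start"]]
            last = tokens[span["token_end"]]
            assert first["start"] == span["start"], "span starts inside a token"
            assert last["end"] == span["end"], "span ends inside a token"
        entities.append((entity_text, span["label"]))
    return entities


entities = read_entities(line)
print(entities)
```

To process a whole file, the same function could be applied to each non-empty line of `drugs_training.jsonl` or `drugs_eval.jsonl`.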
- Create a terminology list using 3 seed terms.
prodigy terms.teach drugs_terms en_core_web_lg --seeds "heroin, benzos, weed"
- Convert the terminology list to patterns.
prodigy terms.to-patterns drugs_terms > drugs_patterns.jsonl
- Manually create the training and evaluation data, or use an entity ruler with the patterns to pre-highlight suggestions.
prodigy ner.manual drugs_data en_core_web_sm ./raw_text.jsonl --label DRUG
prodigy ner.make-gold drugs_data ./rule-based-model ./raw_text.jsonl --label DRUG --unsegmented
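
The annotation recipes above read their input from `./raw_text.jsonl`. How that file was produced is not described here; the sketch below shows one possible way to build it, assuming the input format is one JSON object per line with a `"text"` key. The `write_raw_text` helper and the placeholder comments are hypothetical and stand in for however the Reddit comments were actually collected.

```python
import io
import json


def write_raw_text(comments, out):
    """Write one {"text": ...} JSON object per line to `out`, skipping
    blank comments and exact duplicates. Returns the number of lines written."""
    seen = set()
    written = 0
    for comment in comments:
        # Collapse runs of whitespace (including newlines) into single spaces.
        text = " ".join(comment.split())
        if not text or text in seen:
            continue
        seen.add(text)
        out.write(json.dumps({"text": text}) + "\n")
        written += 1
    return written


# Placeholder input; in practice this would be the collected comment texts.
comments = [
    "first placeholder comment",
    "",
    "first  placeholder\ncomment",
    "second placeholder comment",
]

buffer = io.StringIO()
count = write_raw_text(comments, buffer)
print(count)
print(buffer.getvalue(), end="")
```

To write an actual file instead of an in-memory buffer, pass a handle opened with `open("raw_text.jsonl", "w", encoding="utf8")` as `out`.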