NER project for spaCy v3.
- Annotation with the Prodigy annotation tool
- Weights & Biases for logging training experiments
The project data comes from Kaggle:
- BBC (https://www.kaggle.com/hgultekin/bbcnewsarchive)
- NG (https://www.kaggle.com/salmaelanigri/doc-class)
Label scheme:
Component | Label |
---|---|
NER | PERSON |
ENTITY_RULER | EMAIL |
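
The label scheme above assigns EMAIL to a rule-based component. The README does not show the rules themselves, so the following is only a sketch of what a pattern file for such a component might look like. The pattern contents and the `write_patterns` helper are assumptions; `LIKE_EMAIL` is a token attribute that spaCy's rule-based matcher understands.

```python
import json
from pathlib import Path

# Hypothetical pattern: the project's real EMAIL rules are not shown in this README.
# LIKE_EMAIL matches a single token that looks like an email address.
EMAIL_PATTERNS = [
    {"label": "EMAIL", "pattern": [{"LIKE_EMAIL": True}]},
]


def write_patterns(path, patterns):
    """Write one JSON object per line (JSONL), a format the entity ruler can load."""
    path = Path(path)
    with path.open("w", encoding="utf8") as f:
        for pattern in patterns:
            f.write(json.dumps(pattern) + "\n")
    return path
```

A file written this way could then be loaded with something like `nlp.add_pipe("entity_ruler").from_disk(path)`, assuming the path ends in `.jsonl`.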
The `project.yml` defines the data assets required by the
project, as well as the available commands and workflows.
The following commands are defined by the project. They
can be executed using `spacy project run [name]`.
Commands are only re-run if their inputs have changed.
Command | Description |
---|---|
`data-to-spacy` | Merge your annotations and create data in spaCy's binary format |
`train_spacy` | Train a named entity recognition model with spaCy and log the results via wandb |
`train_prodigy` | Train a named entity recognition model with Prodigy |
`train_curve` | Train the model with Prodigy on increasing portions of the training examples to evaluate whether more annotations could improve performance |
`evaluate` | Evaluate the model and export metrics via spaCy |
`visualize-model` | Visualize the model's output interactively using Streamlit |
`visualize-data` | Visualize the data interactively using Streamlit |
`package` | Package the trained model so it can be installed |
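
To illustrate what "only re-run if their inputs have changed" can mean in practice, here is a minimal sketch of a content-hash check. It is not spaCy's implementation (spaCy keeps its own bookkeeping in a lock file); the function names `inputs_digest` and `needs_rerun` are hypothetical.

```python
import hashlib
from pathlib import Path


def inputs_digest(paths):
    """Return one hex digest over the names and contents of all input files."""
    h = hashlib.sha256()
    for p in sorted(Path(x) for x in paths):
        h.update(str(p).encode("utf8"))
        h.update(p.read_bytes())
    return h.hexdigest()


def needs_rerun(paths, previous_digest):
    """A command needs to run if it never ran or if any input changed since."""
    return previous_digest is None or inputs_digest(paths) != previous_digest
```

The idea is to store the digest after a successful run and compare it before the next one, so that unchanged inputs let the command be skipped.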
The following workflows are defined by the project. They
can be executed using `spacy project run [name]`
and will run the specified commands in order. Commands are only re-run if their
inputs have changed.
Workflow | Steps |
---|---|
`all` | `data-to-spacy` → `train_spacy` → `evaluate` |
`all_prodigy` | `train_prodigy` → `train_curve` |
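
As a rough illustration of how a workflow expands into its steps, the sketch below maps each workflow from the table to one `spacy project run` invocation per step. Running `spacy project run all` already does this for you; the `WORKFLOWS` dictionary and `build_commands` helper are hypothetical and only mirror the table.

```python
import sys

# Workflow names and step order copied from the table above.
WORKFLOWS = {
    "all": ["data-to-spacy", "train_spacy", "evaluate"],
    "all_prodigy": ["train_prodigy", "train_curve"],
}


def build_commands(workflow):
    """Return one argument list per step, in the order the workflow defines."""
    return [
        [sys.executable, "-m", "spacy", "project", "run", step]
        for step in WORKFLOWS[workflow]
    ]
```

Each argument list could be passed to `subprocess.run(cmd, check=True)` from the project directory, which stops at the first failing step.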
The following raw assets are defined by the project.
File | Source | Description |
---|---|---|
`assets/raw/UC1_train_meta.jsonl` | Local | JSONL-formatted raw training data (1778 docs) |
`assets/raw/UC1_eval_meta.jsonl` | Local | JSONL-formatted raw development data (593 docs) |
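
A quick way to sanity-check the document counts listed above is to count the non-empty lines of each JSONL file while parsing them. This is a minimal sketch; `count_docs` is a hypothetical helper and assumes one JSON object per line.

```python
import json
from pathlib import Path


def count_docs(path):
    """Count JSON records in a JSONL file, failing loudly on malformed lines."""
    n = 0
    with Path(path).open(encoding="utf8") as f:
        for line in f:
            if line.strip():
                json.loads(line)  # raises ValueError if the line is not valid JSON
                n += 1
    return n
```

For example, `count_docs("assets/raw/UC1_train_meta.jsonl")` should match the training count in the table if the asset is intact.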
Dataset | # Annotations | # PERSON | |
---|---|---|---|
correct_UC01_train | 3500 | 1011 | 272 |
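
Per-label counts like those in the table can be derived from exported annotations. The sketch below assumes Prodigy-style records, where each example carries an `answer` and a list of `spans` with a `label`; how the figures above were actually computed is not stated, and `label_counts` is a hypothetical helper.

```python
from collections import Counter


def label_counts(examples):
    """Count span labels over accepted examples only."""
    counts = Counter()
    for eg in examples:
        # Assumption: examples without an explicit answer count as accepted.
        if eg.get("answer", "accept") != "accept":
            continue
        for span in eg.get("spans", []):
            counts[span["label"]] += 1
    return counts
```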