ner_drugs

🪐 Weasel Project: Detecting drug names in online comments (Named Entity Recognition)

This project uses Prodigy to bootstrap an NER model to detect drug names in Reddit comments.

📋 project.yml

The project.yml defines the data assets required by the project, as well as the available commands and workflows. For details, see the Weasel documentation.

⏯ Commands

The following commands are defined by the project. They can be executed using weasel run [name]. Commands are only re-run if their inputs have changed.

| Command | Description |
| --- | --- |
| download | Download a spaCy model with pretrained vectors |
| preprocess | Convert the data to spaCy's binary format |
| train | Train a named entity recognition model |
| evaluate | Evaluate the model and export metrics |
| package | Package the trained model so it can be installed |
| visualize-model | Visualize the model's output interactively using Streamlit |
| visualize-data | Explore the annotated data in an interactive Streamlit app |

⏭ Workflows

The following workflows are defined by the project. They can be executed using weasel run [name] and will run the specified commands in order. Commands are only re-run if their inputs have changed.

| Workflow | Steps |
| --- | --- |
| all | download → preprocess → train → evaluate |

🗂 Assets

The following assets are defined by the project. They can be fetched by running weasel assets in the project directory.

| File | Source | Description |
| --- | --- | --- |
| assets/drugs_training.jsonl | Local | JSONL-formatted training data exported from Prodigy, annotated with DRUG entities (1477 examples) |
| assets/drugs_eval.jsonl | Local | JSONL-formatted development data exported from Prodigy, annotated with DRUG entities (500 examples) |
| assets/drugs_patterns.jsonl | Local | Patterns file generated with terms.teach and used to pre-highlight during annotation (118 patterns) |

📚 Data

Labelling the data with Prodigy took a few hours and was done manually using the patterns to pre-highlight suggestions. The raw text was sourced from the r/opiates subreddit.

| File | Count | Description |
| --- | --- | --- |
| drugs_patterns.jsonl | 118 | Single-token patterns created with terms.teach and terms.to-patterns. Can be used with spaCy's EntityRuler for a rule-based baseline and faster NER annotation. |
| drugs_training.jsonl | 1477 | Training data annotated with DRUG entities. |
| drugs_eval.jsonl | 500 | Evaluation data annotated with DRUG entities. |
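To compare a rule-based baseline against the trained model, one option is to score exact-match entity spans on the evaluation data. The following is a minimal sketch using only the standard library. The function name `span_prf` and the example spans are hypothetical, and the project's own evaluate command may compute its metrics differently.

```python
from typing import Dict, Iterable, Set, Tuple

# A span is (start_char, end_char, label), matching the "spans" offsets in the data.
Span = Tuple[int, int, str]


def span_prf(gold: Iterable[Set[Span]], pred: Iterable[Set[Span]]) -> Dict[str, float]:
    """Micro-averaged exact-match precision, recall and F1 over entity spans."""
    tp = fp = fn = 0
    for gold_spans, pred_spans in zip(gold, pred):
        tp += len(gold_spans & pred_spans)  # predicted and annotated
        fp += len(pred_spans - gold_spans)  # predicted but not annotated
        fn += len(gold_spans - pred_spans)  # annotated but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"p": precision, "r": recall, "f": f1}


# Hypothetical gold and predicted spans for two texts.
gold = [{(12, 17, "DRUG")}, {(0, 6, "DRUG"), (11, 15, "DRUG")}]
pred = [{(12, 17, "DRUG")}, {(0, 6, "DRUG"), (20, 24, "DRUG")}]
scores = span_prf(gold, pred)
print(scores)
```

A span only counts as correct here if its start, end and label all match, so partial overlaps are scored as one false positive plus one false negative.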

Visualize the data and model

The visualize_data.py script lets you visualize the training and evaluation datasets with displaCy.

weasel run visualize-data

The visualize_model.py script is powered by spacy-streamlit and lets you explore the trained model interactively.

weasel run visualize-model

Training and evaluation data format

The training and evaluation datasets are distributed in Prodigy's simple JSONL (newline-delimited JSON) format. Each entry contains a "text" and a list of "spans" with the "start" and "end" character offsets and the "label" of the annotated entities. The data also includes the tokenization. Here's a simplified example entry:

{
  "text": "Idk if that Xanax or ur just an ass hole",
  "tokens": [
    { "text": "Idk", "start": 0, "end": 3, "id": 0 },
    { "text": "if", "start": 4, "end": 6, "id": 1 },
    { "text": "that", "start": 7, "end": 11, "id": 2 },
    { "text": "Xanax", "start": 12, "end": 17, "id": 3 },
    { "text": "or", "start": 18, "end": 20, "id": 4 },
    { "text": "ur", "start": 21, "end": 23, "id": 5 },
    { "text": "just", "start": 24, "end": 28, "id": 6 },
    { "text": "an", "start": 29, "end": 31, "id": 7 },
    { "text": "ass", "start": 32, "end": 35, "id": 8 },
    { "text": "hole", "start": 36, "end": 40, "id": 9 }
  ],
  "spans": [
    {
      "start": 12,
      "end": 17,
      "token_start": 3,
      "token_end": 3,
      "label": "DRUG"
    }
  ],
  "_input_hash": -2128862848,
  "_task_hash": -334208479,
  "answer": "accept"
}
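As a rough illustration of how entries in this format can be read, the sketch below parses JSONL lines and turns each entry into a `(text, entities)` pair of character offsets. The helper names `entry_to_offsets` and `read_accepted` are hypothetical, and skipping entries whose "answer" is not "accept" is an assumption about how the data should be filtered, not something the project states.

```python
import json
from typing import Iterable, Iterator, List, Tuple

Entity = Tuple[int, int, str]  # (start_char, end_char, label)


def entry_to_offsets(entry: dict) -> Tuple[str, List[Entity]]:
    """Extract the text and its entity character offsets from one entry."""
    entities = [
        (span["start"], span["end"], span["label"])
        for span in entry.get("spans", [])
    ]
    return entry["text"], entities


def read_accepted(lines: Iterable[str]) -> Iterator[Tuple[str, List[Entity]]]:
    """Parse JSONL lines, keeping only accepted entries (an assumption)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry.get("answer", "accept") != "accept":
            continue
        yield entry_to_offsets(entry)


# One line of JSONL, shortened from the example entry above.
sample = json.dumps({
    "text": "Idk if that Xanax or ur just an ass hole",
    "spans": [{"start": 12, "end": 17, "token_start": 3, "token_end": 3, "label": "DRUG"}],
    "answer": "accept",
})

examples = list(read_accepted([sample]))
text, entities = examples[0]
start, end, label = entities[0]
print(text[start:end], label)
```

With spaCy, offsets like these can be turned into entity spans with `doc.char_span(start, end, label=label)`, which returns `None` when the offsets do not align with token boundaries.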

Data creation workflow

  1. Create a terminology list using 3 seed terms.
    prodigy terms.teach drugs_terms en_core_web_lg --seeds "heroin, benzos, weed"
  2. Convert the terminology list to patterns.
    prodigy terms.to-patterns drugs_terms > drugs_patterns.jsonl
  3. Manually create the training and evaluation data or use an entity ruler with the patterns to pre-highlight suggestions.
    prodigy ner.manual drugs_data en_core_web_sm ./raw_text.jsonl --label DRUG
    prodigy ner.make-gold drugs_data ./rule-based-model ./raw_text.jsonl --label DRUG --unsegmented
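The entity ruler mentioned in step 3 can be sketched as follows, assuming spaCy v3. The two patterns are hypothetical stand-ins for the contents of drugs_patterns.jsonl, written in the single-token format described in the Data section. How the ./rule-based-model directory used above was actually built is not stated here, so treat this as one possible approach.

```python
import spacy

# Hypothetical single-token patterns; the real file contains 118 of these.
patterns = [
    {"label": "DRUG", "pattern": [{"lower": "heroin"}]},
    {"label": "DRUG", "pattern": [{"lower": "xanax"}]},
]

# A blank English pipeline: tokenizer only, no statistical components.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)

doc = nlp("Idk if that Xanax or ur just an ass hole")
ents = [(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
print(ents)
```

Because the patterns match on the lowercase form of each token, "Xanax" is found regardless of capitalization. Saving such a pipeline with `nlp.to_disk(...)` produces a directory that can be loaded like any other spaCy pipeline.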