This project uses spaCy with annotated data from Prodigy to train a binary text classifier that predicts whether a GitHub issue title is about documentation. The pipeline uses the `textcat_multilabel` component to train a binary classifier with a single label, which can be `True` or `False` for each document. An equivalent alternative would be the `textcat` component with two labels, where exactly one of the two labels is `True` for each document.
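As a sketch of how these two label schemes represent the same binary decision, here are the per-document category score dictionaries (`doc.cats`-style) each component would produce. The `OTHER` label name is an assumption for illustration; only `DOCUMENTATION` appears in this project's data.

```python
# Two equivalent ways to encode one binary decision as spaCy-style
# category scores (the shape of a Doc's .cats dictionary).

def multilabel_cats(is_docs: bool) -> dict:
    # textcat_multilabel: one independent label; its score is read
    # as True/False against a threshold (typically 0.5).
    return {"DOCUMENTATION": 1.0 if is_docs else 0.0}

def exclusive_cats(is_docs: bool) -> dict:
    # textcat: two mutually exclusive labels; exactly one is 1.0.
    # "OTHER" is a hypothetical name for the negative class.
    return {
        "DOCUMENTATION": 1.0 if is_docs else 0.0,
        "OTHER": 0.0 if is_docs else 1.0,
    }

print(multilabel_cats(True))    # {'DOCUMENTATION': 1.0}
print(exclusive_cats(False))    # {'DOCUMENTATION': 0.0, 'OTHER': 1.0}
```

Both encodings carry the same information; the multilabel variant simply leaves the negative class implicit.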
The `project.yml` defines the data assets required by the project, as well as the available commands and workflows. For details, see the Weasel documentation.
The following commands are defined by the project. They can be executed using `weasel run [name]`. Commands are only re-run if their inputs have changed.
| Command | Description |
| --- | --- |
| `preprocess` | Convert the data to spaCy's binary format |
| `train` | Train a text classification model |
| `evaluate` | Evaluate the model and export metrics |
The following workflows are defined by the project. They can be executed using `weasel run [name]` and will run the specified commands in order. Commands are only re-run if their inputs have changed.
| Workflow | Steps |
| --- | --- |
| `all` | `preprocess` → `train` → `evaluate` |
The following assets are defined by the project. They can be fetched by running `weasel assets` in the project directory.
| File | Source | Description |
| --- | --- | --- |
| `assets/docs_issues_training.jsonl` | Local | JSONL-formatted training data exported from Prodigy, annotated with `DOCUMENTATION` (661 examples) |
| `assets/docs_issues_eval.jsonl` | Local | JSONL-formatted development data exported from Prodigy, annotated with `DOCUMENTATION` (500 examples) |
Labelling the data with Prodigy took about two hours and was done manually using the binary classification interface. The raw text was sourced from the GitHub API using the search queries `"docs"`, `"documentation"`, `"readme"` and `"instructions"`.
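As a hedged sketch of how such raw text could be collected, the snippet below builds GitHub search API URLs for issue titles matching the queries above. The helper name and pagination parameter are illustrative assumptions, not the project's actual collection script, and no request is made here.

```python
from urllib.parse import urlencode

def issue_search_url(query: str, page: int = 1) -> str:
    # Hypothetical helper: build a GitHub search API URL that finds
    # issues whose titles contain the query ("docs", "readme", ...).
    params = urlencode({"q": f"{query} in:title type:issue", "page": page})
    return f"https://api.github.com/search/issues?{params}"

for query in ["docs", "documentation", "readme", "instructions"]:
    print(issue_search_url(query))
```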
The training and evaluation datasets are distributed in Prodigy's simple JSONL (newline-delimited JSON) format. Each entry contains a `"text"`, the `"label"` and an `"answer"` (`"accept"` if the label applies, `"reject"` if it doesn't apply). Here are two simplified example entries:
```json
{
    "text": "Add FAQ's to the documentation",
    "label": "DOCUMENTATION",
    "answer": "accept"
}
```

```json
{
    "text": "Proposal: deprecate SQTagUtil.java",
    "label": "DOCUMENTATION",
    "answer": "reject"
}
```
The annotations were collected with the following Prodigy command:

```bash
prodigy mark docs_issues_data ./raw_text.jsonl --label DOCUMENTATION --view-id classification
```
We also trained a model using Allen AI's Autocat app (a web-based tool for training, visualizing and showcasing spaCy text classification models). You can try out the classifier in real time and see the updated predictions as you type. You can also evaluate it on your own data, download the model Python package, or `pip install` it with one command to try it locally. View the model here.
To use the JSONL data in Autocat, we added `"labels": ["DOCUMENTATION"]` to all examples with `"answer": "accept"` and `"labels": ["N/A"]` to all examples with `"answer": "reject"`.
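The transformation described above is straightforward to script. Here is a hedged sketch (the field names come from the data format above; the function itself is illustrative, not the script we actually used):

```python
def add_autocat_labels(entry: dict) -> dict:
    # Autocat expects a "labels" list on each example: the accepted
    # label for "accept" answers, and the placeholder "N/A" otherwise.
    if entry["answer"] == "accept":
        labels = [entry["label"]]
    else:
        labels = ["N/A"]
    return {**entry, "labels": labels}

entry = {"text": "Add FAQ's to the documentation",
         "label": "DOCUMENTATION", "answer": "accept"}
print(add_autocat_labels(entry)["labels"])  # ['DOCUMENTATION']
```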