This project uses spaCy with annotated data from Prodigy to train a binary text classifier that predicts whether a GitHub issue title is about documentation. The pipeline uses the `textcat_multilabel` component to train a binary classifier with a single label, which can be `True` or `False` for each document. An equivalent alternative would be the `textcat` component with two labels, where exactly one of the two labels is `True` for each document.
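As a sketch of how these two label schemes represent the same binary decision, here are the per-document category score dictionaries (`doc.cats`-style) each component would produce. The `OTHER` label name is an assumption for illustration; only `DOCUMENTATION` appears in this project's data.

```python
# Two equivalent ways to encode one binary decision as spaCy-style
# category scores (the shape of a Doc's .cats dictionary).

def multilabel_cats(is_docs: bool) -> dict:
    # textcat_multilabel: one independent label; its score is read
    # as True/False against a threshold (typically 0.5).
    return {"DOCUMENTATION": 1.0 if is_docs else 0.0}

def exclusive_cats(is_docs: bool) -> dict:
    # textcat: two mutually exclusive labels; exactly one is 1.0.
    # "OTHER" is a hypothetical name for the negative class.
    return {
        "DOCUMENTATION": 1.0 if is_docs else 0.0,
        "OTHER": 0.0 if is_docs else 1.0,
    }

print(multilabel_cats(True))    # {'DOCUMENTATION': 1.0}
print(exclusive_cats(False))    # {'DOCUMENTATION': 0.0, 'OTHER': 1.0}
```

Both encodings carry the same information; the multilabel variant simply leaves the negative class implicit.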
The `project.yml` defines the data assets required by the project, as well as the available commands and workflows. For details, see the Weasel documentation.
The following commands are defined by the project. They can be executed using `weasel run [name]`. Commands are only re-run if their inputs have changed.
| Command | Description |
| --- | --- |
| `preprocess` | Convert the data to spaCy's binary format |
| `train` | Train a text classification model |
| `evaluate` | Evaluate the model and export metrics |
The following workflows are defined by the project. They can be executed using `weasel run [name]` and will run the specified commands in order. Commands are only re-run if their inputs have changed.
| Workflow | Steps |
| --- | --- |
| `all` | `preprocess` → `train` → `evaluate` |
The following assets are defined by the project. They can be fetched by running `weasel assets` in the project directory.
| File | Source | Description |
| --- | --- | --- |
| `assets/docs_issues_training.jsonl` | Local | JSONL-formatted training data exported from Prodigy, annotated with `DOCUMENTATION` (661 examples) |
| `assets/docs_issues_eval.jsonl` | Local | JSONL-formatted development data exported from Prodigy, annotated with `DOCUMENTATION` (500 examples) |
Labelling the data with Prodigy took about two hours and was done manually using the binary classification interface. The raw text was sourced from the GitHub API using the search queries `"docs"`, `"documentation"`, `"readme"` and `"instructions"`.
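As a hedged sketch of how such raw text could be collected, the snippet below builds GitHub search API URLs for issue titles matching the queries above. The helper name and pagination parameter are illustrative assumptions, not the project's actual collection script, and no request is made here.

```python
from urllib.parse import urlencode

def issue_search_url(query: str, page: int = 1) -> str:
    # Hypothetical helper: build a GitHub search API URL that finds
    # issues whose titles contain the query ("docs", "readme", ...).
    params = urlencode({"q": f"{query} in:title type:issue", "page": page})
    return f"https://api.github.com/search/issues?{params}"

for query in ["docs", "documentation", "readme", "instructions"]:
    print(issue_search_url(query))
```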
The training and evaluation datasets are distributed in Prodigy's simple JSONL (newline-delimited JSON) format. Each entry contains a `"text"`, the `"label"` and an `"answer"` (`"accept"` if the label applies, `"reject"` if it doesn't apply). Here are two simplified example entries:
```json
{
    "text": "Add FAQ's to the documentation",
    "label": "DOCUMENTATION",
    "answer": "accept"
}
```

```json
{
    "text": "Proposal: deprecate SQTagUtil.java",
    "label": "DOCUMENTATION",
    "answer": "reject"
}
```
The annotations were collected with the following Prodigy command:

```bash
prodigy mark docs_issues_data ./raw_text.jsonl --label DOCUMENTATION --view-id classification
```
We also trained a model using Allen AI's Autocat app (a web-based tool for training, visualizing and showcasing spaCy text classification models). You can try out the classifier in real time and see the updated predictions as you type. You can also evaluate it on your own data, download the model Python package, or `pip install` it with one command to try it locally. View the model here.
To use the JSONL data in Autocat, we added `"labels": ["DOCUMENTATION"]` to all examples with `"answer": "accept"` and `"labels": ["N/A"]` to all examples with `"answer": "reject"`.
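The transformation described above is straightforward to script. Here is a hedged sketch (the field names come from the data format above; the function itself is illustrative, not the script we actually used):

```python
def add_autocat_labels(entry: dict) -> dict:
    # Autocat expects a "labels" list on each example: the accepted
    # label for "accept" answers, and the placeholder "N/A" otherwise.
    if entry["answer"] == "accept":
        labels = [entry["label"]]
    else:
        labels = ["N/A"]
    return {**entry, "labels": labels}

entry = {"text": "Add FAQ's to the documentation",
         "label": "DOCUMENTATION", "answer": "accept"}
print(add_autocat_labels(entry)["labels"])  # ['DOCUMENTATION']
```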