Backend: PAV
The pav backend implements a trainable dynamic ensemble that combines results from multiple projects. Subject suggestion requests to the ensemble backend are re-routed to the source projects. The results from the source projects are re-weighted using isotonic regression, which attempts to convert raw scores to probabilities. The regression is implemented using the PAV (pool adjacent violators) algorithm available in the scikit-learn library. It is performed separately for each concept, and the results are combined by calculating the mean of the regressed scores (i.e. estimated probabilities) for each concept.
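The re-weighting step can be illustrated with a minimal pure-Python sketch of the pool adjacent violators idea. This is not the backend's own code, which relies on scikit-learn. The function names pav and combine and the training values are hypothetical and chosen only for illustration.

```python
def pav(labels):
    """Pool adjacent violators: least-squares non-decreasing fit.

    `labels` must already be ordered by ascending raw score.
    """
    blocks = []  # each block is [sum of labels, number of labels]
    for value in labels:
        blocks.append([float(value), 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


def combine(probabilities):
    """Mean of the regressed scores that the sources gave one concept."""
    return sum(probabilities) / len(probabilities)


# Hypothetical training data for one concept from one source project,
# sorted by the raw score the source assigned (lowest first).
raw_scores = [0.05, 0.10, 0.20, 0.40, 0.60, 0.90]
was_correct = [0, 0, 0, 1, 0, 1]  # 1 = the concept was a true subject

# Estimated probability for each raw score after pooling.
estimated = pav(was_correct)
print(list(zip(raw_scores, estimated)))

# Combine the estimates two sources produced for the same concept.
print(combine([0.5, 1.0]))
```

In this toy data the raw scores 0.40 and 0.60 are pooled because the higher score was wrong while the lower one was right, so both map to the same estimated probability.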
Note: See vw_ensemble for an alternative dynamic ensemble backend that can also be further trained during use, unlike PAV.
[pav-en]
name=PAV ensemble English
language=en
backend=pav
sources=tfidf-en,maui-en
min-docs=3
limit=100
vocab=yso-en
The sources setting is a comma-separated list of projects whose results will be combined. Optional weights may be given like this:
sources=tfidf-en:1,maui-en:2
This setting would give twice as much weight to results from maui-en as to results from tfidf-en.
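As a rough illustration of how such weights could affect the combined score, the sketch below parses a sources value and computes a weighted mean. The helper names parse_sources and weighted_mean, the default weight of 1.0 for a source without an explicit weight, and the example scores are all assumptions made for this sketch, not Annif's actual implementation.

```python
def parse_sources(setting):
    """Parse 'tfidf-en:1,maui-en:2' into a list of (project, weight) pairs.

    Assumption: a source given without a weight counts as weight 1.0.
    """
    sources = []
    for item in setting.split(","):
        name, _, weight = item.strip().partition(":")
        sources.append((name, float(weight) if weight else 1.0))
    return sources


def weighted_mean(scores, sources):
    """Weighted mean of the per-source scores for a single concept."""
    total_weight = sum(weight for _, weight in sources)
    return sum(scores[name] * weight for name, weight in sources) / total_weight


sources = parse_sources("tfidf-en:1,maui-en:2")

# Hypothetical scores that the two source projects gave to one concept.
scores = {"tfidf-en": 0.3, "maui-en": 0.6}
print(weighted_mean(scores, sources))
```

With these weights the maui-en score counts twice, so the result lies closer to 0.6 than a plain mean would.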
The min-docs setting specifies how many positive examples of a concept are required in the training data in order to create a regression model for that concept. Recommended values are between 3 and 10. When not enough positive examples are available, raw scores are used instead, similar to the basic ensemble backend.
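The effect of the threshold can be sketched as a simple count over the training data. The function name concepts_with_models, the toy documents, and the reading of the threshold as "at least min-docs positive examples" are assumptions for illustration only.

```python
from collections import Counter


def concepts_with_models(gold_subjects_per_doc, min_docs):
    """Return the concepts that have enough positive examples for a model.

    Assumption: a concept qualifies when it is a gold-standard subject of
    at least `min_docs` training documents.
    """
    counts = Counter(
        concept
        for subjects in gold_subjects_per_doc
        for concept in set(subjects)  # count each document at most once
    )
    return {concept for concept, n in counts.items() if n >= min_docs}


# Hypothetical gold-standard subjects of three training documents.
training_docs = [
    {"cats", "dogs"},
    {"cats"},
    {"cats", "birds"},
]

modelled = concepts_with_models(training_docs, min_docs=3)
print(modelled)
```

In this toy data only "cats" would get a regression model. Under the fallback described above, "dogs" and "birds" would keep their raw scores.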
Load a vocabulary:
annif loadvoc pav-en /path/to/Annif-corpora/vocab/yso-en.tsv
Train the ensemble:
annif train pav-en /path/to/Annif-corpora/training/yso-finna-en.tsv.gz
Test the model with a single document:
cat document.txt | annif suggest pav-en
Evaluate a directory full of files in fulltext document corpus format:
annif eval pav-en /path/to/documents/