We compare and contrast the performance of two part-of-speech taggers (HMM and Brill) on in-domain and out-of-domain text samples.
Input data: POS-tagged sentences from the Georgetown University Multilayer Corpus (GUM).
The training and test files are plain-text .txt files. Each line contains a word and its POS tag, and sentences are separated by an empty line. Below is an example of the structure:
Always RB
wear VB
ballet NN
slippers NNS
. .

Stretch VB
your PRP$
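The format above can be read with a few lines of Python. The following is a minimal sketch, not the project's actual loader: the function name `read_tagged_sentences` and the `TaggedSentence` alias are hypothetical, and it assumes the word and tag on each line are separated by whitespace.

```python
from typing import Iterable, List, Tuple

# Hypothetical alias: one sentence is a list of (word, tag) pairs.
TaggedSentence = List[Tuple[str, str]]


def read_tagged_sentences(lines: Iterable[str]) -> List[TaggedSentence]:
    """Group 'word TAG' lines into sentences, splitting on blank lines."""
    sentences: List[TaggedSentence] = []
    current: TaggedSentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the sentence in progress.
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last run of whitespace so the tag is the final field.
        word, tag = line.rsplit(None, 1)
        current.append((word, tag))
    if current:
        # The input may not end with a blank line.
        sentences.append(current)
    return sentences


sample = """Always RB
wear VB
ballet NN
slippers NNS
. .

Stretch VB
your PRP$
"""
sentences = read_tagged_sentences(sample.splitlines())
```

To load a real file, the same function could be called as `read_tagged_sentences(open("data/train.txt", encoding="utf-8"))`. A line with only one field raises `ValueError` in this sketch, which makes malformed input easy to spot.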
The training data is under data/train.txt
The in-domain test data is under data/test.txt
The out-of-domain test data is under data/test_ood.txt
The POS tags follow the Penn Treebank (PTB) tagging scheme, described here
- We trained the HMM and Brill tagger on the training set and tuned each to find the best performance.
- We measured the performance of the taggers on in-domain and out-of-domain test sets.
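As an illustration of the measurement step, the sketch below computes token-level tagging accuracy. This is an assumption: the text above does not state which metric was used, and the function name `token_accuracy` is hypothetical.

```python
from typing import List, Tuple

TaggedSentence = List[Tuple[str, str]]


def token_accuracy(gold: List[TaggedSentence],
                   predicted: List[TaggedSentence]) -> float:
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted have different sentence counts")
    total = 0
    correct = 0
    for gold_sent, pred_sent in zip(gold, predicted):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("sentence lengths differ")
        for (_, gold_tag), (_, pred_tag) in zip(gold_sent, pred_sent):
            total += 1
            if gold_tag == pred_tag:
                correct += 1
    # Avoid dividing by zero on empty input.
    return correct / total if total else 0.0


gold = [[("Always", "RB"), ("wear", "VB")], [("Stretch", "VB")]]
pred = [[("Always", "RB"), ("wear", "NN")], [("Stretch", "VB")]]
accuracy = token_accuracy(gold, pred)  # 2 of 3 tags match
```

Running the same function on the in-domain and out-of-domain outputs would give two directly comparable numbers per tagger.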
The program’s output file is a .txt file in the same format as the input training file.
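Writing that format could look like the following minimal sketch. The function name `write_tagged_sentences` is hypothetical, and it assumes one space between word and tag with a single empty line between sentences, mirroring the input example.

```python
import io
from typing import List, TextIO, Tuple

TaggedSentence = List[Tuple[str, str]]


def write_tagged_sentences(sentences: List[TaggedSentence], out: TextIO) -> None:
    """Write one 'word TAG' pair per line, with a blank line between sentences."""
    for index, sentence in enumerate(sentences):
        if index > 0:
            # Separate consecutive sentences with an empty line.
            out.write("\n")
        for word, tag in sentence:
            out.write(f"{word} {tag}\n")


# Writing to an in-memory buffer keeps the example free of file-system side effects.
buffer = io.StringIO()
write_tagged_sentences(
    [[("Always", "RB"), (".", ".")], [("Stretch", "VB")]],
    buffer,
)
text = buffer.getvalue()
```

Passing an open file handle instead of the `StringIO` buffer writes the same content to disk.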
Further details and results can be found here
Leen Alzebdeh @Leen-Alzebdeh
Sukhnoor Khehra @Sukhnoor-K
- https://gist.github.com/blumonkey/007955ec2f67119e0909
- https://stats.stackexchange.com/questions/366552/nlp-various-probabilities-estimators-in-nltk
- https://www.nltk.org/_modules/nltk/tag/hmm.html
- https://gist.github.com/h-alg/4ec991f90a682c6d0a0b
- https://www.nltk.org/_modules/nltk/tag/brill.html
- https://www.nltk.org/api/nltk.tag.brill_trainer.html
- GitHub Copilot
  - main.py L:4, 13 for extracting command line args.
  - main.py L:8, 104 for creating directory of output.
Ensure Python and the Python Standard Library are installed. If Python is not already installed, follow the instructions at https://www.python.org/downloads/.
Ensure your training and test input data are in the format outlined above and stored in a directory named 'data/'. Example usage: run the following commands from the current directory.
For using the HMM tagger on in-domain data:
python3 src/main.py --tagger hmm --train data/train.txt --test data/test.txt --output output/test_hmm.txt
For using the HMM tagger on out-of-domain data:
python3 src/main.py --tagger hmm --train data/train.txt --test data/test_ood.txt --output output/test_ood_hmm.txt
For using the Brill tagger on in-domain data:
python3 src/main.py --tagger brill --train data/train.txt --test data/test.txt --output output/test_brill.txt
For using the Brill tagger on out-of-domain data:
python3 src/main.py --tagger brill --train data/train.txt --test data/test_ood.txt --output output/test_ood_brill.txt
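For readers curious how the four flags in the commands above might be parsed, here is a minimal sketch using `argparse` from the standard library. It is not the actual contents of src/main.py: the function name `build_parser`, the help strings, and the choice to make every flag required are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the --tagger, --train, --test and --output flags."""
    parser = argparse.ArgumentParser(
        description="Train a POS tagger and tag a test file."
    )
    # Only the two taggers named in this README are accepted.
    parser.add_argument("--tagger", choices=["hmm", "brill"], required=True,
                        help="which tagger to train")
    parser.add_argument("--train", required=True,
                        help="path to the training file")
    parser.add_argument("--test", required=True,
                        help="path to the test file")
    parser.add_argument("--output", required=True,
                        help="path of the tagged output file")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args([
    "--tagger", "hmm",
    "--train", "data/train.txt",
    "--test", "data/test.txt",
    "--output", "output/test_hmm.txt",
])
```

Restricting `--tagger` with `choices` means a misspelled tagger name fails immediately with a usage message instead of partway through training.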