Releases: CopticScriptorium/coptic-nlp
V4.0.0 - Support for Bohairic and more
- Support for Bohairic Coptic using the `--dialect` flag and automatic dialect detection
- Neural sentence splitter (Sahidic)
- Context-sensitive language-of-origin detection (e.g. distinguish Greek/Coptic ⲟⲩⲛ)
- Update models to Coptic Scriptorium corpora v6.0.0
- Updated static and transformer embeddings for Sahidic and Bohairic
- OT mode flag for entity linking (e.g. force Jesus in OT to be Joshua Son of Nun)
- Lexicon and library version updates
V3.0.0 - New tools and improved accuracy
This version introduces new and improved tools, focusing on out-of-domain accuracy and robustness:
- New 3-step normalization framework using Foma
- Added smart rebinding module (`-d 3`) by @lgessler
- New stacked segmentation, now using xgboost and better handling of ambiguous groups
- New POS tagger using Marmot
- Hyperparameter optimization
- Various data/lexicon/ruleset improvements and bugfixes
- Complete unit test suite in `run_tests.py` and evaluation suite in `eval/`
V2.2.0 - Bugfix and better interface to detokenizer
- Refactor rf_tokenizer for relative import as single file
- Smarter auto 'line' tag detection in api.py
- Adjust boundaries within thetas in 'from pipes' mode (bug fix)
- Add detokenizer to web interface
- More control over detokenizer aggressive/conservative + split norms at group merge point
- Option to merge gold trees into pipeline
V2.1.0 - MWE detection, detokenizer, improved whitespace/punctuation handling
- Multiword expression detection based on Coptic Dictionary Online (use `-m` option)
- Detokenizer to auto-adjust bound groups to Layton's segmentation standards:
  - `-d 1` = conservative (only re-bind high certainty groups from alternative editorial practices)
  - `-d 2` = aggressive (re-bind anything that doesn't look like it should be separate)
  - `--segment_merged` option: enforces a boundary at detokenized merge positions
- Improved `--space` option to separate punctuation spelled together with bound groups
- Various bug fixes and performance improvements
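
The options listed for this release can be combined on one command line. The following is a minimal sketch of how a wrapper script might assemble such a call. It rests on assumptions not stated in these notes: the entry-point name `coptic_nlp.py`, the input file `in.txt`, the helper `build_command` (hypothetical, not part of the library), and the treatment of `-m`, `--segment_merged` and `--space` as plain on/off switches. Only the flag names and the `-d` levels come from the release notes.

```python
import shlex


def build_command(infile, detok=None, mwe=False, segment_merged=False,
                  space=False, script="coptic_nlp.py"):
    """Assemble an argument list from options named in the release notes.

    `script` and `infile` are placeholders (assumptions); check the
    repository README for the actual invocation.
    """
    # -d levels from the notes: 1 = conservative, 2 = aggressive (V2.1.0),
    # 3 = smart rebinding (V3.0.0). None means no -d flag is passed.
    if detok not in (None, 1, 2, 3):
        raise ValueError("detok must be None, 1, 2 or 3")

    args = ["python", script]
    if detok is not None:
        args += ["-d", str(detok)]
    if segment_merged:
        # Assumed to be a boolean switch: boundary at merge positions.
        args.append("--segment_merged")
    if mwe:
        # Assumed to be a boolean switch: multiword expression detection.
        args.append("-m")
    if space:
        # Assumed to be a boolean switch: separate punctuation from groups.
        args.append("--space")
    args.append(infile)
    return args


# Conservative detokenization plus multiword expression detection.
print(shlex.join(build_command("in.txt", detok=1, mwe=True)))
# prints: python coptic_nlp.py -d 1 -m in.txt
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting problems with file names that contain spaces.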