V2.1.0 - MWE detection, detokenizer, improved whitespace/punctuation handling

@amir-zeldes amir-zeldes released this 26 Oct 14:41
· 340 commits to master since this release
57c668e
  • Multiword expression detection based on Coptic Dictionary Online (use -m option)
  • Detokenizer to auto-adjust bound groups to Layton's segmentation standards:
    • -d 1 = conservative (re-bind only high-certainty groups arising from alternative editorial practices)
    • -d 2 = aggressive (re-bind anything that doesn't look like it should be separate)
    • --segment_merged option: enforces a boundary at detokenized merge positions
  • Improved --space option to separate punctuation written together with bound groups
  • Various bug fixes and performance improvements
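
As an illustration of the dictionary-based multiword expression detection listed above, the following is a minimal sketch of greedy longest-match lookup against a lexicon of multiword entries. It is not the tool's actual implementation: the function name `find_mwes`, the lexicon format, and the English placeholder lemmas are all assumptions made for the example, and the real detection may work differently.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def find_mwes(tokens: Sequence[str],
              lexicon: Iterable[Tuple[str, ...]]) -> List[Tuple[int, int]]:
    """Return (start, end) spans, end exclusive, of lexicon entries in tokens.

    Greedy strategy: at each position the longest matching entry wins,
    and matched spans never overlap. (Hypothetical helper, not part of
    the released tool.)
    """
    # Keep only true multiword entries (two or more items).
    entries: Set[Tuple[str, ...]] = {tuple(e) for e in lexicon if len(e) > 1}
    max_len = max((len(e) for e in entries), default=0)

    spans: List[Tuple[int, int]] = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate window first, down to length 2.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in entries:
                spans.append((i, i + n))
                i += n  # skip past the match so spans do not overlap
                matched = True
                break
        if not matched:
            i += 1
    return spans


# Placeholder English lemmas stand in for real dictionary entries.
lexicon = [("give", "up"), ("take", "care", "of")]
tokens = ["they", "take", "care", "of", "him", "and", "give", "up"]
print(find_mwes(tokens, lexicon))  # [(1, 4), (6, 8)]
```

Preferring the longest match keeps a three-item entry from being shadowed by a shorter entry that starts at the same position; a production system would typically match on lemmas rather than surface forms.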