v0.6

@kosloot kosloot released this 05 Jun 10:49

Intermediate release, with a lot of new code to handle n-grams.
A lot of refactoring was also done, for clearer and more maintainable code.
This is still work in progress.

  • TICCL-unk:

    • more extensive acronym detection
    • fixed artifreq problems in 'clean' punctuated words
    • added filters for 'unwanted' characters
    • added a ligature filter to convert evil ligatures
    • normalize all hyphens to a 'normal' one (-)
    • use a better definition of punctuation (the Unicode character class
      alone is not good enough to decide)
  • TICCL-lexstat:

    • the 'separator' symbol should get freq=0, so it isn't counted
    • the clip value is added to the output filename
  • TICCL-indexer:

    • indexer and indexerNT now produce the same output, using different
      strategies when a --foci file is used.
  • TICCL-LDcalc:
    major overhaul for n-grams

    • added an ngram point column to the output (so NOT backward compatible!)
    • produce a '.short' list for short word corrections
    • produce a '.ambi' file with a list of n-grams related to short words
    • prune a lot of ngrams from the output
  • TICCL-rank:

    • output is sorted now
    • honor the ngram points from the new LDcalc (so NOT backward compatible!)
  • TICCL-chain: new module to chain ranked files

  • TICCL-lexclean:

    • added a -x option for 'inverse' alphabet

  • TICCL-anahash:

    • added a --list option to produce a list of words and anagram values
  • added metadata file: codemeta.json
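
To illustrate the kind of cleanup the TICCL-unk hyphen and ligature items describe, here is a minimal Python sketch. It is not TICCL's own (C++) code: the function names and the set of hyphen-like characters are assumptions made for this example, and the actual filters in TICCL-unk may cover a different set of characters.

```python
import unicodedata

# Assumed set of hyphen-like code points, chosen for illustration only;
# TICCL-unk may normalize a different set.
HYPHEN_LIKE = {
    "\u2010",  # HYPHEN
    "\u2011",  # NON-BREAKING HYPHEN
    "\u2012",  # FIGURE DASH
    "\u2013",  # EN DASH
    "\u2014",  # EM DASH
    "\u2212",  # MINUS SIGN
}


def normalize_hyphens(text: str) -> str:
    """Replace every hyphen-like character with the ASCII hyphen-minus."""
    return "".join("-" if ch in HYPHEN_LIKE else ch for ch in text)


def expand_ligatures(text: str) -> str:
    """Expand the Latin ligatures U+FB00..U+FB06 into plain letters.

    Only that range is touched, so other characters are left as they are.
    """
    out = []
    for ch in text:
        if "\ufb00" <= ch <= "\ufb06":
            # NFKC maps e.g. U+FB01 to the two letters "fi".
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(normalize_hyphens("well\u2013known"))
    print(expand_ligatures("o\ufb03ce"))
```

Restricting the ligature expansion to one small range, instead of applying NFKC to the whole string, keeps the sketch from rewriting unrelated characters such as superscripts or full-width forms.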