Large Dataset and MP Optimizations

@johntmyers johntmyers released this 04 Sep 20:24
· 227 commits to master since this release
e3afdbf
  • Added SentencePiece support for tokenizing large datasets. By default, if a dataset has more than 100,000 lines, a random sample of 100,000 lines is used to build the tokenizer.
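The sampling rule above can be sketched as follows. This is a minimal illustration of the described behavior, not the library's actual API; the function name and threshold constant are hypothetical.

```python
# Sketch of the release's sampling rule for tokenizer training:
# if a dataset exceeds SAMPLE_SIZE lines, build the tokenizer from
# a random sample of SAMPLE_SIZE lines instead of the full corpus.
# Names here are illustrative, not the actual API.
import random

SAMPLE_SIZE = 100_000


def tokenizer_training_lines(lines, sample_size=SAMPLE_SIZE):
    """Return the lines to train the tokenizer on."""
    if len(lines) <= sample_size:
        # Small dataset: use every line as-is.
        return lines
    # Large dataset: draw a uniform random sample without replacement.
    return random.sample(lines, sample_size)
```

The sampled lines would then be fed to SentencePiece's trainer in place of the full dataset, keeping tokenizer training time roughly constant regardless of corpus size.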

  • Switched to a loky-based backend for multiprocessing. This fixes several runtime bugs on Windows and improves overall stability for consistent multi-process text generation.
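The parallel generation pattern this backend supports looks roughly like the sketch below. It uses the standard library's `ProcessPoolExecutor` as a stand-in; loky (what the release actually adopts) is a drop-in, more robust reusable-executor alternative with better Windows behavior. The `generate_record` worker is hypothetical.

```python
# Sketch of multi-process text generation, under the assumption that
# each worker process independently produces one record.
# loky's reusable executor (used by the release) replaces the stdlib
# executor shown here for better cross-platform stability.
from concurrent.futures import ProcessPoolExecutor


def generate_record(seed):
    # Hypothetical worker: the real library would run its text
    # generation model here, seeded per task.
    return f"record-{seed}"


def generate_parallel(num_records, max_workers=2):
    """Generate records across worker processes, preserving order."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(generate_record, range(num_records)))
```

Because `map` preserves input order, results come back deterministically even though workers complete in arbitrary order, which is what makes multi-process generation "consistent" from the caller's perspective.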