Large Dataset and MP Optimizations
- Included SentencePiece support for tokenizing large datasets. By default, if a dataset is larger than 100,000 lines, a random 100,000-line sample is used to build the tokenizer (see the tokenizer-training sketch after this list).
- Switched to a `loky`-based backend for multiprocessing. This fixes several runtime bugs on Windows and improves overall stability when generating text across multiple processes (a minimal usage sketch also follows below).
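
A minimal sketch of what the sampled tokenizer training could look like with the `sentencepiece` Python package, whose `input_sentence_size` and `shuffle_input_sentence` options train on a random sample of the corpus. The file name, model prefix, and vocabulary size here are illustrative assumptions, not the project's actual defaults.

```python
# Minimal sketch: train a SentencePiece tokenizer on a 100k-line sample
# of a large corpus. All names and sizes below are illustrative.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="dataset.txt",          # hypothetical corpus, one document per line
    model_prefix="tokenizer",     # writes tokenizer.model / tokenizer.vocab
    vocab_size=1000,              # illustrative vocabulary size
    input_sentence_size=100_000,  # cap training to 100,000 lines...
    shuffle_input_sentence=True,  # ...drawn as a random sample of the input
)
```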
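
For the multiprocessing change, here is a minimal sketch of driving generation through `loky`'s reusable executor; `generate_one` is a hypothetical worker function, not the project's API. Because `loky` spawns fresh worker processes rather than forking, it behaves consistently on Windows.

```python
# Minimal sketch: fan generation tasks out over a reusable loky pool.
# generate_one is a hypothetical stand-in for per-process model inference.
from loky import get_reusable_executor

def generate_one(prompt: str) -> str:
    return f"generated text for: {prompt!r}"  # placeholder inference

if __name__ == "__main__":  # guard required for spawn-based workers
    executor = get_reusable_executor(max_workers=4, timeout=10)
    prompts = ["Hello", "Once upon a time", "The quick brown fox"]
    # map() distributes the prompts across the worker processes.
    for text in executor.map(generate_one, prompts):
        print(text)
```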