Large Dataset and MP Optimizations
- Included SentencePiece support for tokenizing large datasets. By default, if a dataset is larger than 100,000 lines, a random 100,000-line sample is used to build the tokenizer (see the tokenizer-training sketch after this list).
- Switched to a `loky`-based backend for multiprocessing. This fixes several runtime bugs on Windows and improves overall stability when generating text across multiple processes (a minimal usage sketch also follows below).
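
A minimal sketch of what the sampled tokenizer training could look like with the `sentencepiece` Python package, whose `input_sentence_size` and `shuffle_input_sentence` options train on a random sample of the corpus. The file name, model prefix, and vocabulary size here are illustrative assumptions, not the project's actual defaults.

```python
# Minimal sketch: train a SentencePiece tokenizer on a 100k-line sample
# of a large corpus. All names and sizes below are illustrative.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="dataset.txt",          # hypothetical corpus, one document per line
    model_prefix="tokenizer",     # writes tokenizer.model / tokenizer.vocab
    vocab_size=1000,              # illustrative vocabulary size
    input_sentence_size=100_000,  # cap training to 100,000 lines...
    shuffle_input_sentence=True,  # ...drawn as a random sample of the input
)
```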
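
For the multiprocessing change, here is a minimal sketch of driving generation through `loky`'s reusable executor; `generate_one` is a hypothetical worker function, not the project's API. Because `loky` spawns fresh worker processes rather than forking, it behaves consistently on Windows.

```python
# Minimal sketch: fan generation tasks out over a reusable loky pool.
# generate_one is a hypothetical stand-in for per-process model inference.
from loky import get_reusable_executor

def generate_one(prompt: str) -> str:
    return f"generated text for: {prompt!r}"  # placeholder inference

if __name__ == "__main__":  # guard required for spawn-based workers
    executor = get_reusable_executor(max_workers=4, timeout=10)
    prompts = ["Hello", "Once upon a time", "The quick brown fox"]
    # map() distributes the prompts across the worker processes.
    for text in executor.map(generate_one, prompts):
        print(text)
```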