Ability to train from memory #544

Merged (8 commits) on Nov 28, 2020

n1t0 (Member) commented Nov 25, 2020

Adds the ability to train from an Iterator in Rust, and anything that can be used as an iterator in Python too.

Training a tokenizer from datasets or a List[str] takes roughly as much time as training from files (cf. examples/train_with_datasets.py).
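
As a rough illustration of what "anything that can be used as an iterator" can look like on the Python side, here is a minimal, standard-library-only sketch of a batching generator over in-memory strings. The helper name batch_iterator, the batch size, and the sample corpus are assumptions made for this example; they are not taken from this PR.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batch_iterator(texts: Iterable[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield successive lists of at most `batch_size` strings from `texts`."""
    it = iter(texts)
    while True:
        # Pull the next chunk lazily, so `texts` may itself be a generator.
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical in-memory corpus; a list, a generator, or a dataset column would all work.
corpus = ["hello world", "training from memory", "no files needed"]
batches = list(batch_iterator(corpus, batch_size=2))
```

In released versions of the tokenizers Python package, an iterator like this can be handed to Tokenizer.train_from_iterator, for example tokenizer.train_from_iterator(batch_iterator(corpus), trainer=trainer, length=len(corpus)), where length is optional and, as far as I know, only used for progress reporting. The exact signature at the time of this PR may differ, so treat that call as an assumption.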

Fix #198 & Fix #524

Still need to add:

  • Documentation (API Reference + examples)

n1t0 force-pushed the trainer-experiments branch 4 times, most recently from 3580858 to 6e066d8, November 28, 2020 17:02
n1t0 force-pushed the trainer-experiments branch from 6e066d8 to f5ec740, November 28, 2020 17:13
n1t0 merged commit 49bd055 into master, Nov 28, 2020
n1t0 deleted the trainer-experiments branch, November 28, 2020 17:29
n1t0 mentioned this pull request, Nov 28, 2020
Successfully merging this pull request may close these issues:

  • Extract the word-counts to each Trainer
  • Training a model from in-memory data