[WIP] train from memory. #512
First attempt to train from memory.

Nice things:
- It works.
- It's relatively simple.
- It enables shifting the paradigm so that `feed_train_chunk` could start to directly feed some inner struct of the `Trainer` (pairs for BPE, sentences for Unigram) and bypass `word_counts` later.
- `datasets` training is roughly as fast as training from a raw file (~20% slowdown on wikitext-103-raw?).

Negatives:
- Puts the burden on the caller to feed chunks large enough that parallelization actually achieves something (see the sketch below).
- Adds a new item to the `Tokenizer` struct that does not really belong there (it belongs to the trainer and could be different from words, but this state is temporary).
- In the case of `datasets` we still have a slowdown, because we have to recreate a `string` chunk from a list of strings that gets split again later.
- We accept only strings, not raw bytes, which does incur some overhead but is really not the focus here.
- The number of combinations `{train|train_and_replace}_{from_files|from_buffer}` is starting to get a bit silly.

Still missing:

Fixes #198
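For context, a rough caller-side sketch of what this first attempt implies. The names `feed_train_chunk` and `train_from_buffer` come from this PR, but the exact signatures and the surrounding setup are assumed here, not taken from the actual diff:

// Purely illustrative; `tokenizer`, `trainer` and `corpus_chunks` stand in for
// the caller's own objects, and the signatures below are assumed.
for chunk in corpus_chunks {
    // The caller is responsible for batching: each chunk has to be large
    // enough that the parallel word counting inside the call pays off.
    tokenizer.feed_train_chunk(&mut trainer, chunk.as_str())?;
}
// Training then runs on the `word_counts` state accumulated on the `Tokenizer`.
tokenizer.train_from_buffer(&mut trainer)?;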
Nice! I've been thinking about your comments and experimented a bit, and I think I might have something that avoids almost all (if not all) of the negative points stated above, without complicating things too much, while even simplifying some others.
I'm not sure I understand exactly how you would feed types specific to each
I think the way to avoid this is to actually parallelize over the iterator instead of within each element. We could absolutely be training over an array of small sentences, and I even feel like that will be the case most of the time. Plus, we can expect to get at least as many elements in the iterator as the number of threads in the pool, while we can't really expect each string to be long enough and to include some
These are not necessary with the solution below.
With #519, we remove

So in order to achieve this, I think we can have something along the lines of:

pub trait Trainer {
    [...]

    // No need to give anything else than the Model to train
    fn train(&self, model: &mut Self::Model) -> Result<Vec<AddedToken>>;

    // We feed the `Trainer` with an iterator, and give it a way to pre-process each sequence
    fn feed<I, S, F>(&mut self, iterator: I, process: F) -> Result<()>
    where
        I: Iterator<Item = S> + Send,
        S: AsRef<str> + Send,
        F: Fn(&str) -> Result<Vec<String>> + Sync;
}
// Example of impl to handle word counts. We can factorize the inside to easily
// re-use it in the various Trainers for now.
fn feed<I, S, F>(&mut self, iterator: I, process: F) -> Result<()>
where
    I: Iterator<Item = S> + Send,
    S: AsRef<str> + Send,
    F: Fn(&str) -> Result<Vec<String>> + Sync,
{
    let words: Result<HashMap<String, u32>> = iterator
        .maybe_par_bridge()
        .map(|sequence| {
            let words = process(sequence.as_ref())?;
            let mut map = HashMap::new();
            for word in words {
                map.entry(word).and_modify(|c| *c += 1).or_insert(1);
            }
            Ok(map)
        })
        .reduce(
            || Ok(HashMap::new()),
            |acc, ws| {
                let mut acc = acc?;
                for (k, v) in ws? {
                    acc.entry(k).and_modify(|c| *c += v).or_insert(v);
                }
                Ok(acc)
            },
        );

    self.words = words?;
    Ok(())
}
// On the Tokenizer, the default `train` now deals with iterators directly
pub fn train<T, I, S>(&mut self, trainer: &mut T, sequences: I) -> Result<&mut Self>
where
    T: Trainer<Model = M> + Sync,
    I: Iterator<Item = S> + Send,
    S: AsRef<str> + Send,
{
    let (lower, upper) = sequences.size_hint();
    let len = upper.unwrap_or(lower) as u64;
    let progress = if trainer.should_show_progress() {
        let progress = ProgressBar::new(len);
        progress.set_style(
            ProgressStyle::default_bar()
                .template("[{elapsed_precise}] {msg:<40!} {wide_bar} {pos:<9!}/{len:>9!}"),
        );
        progress.set_message("Pre-processing sequences");
        progress.set_draw_delta(len / 100); // Redraw only every 1%
        Some(progress)
    } else {
        None
    };

    trainer.feed(
        sequences.map(|s| {
            if let Some(progress) = &progress {
                progress.inc(1)
            }
            s
        }),
        |seq| {
            let normalized = self.do_normalize(seq.as_ref())?;
            let pre_tokenized = self.do_pre_tokenize(normalized)?;
            Ok(pre_tokenized
                .get_splits(OffsetReferential::Original, OffsetType::Byte)
                .into_iter()
                .map(|(s, _, _)| s.to_owned())
                .collect())
        },
    )?;
    if let Some(pbar) = progress {
        pbar.finish();
    }

    // let words = self.word_count(trainer, files)?;

    let special_tokens = trainer.train(&mut self.model)?;
    self.add_special_tokens(&special_tokens);

    Ok(self)
}
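For reference, caller-side usage of that sketched `train` might look like this (a minimal sketch assuming `tokenizer` and a matching `trainer` already exist; this is not the final API):

// Any in-memory collection works: `train` only needs an
// `Iterator<Item = impl AsRef<str>> + Send`, so nothing is written to disk.
let sequences = vec![
    "First sentence kept entirely in memory.".to_string(),
    "Second sentence, no temporary file involved.".to_string(),
];
tokenizer.train(&mut trainer, sequences.into_iter())?;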
// And we can have a version that trains on files
pub fn train_from_files<T>(&mut self, trainer: &mut T, files: Vec<String>) -> Result<&mut Self>
where
    T: Trainer<Model = M> + Sync,
{
    let max_read = 1_000_000;

    use crate::utils::iter::ResultShunt;
    ResultShunt::process(
        files.into_iter().flat_map(|filename| {
            match File::open(filename) {
                Ok(file) => {
                    let file = BufReader::with_capacity(max_read, file);
                    // We read new lines using this API instead of the Lines iterator
                    // on purpose. We want to keep the `\n` and potential `\r` between lines.
                    // We use an iterator to be able to chain with par_bridge.
                    itertools::Either::Left(file.lines_with_ending())
                }
                Err(e) => itertools::Either::Right(std::iter::once(Err(e))),
            }
        }),
        |iter| self.train(trainer, iter).map(|_| {}),
    )??;

    Ok(self)
}
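The `lines_with_ending` helper used above is internal to the crate; roughly the same behaviour can be obtained with `BufRead::read_line`, which keeps the trailing `\n`/`\r\n` that `BufRead::lines` would strip (a standalone sketch, not the crate's implementation):

use std::fs::File;
use std::io::{BufRead, BufReader, Result};

// Yields each line of the file with its original line ending still attached,
// so the newline information survives until pre-tokenization.
fn lines_keeping_endings(path: &str) -> Result<Vec<String>> {
    let mut reader = BufReader::new(File::open(path)?);
    let mut out = Vec::new();
    loop {
        let mut line = String::new();
        // Unlike `lines()`, `read_line` keeps the `\n` (and a preceding `\r`).
        if reader.read_line(&mut line)? == 0 {
            break; // EOF reached
        }
        out.push(line);
    }
    Ok(out)
}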
I like your proposal; my main concerns are:
Other than that, it's great. And with the inversion of control proposed in #519, I think most of the problems will be gone.
It is the
I don't think we can have a Python iterator be
Yup, that's nice.
We can convert and stream everything in Python by doing something like this:

#[args(trainer = "None")]
fn train_from_iterator(
    &mut self,
    trainer: Option<&mut PyTrainer>,
    iterator: &PyAny,
) -> PyResult<()> {
    let mut trainer =
        trainer.map_or_else(|| self.tokenizer.get_model().get_trainer(), |t| t.clone());

    let (send, recv) = std::sync::mpsc::sync_channel(256);
    let mut sender = Some(send);
    let iterator: PyIterator = iterator.iter()?;

    crossbeam::thread::scope(|s| {
        let _train_handle = s.spawn(|_| {
            self.tokenizer
                .train(&mut trainer, recv.into_iter())
                .map(|_| {})
        });

        ResultShunt::process(iterator.map(|seq| seq?.extract::<&str>()), |iter| {
            if let Some(send) = sender.take() {
                for seq in iter {
                    send.send(seq)
                        .map_err(|e| exceptions::PyException::new_err(e.to_string()))?;
                }
            }
            Ok(())
        })?
    })
    .unwrap()
}

The only problem I see with this is that it might get tricky to make it work with custom Python components, but it does not seem impossible. It would incur some bad GIL locking contention, but that's already the case wherever we use custom Python components at the moment, and it's something we can improve in many places.
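Stripped of the pyo3 details, the bridging trick above is just a bounded channel between a producer thread (standing in for the Python iterator) and a consumer that treats the receiving end as a plain Rust iterator, the way `train(&mut trainer, recv.into_iter())` does. A self-contained sketch of that pattern:

use std::sync::mpsc::sync_channel;

fn main() {
    // Bounded channel: the producer blocks once 256 items are in flight,
    // so an arbitrarily large source never gets fully buffered in memory.
    let (send, recv) = sync_channel::<String>(256);

    crossbeam::thread::scope(|s| {
        // Consumer: sees the receiving end as a plain iterator, just like
        // the training thread in the snippet above.
        let handle = s.spawn(move |_| recv.into_iter().count());

        // Producer: stands in for the Python iterator feeding sequences.
        for i in 0..1_000 {
            send.send(format!("sequence {}", i)).unwrap();
        }
        drop(send); // closing the channel lets the consumer finish

        println!("consumed {} sequences", handle.join().unwrap());
    })
    .unwrap();
}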
I don't think there's anything we can do to avoid using a type that Rust can understand here, unfortunately. When we can cast to
I'm very interested in this feature. What's the state of this? :)
Superseded by #544