[WIP] train from memory. #512

Closed · wants to merge 1 commit

Conversation

@Narsil (Collaborator) commented Nov 9, 2020

First attempt to train from memory.

Nice things:

  • It works
  • It's relatively simple
  • It enables a shift of paradigm, so that feed_train_chunk could later
    directly feed some inner struct of the Trainer (pairs for BPE, sentences
    for Unigram) and bypass word_counts (a rough sketch follows at the end of
    this description).
  • datasets training is almost as fast as training from a raw file (~20%
    slowdown on wikitext-103-raw?).

Negatives:

  • Puts the burden on the caller to feed chunks large enough that
    parallelization actually achieves something.
  • Adds a new field to the Tokenizer struct that does not really belong there
    (it belongs to the trainer and could be different from the word counts,
    but this state is temporary).
  • In the case of datasets we still see a slowdown, because we have to
    recreate a single string chunk from a list of strings that gets split
    again later.
  • We accept only strings, not raw bytes, which incurs some overhead, but
    that is not the focus here.
  • {train|train_and_replace}_{from_files|from_buffer} is starting to get a bit silly.

Still missing:

  • Adding the docs
  • Rust tests.
  • Node bindings

Fixes #198
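
Rough sketch of the chunk-feeding idea mentioned above (the name feed_train_chunk comes from this PR, but the signature and body below are purely illustrative, not the actual code of this commit):

// Hypothetical sketch: the caller pushes large text chunks, the tokenizer
// pre-processes them and accumulates word counts. Later, this could feed a
// trainer-specific structure (pairs for BPE, sentences for Unigram) instead.
impl Tokenizer {
    pub fn feed_train_chunk(&mut self, chunk: &str) -> Result<()> {
        for line in chunk.lines() {
            let normalized = self.do_normalize(line)?;
            let pre_tokenized = self.do_pre_tokenize(normalized)?;
            for (word, _, _) in
                pre_tokenized.get_splits(OffsetReferential::Original, OffsetType::Byte)
            {
                *self.word_counts.entry(word.to_owned()).or_insert(0) += 1;
            }
        }
        Ok(())
    }
}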

@n1t0 (Member) commented Nov 11, 2020

Nice! I've been thinking about your comments and experimented a bit, and I think I might have something that avoids almost all of the negative points stated above, without complicating things too much, and that even simplifies some other parts.

It enables a shift of paradigm, so that feed_train_chunk could later
directly feed some inner struct of the Trainer (pairs for BPE, sentences
for Unigram) and bypass word_counts.

I'm not sure I understand exactly how you would feed types specific to each Trainer without being too specific about each of them in the Tokenizer. I think it should stay the responsibility of each Trainer to process the data, and as such, having them process a bunch of &str seems like the most straightforward way to do it to me.

Puts the burden on the caller to feed chunks large enough that parallelization actually achieves something.

I think the way to avoid this is to parallelize over the iterator instead of within each element. We could absolutely be training over an array of small sentences, and I even feel like that will be the case most of the time. Plus, we can expect to get at least as many elements in the iterator as there are threads in the pool, while we can't really expect each string to be long enough and to include some \n.
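
Just to illustrate what parallelizing over the iterator means mechanically, here is a small sketch using rayon's par_bridge directly (the maybe_par_bridge used further down is assumed to be a thin wrapper over it that stays sequential when parallelism is disabled):

use rayon::iter::{ParallelBridge, ParallelIterator};

// Each item of a sequential (but Send) iterator gets dispatched to the rayon
// thread pool, so parallelism comes from the number of items rather than
// from splitting any single string.
fn count_words<I>(sequences: I) -> usize
where
    I: Iterator<Item = String> + Send,
{
    sequences
        .par_bridge()
        .map(|s| s.split_whitespace().count())
        .sum()
}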

  • Adds a new field to the Tokenizer struct that does not really belong there
    (it belongs to the trainer and could be different from the word counts,
    but this state is temporary).
  • In the case of datasets we still see a slowdown, because we have to
    recreate a single string chunk from a list of strings that gets split
    again later.

These are not necessary with the solution below. datasets can probably directly feed a NumPy array, without the need for any sort of pre-processing.

{train|train_and_replace}_{from_files|from_buffer} is starting to get a bit silly.

With #519, we remove train_and_replace and keep only train, which will train the attached Model in place. That makes everything easier.

So in order to achieve this, I think we can have something along the lines of:

pub trait Trainer {
    [...]

    // No need to give anything else than the Model to train
    fn train(&self, model: &mut Self::Model) -> Result<Vec<AddedToken>>;

    // We feed the `Trainer` with an iterator, and give it a way to pre-process each sequence
    fn feed<I, S, F>(&mut self, iterator: I, process: F) -> Result<()>
    where
        I: Iterator<Item = S> + Send,
        S: AsRef<str> + Send,
        F: Fn(&str) -> Result<Vec<String>> + Sync;
}

// Example of impl to handle word counts. We can factorize the inside to easily
// re-use it in the various Trainers for now.
fn feed<I, S, F>(&mut self, iterator: I, process: F) -> Result<()>
where
    I: Iterator<Item = S> + Send,
    S: AsRef<str> + Send,
    F: Fn(&str) -> Result<Vec<String>> + Sync,
{
    let words: Result<HashMap<String, u32>> = iterator
        .maybe_par_bridge()
        .map(|sequence| {
            let words = process(sequence.as_ref())?;
            let mut map = HashMap::new();
            for word in words {
                map.entry(word).and_modify(|c| *c += 1).or_insert(1);
            }
            Ok(map)
        })
        .reduce(
            || Ok(HashMap::new()),
            |acc, ws| {
                let mut acc = acc?;
                for (k, v) in ws? {
                    acc.entry(k).and_modify(|c| *c += v).or_insert(v);
                }
                Ok(acc)
            },
        );

    self.words = words?;
    Ok(())
}

// On the Tokenizer, the default train now deals with iterators directly
pub fn train<T, I, S>(&mut self, trainer: &mut T, sequences: I) -> Result<&mut Self>
where
    T: Trainer<Model = M> + Sync,
    I: Iterator<Item = S> + Send,
    S: AsRef<str> + Send,
{
    let (lower, upper) = sequences.size_hint();
    let len = upper.unwrap_or(lower) as u64;
    let progress = if trainer.should_show_progress() {
        let progress = ProgressBar::new(len);
        progress.set_style(
            ProgressStyle::default_bar()
                .template("[{elapsed_precise}] {msg:<40!} {wide_bar} {pos:<9!}/{len:>9!}"),
        );
        progress.set_message("Pre-processing sequences");
        progress.set_draw_delta(len / 100); // Redraw only every 2%
        Some(progress)
    } else {
        None
    };

    trainer.feed(
        sequences.map(|s| {
            if let Some(progress) = &progress {
                progress.inc(1)
            }
            s
        }),
        |seq| {
            let normalized = self.do_normalize(seq.as_ref())?;
            let pre_tokenized = self.do_pre_tokenize(normalized)?;
            Ok(pre_tokenized
                .get_splits(OffsetReferential::Original, OffsetType::Byte)
                .into_iter()
                .map(|(s, _, _)| s.to_owned())
                .collect())
        },
    )?;
    if let Some(pbar) = progress {
        pbar.finish();
    }

    // let words = self.word_count(trainer, files)?;
    let special_tokens = trainer.train(&mut self.model)?;
    self.add_special_tokens(&special_tokens);

    Ok(self)
}
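
As a usage sketch (not part of the code above, just an illustration assuming a BpeTrainer attached to a BPE model), training from an in-memory list of sentences would then look like:

let mut trainer = BpeTrainer::default();
let data = vec![
    "the first training sentence".to_string(),
    "the second training sentence".to_string(),
];
// Any Send iterator of string-like items works, no file involved.
tokenizer.train(&mut trainer, data.into_iter())?;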

// And we can have a version that trains on files
pub fn train_from_files<T>(&mut self, trainer: &mut T, files: Vec<String>) -> Result<&mut Self>
where
    T: Trainer<Model = M> + Sync,
{
    let max_read = 1_000_000;
    use crate::utils::iter::ResultShunt;
    ResultShunt::process(
        files.into_iter().flat_map(|filename| {
            match File::open(filename) {
                Ok(file) => {
                    let file = BufReader::with_capacity(max_read, file);
                    // We read new lines using this API instead of the Lines Iterator
                    // on purpose. We want to keep the `\n` and potential `\r` between each line.
                    // We use an iterator to be able to chain with par_bridge.
                    itertools::Either::Left(file.lines_with_ending())
                }
                Err(e) => itertools::Either::Right(std::iter::once(Err(e))),
            }
        }),
        |iter| self.train(trainer, iter).map(|_| {}),
    )??;
    Ok(self)
}
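
And the file-based variant keeps the previous calling convention (same sketch assumptions as above; the file path is only an example):

tokenizer.train_from_files(&mut trainer, vec!["wiki.train.raw".to_string()])?;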

@Narsil (Collaborator, Author) commented Nov 11, 2020

Nice! I've been thinking about your comments and experimented a bit, and I think I might have something that avoids almost all of the negative points stated above, without complicating things too much, and that even simplifies some other parts.

I like your proposal; my main concerns are:

  1. How feasible/simple it is to implement this Send iterator. I had in mind that it would need to be Sync. Cloning the strings is OK, since we need to convert them to Rust strings anyway, so only my second point remains.
  2. The accumulated cost of converting a native string to a Rust string for every sentence might be bigger than we want (to be checked).

Other than that, it's great. And with the inversion of control proposed in #519, I think most of the problems will be gone.

I'm not sure I understand exactly how you would feed types specific to each Trainer without being too specific about each of them in the Tokenizer. I think it should stay the responsibility of each Trainer to process the data, and as such, having them process a bunch of &str seems like the most straightforward way to do it to me.

It is the tokenizer's responsibility to process the data (pre_tokenize and normalize). But yes, it's the trainer's responsibility to store the struct it needs to do the actual training, and that's what I was referring to with that sentence.

Puts the burden on the caller to feed chunks large enough that parallelization actually achieves something.

I think the way to avoid this is to parallelize over the iterator instead of within each element. We could absolutely be training over an array of small sentences, and I even feel like that will be the case most of the time. Plus, we can expect to get at least as many elements in the iterator as there are threads in the pool, while we can't really expect each string to be long enough and to include some \n.

I don't think we can make a Python iterator Send; it's stateful by nature.
The problem with that solution is that for every language we need to find a way to turn a natural iterator into a Send iterator in Rust. That's why moving large String chunks around felt more natural to me (we don't have to create that Send structure for each language). We also have to take into account the cost of converting a binding-language string into a Rust string: the more we have to do that, the slower it gets (we probably need to check whether this is a real bottleneck or not).

With #519, we remove train_and_replace and keep only train, which will train the attached Model in place. That makes everything easier.

Yup that's nice.

@n1t0 (Member) commented Nov 12, 2020

I don't think we can make a Python iterator Send; it's stateful by nature.
The problem with that solution is that for every language we need to find a way to turn a natural iterator into a Send iterator in Rust. That's why moving large String chunks around felt more natural to me (we don't have to create that Send structure for each language).

We can convert and stream everything in Python by doing something like this:

#[args(trainer = "None")]
fn train_from_iterator(
    &mut self,
    trainer: Option<&mut PyTrainer>,
    iterator: &PyAny,
) -> PyResult<()> {
    let mut trainer =
        trainer.map_or_else(|| self.tokenizer.get_model().get_trainer(), |t| t.clone());
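    // The bounded channel acts as a bridge: the Python iterator is consumed on
    // this thread (it is not Send), while the scoped training thread below
    // reads from `recv.into_iter()`, which is a Send iterator.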
    let (send, recv) = std::sync::mpsc::sync_channel(256);
    let mut sender = Some(send);
    let iterator: PyIterator = iterator.iter()?;

    crossbeam::thread::scope(|s| {
        let _train_handle = s.spawn(|_| {
            self.tokenizer
                .train(&mut trainer, recv.into_iter())
                .map(|_| {})
        });

        ResultShunt::process(iterator.map(|seq| seq?.extract::<&str>()), |iter| {
            if let Some(send) = sender.take() {
                for seq in iter {
                    send.send(seq)
                        .map_err(|e| exceptions::PyException::new_err(e.to_string()))?;
                }
            }
            Ok(())
        })?
    })
    .unwrap()
}

The only problem I see with this is that it might get tricky to make it work with custom Python components, but it does not seem impossible. It would incur some bad GIL locking contention, but that's already the case wherever we use custom Python components at the moment, and it's something we can improve in many places.

We also have to take into account the cost of converting a binding-language string into a Rust string: the more we have to do that, the slower it gets (we probably need to check whether this is a real bottleneck or not).

I don't think there's anything we can do to avoid converting to a type that Rust can understand here, unfortunately. When we can cast to &str it's great, but otherwise I don't think we have another choice. That being said, even when using String I don't think it would be a bottleneck: reading a 500MB file into a String takes a few milliseconds, while processing it takes orders of magnitude more.

@marcglobality

I'm very interested in this feature. What's the state of this? :)

@n1t0 (Member) commented Nov 28, 2020

Superseded by #544

@n1t0 closed this Nov 28, 2020