[WIP] train from memory. #512

Closed · wants to merge 1 commit

Conversation

@Narsil (Collaborator) commented Nov 9, 2020

First attempt to train from memory.

Nice things:

  • It works
  • It's relatively simple
  • It enables a shift of paradigm, so that feed_train_chunk could later
    directly feed some inner struct of the Trainer (pairs for BPE, sentences
    for Unigram) and bypass word_counts (a rough sketch follows at the end of
    this description).
  • datasets training is almost as fast as training from a raw file (~20%
    slowdown on wikitext-103-raw?).

Negatives:

  • Puts the burden on the caller to feed chunks large enough that
    parallelization actually achieves something.
  • Adds a new field to the Tokenizer struct that does not really belong there
    (it belongs to the trainer and could be different from the word counts,
    but this state is temporary).
  • In the case of datasets we still see a slowdown, because we have to
    recreate a single string chunk from a list of strings that gets split
    again later.
  • We accept only strings, not raw bytes, which incurs some overhead, but
    that is not the focus here.
  • {train|train_and_replace}_{from_files|from_buffer} is starting to get a bit silly.

Still missing:

  • Adding the docs
  • Rust tests.
  • Node bindings

Fixes #198
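
Rough sketch of the chunk-feeding idea mentioned above (the name feed_train_chunk comes from this PR, but the signature and body below are purely illustrative, not the actual code of this commit):

// Hypothetical sketch: the caller pushes large text chunks, the tokenizer
// pre-processes them and accumulates word counts. Later, this could feed a
// trainer-specific structure (pairs for BPE, sentences for Unigram) instead.
impl Tokenizer {
    pub fn feed_train_chunk(&mut self, chunk: &str) -> Result<()> {
        for line in chunk.lines() {
            let normalized = self.do_normalize(line)?;
            let pre_tokenized = self.do_pre_tokenize(normalized)?;
            for (word, _, _) in
                pre_tokenized.get_splits(OffsetReferential::Original, OffsetType::Byte)
            {
                *self.word_counts.entry(word.to_owned()).or_insert(0) += 1;
            }
        }
        Ok(())
    }
}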

@n1t0 (Member) commented Nov 11, 2020

Nice! I've been thinking about your comments and experimented a bit, and I think I might have something that avoids almost all of the negative points stated above, without complicating things too much, and that even simplifies some other parts.

It enables a shift of paradigm, so that feed_train_chunk could later
directly feed some inner struct of the Trainer (pairs for BPE, sentences
for Unigram) and bypass word_counts.

I'm not sure I understand exactly how you would feed types specific to each Trainer without being too specific about each of them in the Tokenizer. I think it should stay the responsibility of each Trainer to process the data, and as such, having them process a bunch of &str seems like the most straightforward way to do it to me.

Puts the burden on the caller to feed chunks large enough that parallelization actually achieves something.

I think the way to avoid this is to parallelize over the iterator instead of within each element. We could absolutely be training over an array of small sentences, and I even feel like that will be the case most of the time. Plus, we can expect to get at least as many elements in the iterator as there are threads in the pool, while we can't really expect each string to be long enough and to include some \n.
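
Just to illustrate what parallelizing over the iterator means mechanically, here is a small sketch using rayon's par_bridge directly (the maybe_par_bridge used further down is assumed to be a thin wrapper over it that stays sequential when parallelism is disabled):

use rayon::iter::{ParallelBridge, ParallelIterator};

// Each item of a sequential (but Send) iterator gets dispatched to the rayon
// thread pool, so parallelism comes from the number of items rather than
// from splitting any single string.
fn count_words<I>(sequences: I) -> usize
where
    I: Iterator<Item = String> + Send,
{
    sequences
        .par_bridge()
        .map(|s| s.split_whitespace().count())
        .sum()
}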

  • Adds a new field to the Tokenizer struct that does not really belong there
    (it belongs to the trainer and could be different from the word counts,
    but this state is temporary).
  • In the case of datasets we still see a slowdown, because we have to
    recreate a single string chunk from a list of strings that gets split
    again later.

These are not necessary with the solution below. datasets can probably directly feed a NumPy array, without the need for any sort of pre-processing.

{train|train_and_replace}_{from_files|from_buffer} is starting to get a bit silly.

With #519, we remove train_and_replace and keep only train, which will train the attached Model in place. That makes everything easier.

So in order to achieve this, I think we can have something along the lines of:

pub trait Trainer {
    [...]

    // No need to give anything else than the Model to train
    fn train(&self, model: &mut Self::Model) -> Result<Vec<AddedToken>>;

    // We feed the `Trainer` with an iterator, and give it a way to pre-process each sequence
    fn feed<I, S, F>(&mut self, iterator: I, process: F) -> Result<()>
    where
        I: Iterator<Item = S> + Send,
        S: AsRef<str> + Send,
        F: Fn(&str) -> Result<Vec<String>> + Sync;
}

// Example of impl to handle word counts. We can factorize the inside to easily
// re-use it in the various Trainers for now.
fn feed<I, S, F>(&mut self, iterator: I, process: F) -> Result<()>
where
    I: Iterator<Item = S> + Send,
    S: AsRef<str> + Send,
    F: Fn(&str) -> Result<Vec<String>> + Sync,
{
    let words: Result<HashMap<String, u32>> = iterator
        .maybe_par_bridge()
        .map(|sequence| {
            let words = process(sequence.as_ref())?;
            let mut map = HashMap::new();
            for word in words {
                map.entry(word).and_modify(|c| *c += 1).or_insert(1);
            }
            Ok(map)
        })
        .reduce(
            || Ok(HashMap::new()),
            |acc, ws| {
                let mut acc = acc?;
                for (k, v) in ws? {
                    acc.entry(k).and_modify(|c| *c += v).or_insert(v);
                }
                Ok(acc)
            },
        );

    self.words = words?;
    Ok(())
}

// On the Tokenizer, the default train now deals with iterators directly
pub fn train<T, I, S>(&mut self, trainer: &mut T, sequences: I) -> Result<&mut Self>
where
    T: Trainer<Model = M> + Sync,
    I: Iterator<Item = S> + Send,
    S: AsRef<str> + Send,
{
    let (lower, upper) = sequences.size_hint();
    let len = upper.unwrap_or(lower) as u64;
    let progress = if trainer.should_show_progress() {
        let progress = ProgressBar::new(len);
        progress.set_style(
            ProgressStyle::default_bar()
                .template("[{elapsed_precise}] {msg:<40!} {wide_bar} {pos:<9!}/{len:>9!}"),
        );
        progress.set_message("Pre-processing sequences");
        progress.set_draw_delta(len / 100); // Redraw only every 2%
        Some(progress)
    } else {
        None
    };

    trainer.feed(
        sequences.map(|s| {
            if let Some(progress) = &progress {
                progress.inc(1)
            }
            s
        }),
        |seq| {
            let normalized = self.do_normalize(seq.as_ref())?;
            let pre_tokenized = self.do_pre_tokenize(normalized)?;
            Ok(pre_tokenized
                .get_splits(OffsetReferential::Original, OffsetType::Byte)
                .into_iter()
                .map(|(s, _, _)| s.to_owned())
                .collect())
        },
    )?;
    if let Some(pbar) = progress {
        pbar.finish();
    }

    // let words = self.word_count(trainer, files)?;
    let special_tokens = trainer.train(&mut self.model)?;
    self.add_special_tokens(&special_tokens);

    Ok(self)
}
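
As a usage sketch (not part of the code above, just an illustration assuming a BpeTrainer attached to a BPE model), training from an in-memory list of sentences would then look like:

let mut trainer = BpeTrainer::default();
let data = vec![
    "the first training sentence".to_string(),
    "the second training sentence".to_string(),
];
// Any Send iterator of string-like items works, no file involved.
tokenizer.train(&mut trainer, data.into_iter())?;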

// And we can have a version that trains on files
pub fn train_from_files<T>(&mut self, trainer: &mut T, files: Vec<String>) -> Result<&mut Self>
where
    T: Trainer<Model = M> + Sync,
{
    let max_read = 1_000_000;
    use crate::utils::iter::ResultShunt;
    ResultShunt::process(
        files.into_iter().flat_map(|filename| {
            match File::open(filename) {
                Ok(file) => {
                    let file = BufReader::with_capacity(max_read, file);
                    // We read new lines using this API instead of the Lines Iterator
                    // on purpose. We want to keep the `\n` and potential `\r` between each line.
                    // We use an iterator to be able to chain with par_bridge.
                    itertools::Either::Left(file.lines_with_ending())
                }
                Err(e) => itertools::Either::Right(std::iter::once(Err(e))),
            }
        }),
        |iter| self.train(trainer, iter).map(|_| {}),
    )??;
    Ok(self)
}
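
And the file-based variant keeps the previous calling convention (same sketch assumptions as above; the file path is only an example):

tokenizer.train_from_files(&mut trainer, vec!["wiki.train.raw".to_string()])?;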

@Narsil (Collaborator, Author) commented Nov 11, 2020

Nice! I've been thinking about your comments and experimented a bit, and I think I might have something that avoids almost all of the negative points stated above, without complicating things too much, and that even simplifies some other parts.

I like your proposal; my main concerns are:

  1. How feasible/simple it is to implement this Send iterator. I had in mind that it would need to be Sync. Cloning the strings is OK, since we need to convert them to Rust strings anyway, so only my second point remains.
  2. The accumulated cost of converting a native string to a Rust string for every sentence might be bigger than we want (to be checked).

Other than that, it's great. And with the inversion of control proposed in #519, I think most of the problems will be gone.

I'm not sure I understand exactly how you would feed types specific to each Trainer without being too specific about each of them in the Tokenizer. I think it should stay the responsibility of each Trainer to process the data, and as such, having them process a bunch of &str seems like the most straightforward way to do it to me.

It is the tokenizer's responsibility to process the data (pre_tokenize and normalize). But yes, it's the trainer's responsibility to store the struct it needs to do the actual training, and that's what I was referring to with that sentence.

Puts the burden on the caller to feed chunks large enough that parallelization actually achieves something.

I think the way to avoid this is to parallelize over the iterator instead of within each element. We could absolutely be training over an array of small sentences, and I even feel like that will be the case most of the time. Plus, we can expect to get at least as many elements in the iterator as there are threads in the pool, while we can't really expect each string to be long enough and to include some \n.

I don't think we can make a Python iterator Send; it's stateful by nature.
The problem with that solution is that for every language we need to find a way to turn a natural iterator into a Send iterator in Rust. That's why moving large String chunks around felt more natural to me (we don't have to create that Send structure for each language). We also have to take into account the cost of converting a binding-language string into a Rust string: the more we have to do that, the slower it gets (we probably need to check whether this is a real bottleneck or not).

With #519, we remove train_and_replace and keep only train, which will train the attached Model in place. That makes everything easier.

Yup that's nice.

@n1t0 (Member) commented Nov 12, 2020

I don't think we can make a Python iterator Send; it's stateful by nature.
The problem with that solution is that for every language we need to find a way to turn a natural iterator into a Send iterator in Rust. That's why moving large String chunks around felt more natural to me (we don't have to create that Send structure for each language).

We can convert and stream everything in Python by doing something like this:

#[args(trainer = "None")]
fn train_from_iterator(
    &mut self,
    trainer: Option<&mut PyTrainer>,
    iterator: &PyAny,
) -> PyResult<()> {
    let mut trainer =
        trainer.map_or_else(|| self.tokenizer.get_model().get_trainer(), |t| t.clone());
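    // The bounded channel acts as a bridge: the Python iterator is consumed on
    // this thread (it is not Send), while the scoped training thread below
    // reads from `recv.into_iter()`, which is a Send iterator.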
    let (send, recv) = std::sync::mpsc::sync_channel(256);
    let mut sender = Some(send);
    let iterator: PyIterator = iterator.iter()?;

    crossbeam::thread::scope(|s| {
        let _train_handle = s.spawn(|_| {
            self.tokenizer
                .train(&mut trainer, recv.into_iter())
                .map(|_| {})
        });

        ResultShunt::process(iterator.map(|seq| seq?.extract::<&str>()), |iter| {
            if let Some(send) = sender.take() {
                for seq in iter {
                    send.send(seq)
                        .map_err(|e| exceptions::PyException::new_err(e.to_string()))?;
                }
            }
            Ok(())
        })?
    })
    .unwrap()
}

The only problem I see with this is that it might get tricky to make it work with custom Python components, but it does not seem impossible. It would incur some bad GIL locking contention, but that's already the case wherever we use custom Python components at the moment, and it's something we can improve in many places.

We also have to take into account the cost of converting a binding-language string into a Rust string: the more we have to do that, the slower it gets (we probably need to check whether this is a real bottleneck or not).

I don't think there's anything we can do to avoid converting to a type that Rust can understand here, unfortunately. When we can cast to &str it's great, but otherwise I don't think we have another choice. That being said, even when using String I don't think it would be a bottleneck: reading a 500MB file into a String takes a few milliseconds, while processing it takes orders of magnitude more.

@marcglobality

I'm very interested in this feature. What's the state of this? :)

@n1t0 (Member) commented Nov 28, 2020

Superseded by #544

@n1t0 closed this Nov 28, 2020