nlp.rehearse does not work #13041

apparatAndrii · 2023-10-04T12:19:46Z

I am trying to do dynamic model training. I need to retrain my model whenever, for example, employees add new data to it. Basically, my app extracts data from documents; if something is wrong, people check what went wrong and the model is updated.

I have three solutions in mind:

1. Keep one JSON file or database with examples, always add new examples to it, and retrain the whole model from scratch every time (I guess this is the worst solution).
2. Create a new NER pipeline every time I train the model (but I think the model will run slower, because it is possible to end up with 100-200 new pipelines).
3. The best solution I have found: use pseudo-rehearsal.

Here is my code that retrains the model:

import random

from spacy.training import Example

# `model` is an already loaded spaCy pipeline; `data` is a list of dicts
# with a "text" string and a list of "annotations" (start, end, label).
optimizer = model.resume_training()

for itn in range(1000):
    random.shuffle(data)
    losses = {}
    for item in data:
        doc = model.make_doc(item['text'])
        ents = []
        for annotation in item['annotations']:
            start = annotation.get('start')
            end = annotation.get('end')
            label = annotation.get('label')
            if start is not None and end is not None and label is not None:
                # char_span returns None if the offsets do not align with token boundaries
                span = doc.char_span(start, end, label=label)
                if span is not None:
                    ents.append(span)
        doc.ents = ents
        example = Example.from_dict(doc, {"entities": ents})
        model.rehearse([example], sgd=optimizer, losses=losses)

If I use model.rehearse, my model does not update at all, although the code runs without errors.

When I use model.update instead, everything works, but then I run into the problem called "catastrophic forgetting".

Am I doing something wrong, or can this feature not do what I need? Thank you!

Answered by adrianeboyd

adrianeboyd · 2023-10-06T09:08:56Z

Even though it's been available in spaCy for a long time, the rehearse feature is still experimental and it's not something we've been actively working on recently, so it's definitely possible to run into bugs. I remember that the basics were all updated not too long ago in #10347 with some extended tests, which may be a good starting place for understanding how it's intended to work: https://github.com/explosion/spaCy/blob/be29216fe2451adead7f56ccd1db494fd8549dae/spacy/tests/training/test_rehearse.py

However, this type of pseudo-rehearsal is intended for cases where you don't have access to the original training data. Since it sounds like you do still have access to all the training data, I think it would be easier to avoid catastrophic forgetting by always updating the model with a suitable mixture of old and new training instances.

In general, my guess is that you'd see the best performance by retraining from scratch on all the data, but if this is too costly, then updating using a subset of the data (with some suitable sampling method) should help you avoid catastrophic forgetting.
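
To make the "mixture of old and new training instances" idea concrete, here is a minimal sketch of one possible sampling method. It is only an illustration under assumptions: the helper name `mix_old_and_new` and the `old_per_new` ratio are hypothetical and not part of spaCy, and simple uniform random sampling stands in for whatever "suitable sampling method" fits your data.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def mix_old_and_new(
    old_items: Sequence[T],
    new_items: Sequence[T],
    old_per_new: float = 3.0,
    rng: Optional[random.Random] = None,
) -> List[T]:
    """Return all new items plus a random sample of old ones, shuffled.

    `old_per_new` is a hypothetical knob (not a spaCy parameter): how many
    old items to replay for each new item.
    """
    rng = rng or random.Random()
    # Never ask for more old items than are available.
    n_old = min(len(old_items), round(len(new_items) * old_per_new))
    mixed = list(new_items) + rng.sample(list(old_items), n_old)
    rng.shuffle(mixed)
    return mixed


# Placeholder data in the same shape as the question's `data` list.
old_data = [{"text": f"old document {i}", "annotations": []} for i in range(100)]
new_data = [{"text": f"corrected document {i}", "annotations": []} for i in range(5)]

batch = mix_old_and_new(old_data, new_data, old_per_new=3.0, rng=random.Random(0))
print(len(batch))  # 5 new + 15 sampled old = 20
```

Each item in the mixed list could then be turned into an Example as in the question's loop and passed to model.update rather than model.rehearse. The right ratio of old to new data is something to tune against a held-out evaluation set.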
