Recall and F1-score are pretty low (~0.65) for unseen instances of a custom entity on a transfer-learned model #9283
-
I have a dataset annotated with a custom entity. Each data point is a long text (not a single sentence), possibly containing multiple entities. The corpus size is around 1200 texts, divided into train, validation, and test sets as follows:
I'm using transfer learning with the pretrained en_core_web_sm model. When I train the model, precision, recall, and F1-score all reach 1.0 for seen instances of the entity in the validation set, but recall is very poor on unseen instances. When predictions are made on the test set, the model performs poorly overall, and especially on unseen instances (~0.55 recall and ~0.68 F1-score). Are there any recommendations to improve the performance of the model, especially on unseen instances?
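For reference, the training setup is roughly along these lines (a simplified sketch, not my exact code; the training data and the label name are placeholders):

```python
import random
import spacy
from spacy.training import Example

# Placeholder data: (text, {"entities": [(start_char, end_char, label)]})
TRAIN_DATA = [
    ("Acme Corp shipped the order on Monday.", {"entities": [(0, 9, "MY_ENTITY")]}),
]

# Start from the pretrained pipeline and fine-tune only its NER component.
nlp = spacy.load("en_core_web_sm")
ner = nlp.get_pipe("ner")
ner.add_label("MY_ENTITY")  # register the custom label (placeholder name)

# Freeze everything except NER during the update loop.
other_pipes = [p for p in nlp.pipe_names if p != "ner"]
with nlp.select_pipes(disable=other_pipes):
    optimizer = nlp.resume_training()
    for epoch in range(10):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            example = Example.from_dict(nlp.make_doc(text), annotations)
            nlp.update([example], sgd=optimizer, losses=losses)
        print(epoch, losses)
```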
-
Can you give a specific example of what kind of entities you're trying to recognize?

Typical solutions for poor generalization are to get more training data or to use data augmentation to make your model more robust. If your model is failing to generalize, it's usually because it doesn't have enough data to find patterns.

Your train, validation, and test sets shouldn't have much overlap - a little is OK, but the whole point of separate datasets is to measure performance on data the model hasn't seen. The fact that the model can perfectly recall instances it saw during training isn't informative.

Also, depending on what you're trying to recognize, an F1 of ~0.70 isn't that bad. There are lots of NER applications where that would be low, but maybe you have a hard case. Hard to say without more details.
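As an example of augmentation, one simple technique is entity substitution: swap each annotated mention for another plausible surface form so the model can't just memorize the training strings. A rough sketch of that idea (this is a hand-rolled approach, not an official spaCy API; the replacement pool and example data are made up):

```python
import random

# Made-up pool of alternative surface forms for the entity type.
ENTITY_POOL = ["Globex", "Initech", "Umbrella Corp"]

def augment(text, entities, n_variants=3):
    """Replace each annotated span with a random alternative surface form,
    shifting the character offsets of later spans to stay consistent."""
    variants = []
    for _ in range(n_variants):
        new_text, new_ents, shift = text, [], 0
        for start, end, label in sorted(entities):
            repl = random.choice(ENTITY_POOL)
            s, e = start + shift, end + shift
            new_text = new_text[:s] + repl + new_text[e:]
            new_ents.append((s, s + len(repl), label))
            shift += len(repl) - (end - start)
        variants.append((new_text, {"entities": new_ents}))
    return variants

print(augment("Acme Corp shipped the order.", [(0, 9, "MY_ENTITY")]))
```

Each variant keeps its character offsets aligned with the rewritten text, so the output can be fed straight into the same training format as the rest of your data.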