Are Span Categorizer predictions repeatable? #9294
-
While gathering some performance metrics, I discovered that my Span Categorizer model randomly reports DIFFERENT predictions for the identical document request. At a rate of about 5 to 1, the model returns either 4 or 7 predictions, with scores either:

Note that the repeated scores are almost identical; there are just two different sets of them.

I understand that training, with all its randomization and statistical nature, will yield (slightly) different models for each training run. This is using a ['tok2vec', 'ner'] pipeline, with the model trained on some 6550 documents averaging about 6 'annotated spans' each, and resulting scores around 0.86. The threshold is at the default (0.5). The reported 'spans' are what I want to see - it's just that, more often than not, some spans 'get lost'.

Is there an explanation for the 'random' results (or do I need to look for some bug)?
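For what it's worth, this is the kind of repeatability check I mean - just a sketch, where the model path "./spancat_model" and the text are placeholders rather than my exact setup ("sc" is spancat's default spans_key):

```python
import spacy

# Load the trained pipeline once and run the same text through it repeatedly.
# "./spancat_model" is a placeholder path; "sc" is spancat's default spans_key.
nlp = spacy.load("./spancat_model")
text = "FOR A VALUABLE CONSIDERATION, receipt of which Is hereby acknowledged, ..."

runs = []
for _ in range(20):
    doc = nlp(text)
    runs.append(sorted((span.start, span.end, span.label_) for span in doc.spans["sc"]))

# If the model itself were nondeterministic, more than one distinct set would show up.
distinct = {tuple(run) for run in runs}
print(f"{len(distinct)} distinct prediction set(s) over {len(runs)} runs")
```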
-
Hi, could you please refrain from using ALL-CAPS words? They come across as shouting/rude. Thanks!
-
We are aware that there can be very small variations in the weights/predictions when training on GPU, but we haven't recently seen any reports of predictions that vary for the exact same model. I've tried to reproduce the behaviour you're seeing, but I couldn't - I'm always getting the exact same score predictions. If you can provide us with example data & code & config & training command to reproduce the issue, we can look into it further.
-
Update: it looks like I was incorrect in 'blaming' the spancat.

So the problem is not spancat-specific. It is probably something related to my data/domain. I am looking for two labels, NAME_FROM and NAME_TO, in text like (reduced to the relevant portion): FOR A VALUABLE CONSIDERATION, receipt of which Is hereby acknowledged, . And the returned entities are either:

With my limited knowledge of how NER/spancat work, it seems like some very subtle/ambiguous weight is switching the prediction results. I know my domain hinges on Semantic Role Labeling (SRL); in practice, however, the prefix/suffix text is pretty static, often just 'captions'. In this particular case, some predicted entities (names) will be the same for both NAME_FROM and NAME_TO. This never happens in my training data - I explicitly reject such documents - but in 'live production' it will happen.

To sum it up, I am not sure where to go with this. Perhaps a remedy would be to use some SRL-derived additional attributes.
-
I must admit I found an error on my side. So a big apology for wasting your time (and the spaCy team's attention) - the culprit is me (my data generation).
-
I was badly bitten by Python's 'socketserver'. I am sending my data to a Python 'prediction server' over TCP sockets, and I was only checking the data I am _sending_. Only when I started checking the data I am _receiving_ did I discover that, in this particular case, socket.recv(1024*1024) was quietly and randomly dropping half of my ~4800 bytes.
The code was faithfully copied from the Python examples. I had to change it so that it does not call socket.recv with more than a 4k buffer (I found some warning hints when I researched the problem).
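For reference, socket.recv() is allowed to return fewer bytes than requested, so a more robust pattern is to frame each message and loop on recv() until the full payload has arrived. A minimal sketch - the 4-byte length prefix and the helper names are illustrative, not my actual server code:

```python
import socket
import struct


def recv_exact(sock: socket.socket, nbytes: int) -> bytes:
    """Read exactly nbytes from the socket. A single recv() call may
    return fewer bytes than requested, so keep reading until done."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = sock.recv(min(remaining, 65536))
        if not chunk:
            # Peer closed the connection before the full message arrived.
            raise ConnectionError("socket closed before the full message arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def send_message(sock: socket.socket, payload: bytes) -> None:
    """Send a message framed with a 4-byte big-endian length prefix."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_message(sock: socket.socket) -> bytes:
    """Receive one length-prefixed message sent by send_message()."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)
```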
As a predominantly Java coder, I am more used to things throwing exceptions or crashing the VM… than quietly cheating.
Learning never ends.
-
I must admit I found an error on my side.
The reason both models 'behave nondeterministically' is that my data is not deterministic. At first sight, yes, the submitted data is the same, but on closer inspection it turned out not to be the case.
So a big apology for wasting your time (and the spaCy team's attention) - the culprit is me (my data generation).
Thanks for the patience.