Are Span Categorizer predictions repeatable? #9294
-
While gathering some performance metrics, I discovered that my Span Categorizer model randomly reports DIFFERENT predictions for the identical document request. At a rate of about 5 to 1, the model returns either 4 or 7 predictions, with scores either:

Note that the repeated scores are almost identical; there are just two different sets of them.

I understand that training, with all its randomization and statistical nature, will yield (slightly) different models for each training run. This is using a ['tok2vec', 'ner'] pipeline, with the model trained on some 6550 documents averaging about 6 'annotated spans' each, and resulting scores around 0.86. The threshold is at the default (0.5). The reported 'spans' are what I want to see - it's just that, more often than not, some spans 'get lost'.

Is there an explanation for the 'random' results (or do I need to look for some bug)?
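For what it's worth, this is the kind of repeatability check I mean - just a sketch, where the model path "./spancat_model" and the text are placeholders rather than my exact setup ("sc" is spancat's default spans_key):

```python
import spacy

# Load the trained pipeline once and run the same text through it repeatedly.
# "./spancat_model" is a placeholder path; "sc" is spancat's default spans_key.
nlp = spacy.load("./spancat_model")
text = "FOR A VALUABLE CONSIDERATION, receipt of which Is hereby acknowledged, ..."

runs = []
for _ in range(20):
    doc = nlp(text)
    runs.append(sorted((span.start, span.end, span.label_) for span in doc.spans["sc"]))

# If the model itself were nondeterministic, more than one distinct set would show up.
distinct = {tuple(run) for run in runs}
print(f"{len(distinct)} distinct prediction set(s) over {len(runs)} runs")
```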
-
Hi, could you please refrain from using ALL-CAPS words? They come across as shouting/rude. Thanks!
-
We are aware that there can be very small variations in the weights/predictions when training on GPU, but we haven't recently seen any reports of predictions that vary for the exact same model. I've tried to reproduce the behaviour you're seeing, but I couldn't - I'm always getting the exact same score predictions. If you can provide us with example data & code & config & training command to reproduce the issue, we can look into it further.
-
Update: it looks like I was incorrect in 'blaming' the spancat.

So the problem is not spancat-specific. It is probably something related to my data/domain. I am looking for two labels, NAME_FROM and NAME_TO, in text like (reduced to the relevant portion): FOR A VALUABLE CONSIDERATION, receipt of which Is hereby acknowledged, . And the returned entities are either:

With my limited knowledge of how NER/spancat work, it seems like some very subtle/ambiguous weight is switching the prediction results. I know my domain hinges on Semantic Role Labeling (SRL); in practice, however, the prefix/suffix text is pretty static, often just 'captions'. In this particular case, some predicted entities (names) will be the same for both NAME_FROM and NAME_TO. This never happens in my training data - I explicitly reject such documents - but in 'live production' it will happen.

To sum it up, I am not sure where to go with this. Perhaps a remedy would be to use some SRL-derived additional attributes.
-
I must admit I found an error on my side. So a big apology for wasting your time (and the spaCy team's attention) - the culprit is me (my data generation).
-
I was badly bitten by Python's 'socketserver'. I am sending my data to a Python 'prediction server' over TCP sockets, and I was only checking the data I am _sending_. Only when I started checking the data I am _receiving_ did I discover that, in this particular case, socket.recv(1024*1024) was quietly and randomly dropping half of my ~4800 bytes.
The code was faithfully copied from the Python examples. I had to change it so that it does not call socket.recv with more than a 4k buffer (I found some warning hints when I researched the problem).
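For reference, socket.recv() is allowed to return fewer bytes than requested, so a more robust pattern is to frame each message and loop on recv() until the full payload has arrived. A minimal sketch - the 4-byte length prefix and the helper names are illustrative, not my actual server code:

```python
import socket
import struct


def recv_exact(sock: socket.socket, nbytes: int) -> bytes:
    """Read exactly nbytes from the socket. A single recv() call may
    return fewer bytes than requested, so keep reading until done."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = sock.recv(min(remaining, 65536))
        if not chunk:
            # Peer closed the connection before the full message arrived.
            raise ConnectionError("socket closed before the full message arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def send_message(sock: socket.socket, payload: bytes) -> None:
    """Send a message framed with a 4-byte big-endian length prefix."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_message(sock: socket.socket) -> bytes:
    """Receive one length-prefixed message sent by send_message()."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)
```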
As a predominantly Java coder, I am more used to things throwing exceptions or crashing the VM… than quietly cheating.
Learning never ends.
-
I must admit I found an error on my side.
The reason both models 'behave nondeterministically' is that my data is not deterministic. At first sight, yes, the submitted data is the same, but on closer inspection it turned out not to be the case.
So a big apology for wasting your time (and the spaCy team's attention) - the culprit is me (my data generation).
Thanks for the patience.