Relation extraction component - assertion error raised #12755
Issue
While using the relation extraction component, I regularly run into an AssertionError raised by Thinc while training the component on data annotated with Prodigy.
Code and configuration
The code has been updated following a thread on the Prodigy discussion forum. The updated code and configuration files are available in this repository. To keep the data confidential, I've deleted the assets, data and training folders and erased my labels in SYMM_LABEL and DIRECTED_LABELS in the scripts/parse_data_generic.py file.
CLI
Data command output:
Training command output (raising the error):
Occasionally, on other samples of training data, I also encounter the following error while running the data command:
Data sample
For confidentiality reasons, I can't share the training data. It was annotated with Prodigy. Here is a small example showing the data format returned by Prodigy:
It seems like the data is not read correctly. The file is named annotations.jsonl and placed in the assets folder. It is created with Prodigy's db-out command. If you need any additional information, please let me know.
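One way to narrow down whether the file itself is the problem is to check every line of annotations.jsonl before running the data command. The following is only a minimal sketch using the Python standard library. It assumes each record carries `text`, `tokens`, `spans` and `relations` keys, that spans use inclusive `token_start`/`token_end` indices, and that relations reference tokens through `head` and `child`; the helper names `check_record` and `check_lines` are hypothetical and not part of the project scripts.

```python
import json

# Keys assumed for a Prodigy relations export; adjust to your recipe's output.
REQUIRED_KEYS = ("text", "tokens", "spans", "relations")


def _valid_index(value, n_tokens):
    """True if value is an integer token index inside the document."""
    return isinstance(value, int) and 0 <= value < n_tokens


def check_record(record):
    """Return a list of human-readable problems found in one annotation record."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in record]
    if problems:
        return problems
    n_tokens = len(record["tokens"])
    for span in record["spans"]:
        start, end = span.get("token_start"), span.get("token_end")
        in_range = _valid_index(start, n_tokens) and _valid_index(end, n_tokens)
        if not in_range or start > end:
            problems.append(f"span out of range: {span}")
    for rel in record["relations"]:
        for side in ("head", "child"):
            if not _valid_index(rel.get(side), n_tokens):
                problems.append(f"relation {side} out of range: {rel}")
    return problems


def check_lines(lines):
    """Check an iterable of JSONL lines; return {line_number: [problems]}."""
    report = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            report[number] = [f"invalid JSON: {error}"]
            continue
        problems = check_record(record)
        if problems:
            report[number] = problems
    return report
```

To use it, open the export with `open("assets/annotations.jsonl", encoding="utf8")` and pass the file object to `check_lines`. An empty dictionary means no line tripped these particular checks, which does not by itself guarantee that the parsing script accepts the data.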
Replies: 2 comments 4 replies
Hi Stella, From your config file, I understand that you're training the NER model and the REL model at the same time, adding the NER to Yet, we're getting this warning:
This warning points to the fact that the relation extractor doesn't have any instances to train on. This most likely happens because at that point in time, the NER wasn't yet trained sufficiently to actually recognize the entities from the gold REL data, and thus the REL couldn't continue onwards.
For the REL to be able to work, we need to make sure that the NER actually works by itself. Have you tried training the NER in isolation (i.e. with no relations)? If you have an NER model that works sufficiently well, we can use it as a frozen model in the next phase when training the REL. I think this will result in a more stable approach.
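To make the reasoning above concrete: a gold relation can only become a training instance if both of its entities were actually recognized by the NER. The sketch below is a plain-Python illustration of that check, not the tutorial component's code. The function name `relation_coverage` and the representation of spans as `(start, end)` token offsets are assumptions made for this example.

```python
def relation_coverage(gold_relations, predicted_entities):
    """Split gold relations by whether both argument spans were predicted.

    gold_relations: iterable of (head_span, child_span, label) tuples,
        where each span is a (start, end) pair of token offsets.
    predicted_entities: iterable of (start, end) spans found by the NER.
    Returns (usable, lost): relations the REL could train on, and relations
    dropped because at least one of their entities was not predicted.
    """
    predicted = set(predicted_entities)
    usable, lost = [], []
    for head, child, label in gold_relations:
        # A relation is only usable if the NER found both of its arguments.
        target = usable if head in predicted and child in predicted else lost
        target.append((head, child, label))
    return usable, lost
```

If `usable` comes back empty for most documents early in training, the REL has nothing to learn from at that point, which is consistent with the warning discussed above.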
Hi Stella, Thanks, that's all very useful.
Totally - I agree. I wanted to check whether you already have more data annotated right now, to verify whether the errors still occur if you'd use all of the data available.
That's interesting. In the case of the "very few data inputs", what do you mean by "works perfectly", while you're also hypothesizing that the training might be skipped? Do you mean that the training runs without error, but the score is really just all 0?
Right, that's good to know. To me, this again points to the fact that the REL breaks down when the NER model is being trained and is not yet stable.
I want to clarify once more that, as stated in the video tutorial, the REL project serves as a tutorial on how to implement a custom spaCy component with a custom Thinc model. It is definitely not meant to serve as some kind of stable REL component. If we do develop an actual proper REL component in the future, we would include it in spaCy's core code base, not in the "tutorials" section of our example projects. I'm saying this to clarify that our main motivation here is not to develop a stable REL component. It's to help you get going with your specific use-case and to give you pointers on how to implement your own custom NLP solution with spaCy.
The approach you've tried so far could have worked, but at this point I suggest a more phased approach that lets us tackle your challenges one by one. You've already demonstrated that the NER model training works, so that would be the first step: train an NER model from your data, and store the model to disk (let's say it's in a directory In the next step then, we'll train the REL model only, while keeping the
What this effectively does, is that a new pipeline is composed that grabs the trained The Finally, we use the Once this pipeline is trained (🤞) and saved to disk, to let's say
and run it on text, at which point the pipeline will create predictions for both the NER and the REL components.