Spancat, Textcat pipeline model training assistance #11663
Hi dear community, I have an already trained textcat model (with a few pipes) and would like to improve it by adding one pipe. I am predicting whether the questions of a questionnaire will elicit personal information (such as name, social security number, phone, etc.). Some questions, such as "name", may be answered by corporations as well, in which case the answer is not confidential. That is why I expect that marking key roots such as "first name", "ssn", and others as PERSONAL, versus "corporation", "ein", and similar as CORP, may help the text categorization. I am not sure how to put this together, so I ask here for assistance on how to train my two new pipelines:
The current pipeline is: Let's focus on pipeline 1 for now. I am thinking of proceeding as follows. Please, if someone can comment on the reasoning for step 4:
I would like your orientation on whether these steps make sense, and any feedback about improvements or adjustments. Would adding a NER pipeline follow the same steps as the spancat one? Many thanks in advance for all your guidance, Paola.

Config for spancat pipeline
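(The poster's actual config was not reproduced here; for orientation only, a minimal sketch of what a spancat pipeline config with its own tok2vec can look like in spaCy v3. All names, keys, and values below are illustrative assumptions, not the poster's settings.)

```ini
[nlp]
lang = "en"
pipeline = ["tok2vec", "spancat"]

# A standalone embedding component for the spancat pipe.
[components.tok2vec]
factory = "tok2vec"

[components.spancat]
factory = "spancat"
# Predicted spans are stored under doc.spans["sc"] by default.
spans_key = "sc"

# The spancat model does not embed tokens itself; it listens
# to the tok2vec component above.
[components.spancat.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
```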
First, to explain why you need `replace_listeners`: the "sharing embeddings" section of the docs may be helpful for understanding why you need to replace listeners.

When you train a model with a tok2vec, they learn and change together. You can think of them like interlocking puzzle pieces. But since the tok2vec changes in the process of training, if there are any components that used it before, they no longer fit together because the shape changed. So a component always needs to be used with the tok2vec it was trained with.

The listeners pattern is used so that multiple components can be trained with one tok2vec at the same time, so that they all fit together. This is faster and takes up less memory than having one tok2vec per component. However, this means that the component (like textcat or spancat) by itself doesn't have the tok2vec it needs; it just has a hole where the tok2vec fits - that's what the listener is. (In the implementation, a component has a listener layer where the tok2vec output plugs in.) When you call `replace_listeners`, the trained tok2vec is copied into the component, so it becomes self-contained.

About your general idea of using spancat (or NER) annotations to help with textcat: first, just adding the annotations doesn't have any influence on textcat at all. The textcat component can't see spancat/NER annotations directly. You can get some influence without using a custom component by training both pipelines with a tok2vec together. If you use separate tok2vecs like in the pipeline you outlined, that's easier to set up, but it means the components will be completely separate. However, even if you do use a shared tok2vec, it's unlikely to make much of a difference, and what research there is on using NER features for document classification suggests it isn't very effective. For more details see #10470.
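To make the shared-embeddings point above concrete, here is a sketch of a spaCy v3 config in which both spancat and textcat listen to one tok2vec, so they are trained together and fit the same embeddings. The specific widths and defaults are illustrative assumptions:

```ini
[nlp]
pipeline = ["tok2vec", "spancat", "textcat"]

# One shared embedding component for the whole pipeline.
[components.tok2vec]
factory = "tok2vec"

# Each downstream component has a listener (the "hole" where
# the shared tok2vec fits) instead of its own embedding layer.
[components.spancat.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}

[components.textcat.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
```

After training, a component can be cut loose from the shared tok2vec with `nlp.replace_listeners("tok2vec", "textcat", ["model.tok2vec"])`, which copies the trained tok2vec into the component so it stays paired with the exact weights it was trained with, for example when sourcing that component into another pipeline.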