Spancat, Textcat pipeline model training assistance #11663
Hi dear community, I have an already trained textcat model (with a few pipes) and would like to improve it by adding one pipe. I am predicting whether the questions of a questionnaire will elicit personal information (such as name, social security number, phone, etc.). Some questions, such as "name", may be answered by corporations as well, in which case the answer is not confidential. That is why I expect that marking key roots such as "first name", "ssn", and others as PERSONAL, versus "corporation", "ein", and similar as CORP, may help the text categorization. I am not sure how to put this together, so I ask here for assistance on how to train my two new pipelines:
The current pipeline is: Let's focus on pipeline 1 for now. I am thinking of proceeding as follows. Please, if someone can comment on the reasoning for step 4:
I would like your orientation on whether these steps make sense, and any feedback about improvements or adjustments. Would adding a NER pipeline follow the same steps as the spancat one? Many thanks in advance for all your guidance, Paola.

Config for spancat pipeline
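(The poster's actual config was not reproduced here; for orientation only, a minimal sketch of what a spancat pipeline config with its own tok2vec can look like in spaCy v3. All names, keys, and values below are illustrative assumptions, not the poster's settings.)

```ini
[nlp]
lang = "en"
pipeline = ["tok2vec", "spancat"]

# A standalone embedding component for the spancat pipe.
[components.tok2vec]
factory = "tok2vec"

[components.spancat]
factory = "spancat"
# Predicted spans are stored under doc.spans["sc"] by default.
spans_key = "sc"

# The spancat model does not embed tokens itself; it listens
# to the tok2vec component above.
[components.spancat.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
```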
First, to explain why you need `replace_listeners`: the "sharing embeddings" section of the docs may be helpful for understanding why you need to replace listeners.

When you train a model with a tok2vec, they learn and change together. You can think of them like interlocking puzzle pieces. But since the tok2vec changes in the process of training, if there are any components that used it before, they no longer fit together because the shape changed. So a component always needs to be used with the tok2vec it was trained with.

The listeners pattern is used so that multiple components can be trained with one tok2vec at the same time, so that they all fit together. This is faster and takes up less memory than having one tok2vec per component. However, this means that the component (like textcat or spancat) by itself doesn't have the tok2vec it needs; it just has a hole where the tok2vec fits - that's what the listener is. (In the implementation, a component has a listener layer where the tok2vec output plugs in.) When you call `replace_listeners`, the trained tok2vec is copied into the component, so it becomes self-contained.

About your general idea of using spancat (or NER) annotations to help with textcat: first, just adding the annotations doesn't have any influence on textcat at all. The textcat component can't see spancat/NER annotations directly. You can get some influence without using a custom component by training both pipelines with a tok2vec together. If you use separate tok2vecs like in the pipeline you outlined, that's easier to set up, but it means the components will be completely separate. However, even if you do use a shared tok2vec, it's unlikely to make much of a difference, and what research there is on using NER features for document classification suggests it isn't very effective. For more details see #10470.
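To make the shared-embeddings point above concrete, here is a sketch of a spaCy v3 config in which both spancat and textcat listen to one tok2vec, so they are trained together and fit the same embeddings. The specific widths and defaults are illustrative assumptions:

```ini
[nlp]
pipeline = ["tok2vec", "spancat", "textcat"]

# One shared embedding component for the whole pipeline.
[components.tok2vec]
factory = "tok2vec"

# Each downstream component has a listener (the "hole" where
# the shared tok2vec fits) instead of its own embedding layer.
[components.spancat.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}

[components.textcat.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
```

After training, a component can be cut loose from the shared tok2vec with `nlp.replace_listeners("tok2vec", "textcat", ["model.tok2vec"])`, which copies the trained tok2vec into the component so it stays paired with the exact weights it was trained with, for example when sourcing that component into another pipeline.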