POS Tagging is Broken for Sliced Pipelines #13225

lordsoffallen · 2024-01-06T23:35:39Z


Hey everyone,

I'm trying to lemmatize a text that I cleaned earlier. Runtime was an issue, so I decided to cut certain components out of the pipeline, since I only wanted lemmas. When I enabled only the lemmatizer I got some warnings, but I also wanted to filter on POS tags such as ['ADJ', 'NOUN', 'VERB', 'ADV']. To generate the .pos_ attribute, I enabled the components the documentation said were needed for it, namely the tagger and parser. However, using only those doesn't really work: I am not getting the expected POS tags. When I use the full pipeline I get the expected results, but not when I enable only certain components. Is this behaviour expected? If so, why? How do I know which components to exclude? I am a bit confused now.

Thanks in advance!

How to reproduce the behaviour

Here is the code sample that doesn't work:

import spacy

nlp = spacy.load('en_core_web_sm', enable=['lemmatizer', 'tagger', 'parser', 'attribute_ruler'])

text = """
If you like the taste of Sweet Low get this If you don t don t Couldn t get through one cup of coffee 
I m gonna give Stevia Extract in the Raw a try It s made by the folks at Sugar in the Raw Here s 
what they claim Stevia Extract In The Raw gets its delicious natural sweetness from Rebiana an 
extract from the Stevia plant This extract is the sweetest part of the plant and has recently
been isolated to provide pure sweetening power without the licorice like aftertaste that many 
of our predecessors exhibited All you get is the sweet flavor without any calories 
We ll see Simply Stevia is simply nasty
"""

print([t.pos_ for t in nlp(text)])

The one that works:

nlp = spacy.load('en_core_web_sm')
print([t.pos_ for t in nlp(text)])

Your Environment

spaCy version: 3.7.2
Platform: Linux-5.15.133+-x86_64-with-glibc2.31
Python version: 3.10.12
Pipelines: en_core_web_sm (3.7.1), en_core_web_lg (3.7.1)

Answered by svlandeg

svlandeg (Maintainer) · 2024-01-08T15:40:31Z

Hi!

Sorry that this has been confusing. What you need is to also make sure the tok2vec component is enabled:

nlp = spacy.load('en_core_web_sm', enable=['tok2vec', 'lemmatizer', 'tagger', "parser", "attribute_ruler"])

If you look at the en_core_web_sm package that's installed in your venv, you can open the config.cfg and find something like this:

[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode:width}
upstream = "tok2vec"

What this means is that the tagger model uses the tok2vec component in the pipeline: it "listens" to it to obtain word embeddings. The parser does, too. So you should make sure to enable them together.

You can read more about the shared tok2vec layer in our docs here: https://spacy.io/usage/embeddings-transformers#embedding-layers


lordsoffallen (Author) · 2024-01-09T16:00:14Z

Hey, thanks for your answer. It wasn't clear in the docs which component has a dependency on what. Perhaps a better dependency visualization or certain error messages would help? If the tagger needs tok2vec and the user didn't enable it, it should throw an error, no? I ended up using pos_tag, as it ran faster than running the whole spaCy pipeline. When I check the docs, I also don't see any info on the tok2vec dependency. I couldn't have figured this out if it weren't for your answer.

Finally, I strongly believe that showing in the docs how these components interact with or use each other would be really nice. Moreover, I believe the code should throw an error in my case instead of providing nonsensical results.
