MPEmbeddings result differs from sentence_transformers #14494
I'm trying to Sparkify some code that previously used sentence_transformers, but the vectors I get back from MPNetEmbeddings differ from the ones I get back from sentence_transformers. Why might this be? The sentence_transformers code:
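(The original snippet wasn't preserved here; the following is a minimal sketch, assuming the all-mpnet-base-v2 model and a single example sentence rather than the poster's actual code.)

```python
# Minimal sketch, not the original code: assumes the all-mpnet-base-v2 model
# from sentence-transformers and a single example sentence.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")
vectors = model.encode(["boring things"])
print(vectors[0][:5])  # first few dimensions of the sentence embedding
```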
and the Spark-NLP code:
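(Again a reconstruction rather than the original snippet, assuming the all_mpnet_base_v2 pretrained model and the Python API.)

```python
# Minimal sketch, not the original code: assumes the all_mpnet_base_v2
# pretrained model and a one-row DataFrame with a "text" column.
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import MPNetEmbeddings
from pyspark.ml import Pipeline

spark = sparknlp.start()

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = MPNetEmbeddings.pretrained("all_mpnet_base_v2", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline(stages=[document_assembler, embeddings])

data = spark.createDataFrame([["boring things"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(embeddings.embeddings) as embedding").show(truncate=False)
```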
I would expect these to produce the same values because they use the same model, but they don't. What am I missing?
Replies: 1 comment 9 replies
From running bits of the two implementations in REPLs, it looks like Spark-NLP's MPNetEmbeddings tokenizer doesn't add the special start- and end-of-sequence tokens. Passing the string "boring things" to each tokenizer implementation shows the difference: note the missing 0 (`<s>`) and 2 (`</s>`) in the Spark tokens.
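The token dumps didn't survive here, but the Hugging Face side of the comparison can be reproduced with something like this (illustrative, not the poster's exact snippet):

```python
# Illustrative check: the Hugging Face tokenizer wraps the input with the
# special tokens <s> (id 0) and </s> (id 2), which Spark-NLP appears to omit.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")
ids = tokenizer("boring things")["input_ids"]
print(ids)                                   # begins with 0 and ends with 2
print(tokenizer.convert_ids_to_tokens(ids))  # <s> ... </s> around the word pieces
```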
This behaviour was introduced in Spark-NLP 5.5.2 by this commit. I've confirmed this by downgrading to 5.5.1, after which the embedding values match those produced by SentenceTransformers.