MPEmbeddings result differs from sentence_transformers #14494
I'm trying to Sparkify some code that previously used sentence_transformers, but the vectors I get back from MPNetEmbeddings differ from the ones I get back from sentence_transformers. Why might this be? The sentence_transformers code:
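(The original snippet wasn't preserved here; the following is a minimal sketch, assuming the all-mpnet-base-v2 model and a single example sentence rather than the poster's actual code.)

```python
# Minimal sketch, not the original code: assumes the all-mpnet-base-v2 model
# from sentence-transformers and a single example sentence.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")
vectors = model.encode(["boring things"])
print(vectors[0][:5])  # first few dimensions of the sentence embedding
```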
and the Spark-NLP code:
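(Again a reconstruction rather than the original snippet, assuming the all_mpnet_base_v2 pretrained model and the Python API.)

```python
# Minimal sketch, not the original code: assumes the all_mpnet_base_v2
# pretrained model and a one-row DataFrame with a "text" column.
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import MPNetEmbeddings
from pyspark.ml import Pipeline

spark = sparknlp.start()

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = MPNetEmbeddings.pretrained("all_mpnet_base_v2", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline(stages=[document_assembler, embeddings])

data = spark.createDataFrame([["boring things"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(embeddings.embeddings) as embedding").show(truncate=False)
```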
I would expect these to produce the same values because they use the same model, but they don't. What am I missing?
Replies: 1 comment 9 replies
From running bits of the two implementations in REPLs, it looks like Spark-NLP's MPNetEmbeddings tokenizer doesn't add the special start- and end-of-sequence tokens. Passing the string "boring things" to each tokenizer implementation shows the difference: note the missing 0 (`<s>`) and 2 (`</s>`) in the Spark tokens.
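The token dumps didn't survive here, but the Hugging Face side of the comparison can be reproduced with something like this (illustrative, not the poster's exact snippet):

```python
# Illustrative check: the Hugging Face tokenizer wraps the input with the
# special tokens <s> (id 0) and </s> (id 2), which Spark-NLP appears to omit.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")
ids = tokenizer("boring things")["input_ids"]
print(ids)                                   # begins with 0 and ends with 2
print(tokenizer.convert_ids_to_tokens(ids))  # <s> ... </s> around the word pieces
```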
This behaviour was introduced in Spark-NLP 5.5.2 by this commit. I've confirmed this by downgrading to 5.5.1, after which the embedding values match those produced by SentenceTransformers.