[FEATURE]Binary vector support with Lucene #1857

Closed
heemin32 opened this issue Jul 18, 2024 · 7 comments · Fixed by #2292

@heemin32 (Collaborator)

Similar to #1764, I would like to see binary vector support with the Lucene engine.

@jed326 (Contributor) commented Nov 12, 2024

I will pick this one up, thanks!

jed326 self-assigned this on Nov 12, 2024
@jed326 (Contributor) commented Nov 27, 2024

Issue

The main issue here is that the Lucene VectorSimilarityFunction (1) does not define a HAMMING distance calculation and (2) is implemented as a Java enum, so it is not possible to extend it with our own compare functions.

It looks like this issue has been discussed in a few places in Lucene.

Ultimately what Lucene has done, at least for now, is not to introduce HAMMING to VectorSimilarityFunction but instead to make it easier to extend the FlatVectorsFormat class to provide a custom scorer, and with that scorer we can do whatever score computations we would like.
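For reference, the computation such a custom scorer needs is just a popcount of the XOR of the packed bytes. A minimal sketch follows; the class and method names are illustrative, and the 1 / (1 + distance) mapping is an assumed convention for turning the distance into a higher-is-better score, not something Lucene prescribes:

```java
// Sketch of the hamming computation a custom scorer would perform over vectors
// packed into byte[] (8 binary dimensions per byte). Names are illustrative,
// not an existing Lucene API.
final class HammingDistance {

    static int distance(byte[] a, byte[] b) {
        int distance = 0;
        for (int i = 0; i < a.length; i++) {
            // XOR the packed bytes and count the differing bits.
            distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
        }
        return distance;
    }

    // Map the distance to a higher-is-better score; 1 / (1 + distance) is an
    // assumed convention here, chosen only for illustration.
    static float score(byte[] a, byte[] b) {
        return 1f / (1f + distance(a, b));
    }
}
```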

Changes Needed

Mapper-related changes:

  • Update LuceneHNSWMethod to support the binary data type and the hamming space
  • Since we won't be taking the scoring method from the VectorSimilarityFunction, we need to double-check the validations so that nothing other than hamming is supported (see the validation sketch after this list).
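A minimal sketch of what that validation could look like; this is an illustration only, with stand-in enums rather than the plugin's actual VectorDataType and SpaceType classes, and the real check would live alongside LuceneHNSWMethod's existing validation:

```java
// Stand-in enums for illustration; the plugin has its own VectorDataType and
// SpaceType with more values than shown here.
enum VectorDataType { FLOAT, BYTE, BINARY }
enum SpaceType { L2, INNER_PRODUCT, COSINESIMIL, HAMMING }

final class LuceneBinaryValidation {
    // With a custom scorer, hamming is the only space the binary type supports
    // on the Lucene engine, so reject every other combination up front.
    static void validate(VectorDataType dataType, SpaceType spaceType) {
        if (dataType == VectorDataType.BINARY && spaceType != SpaceType.HAMMING) {
            throw new IllegalArgumentException(
                "Lucene engine only supports the hamming space for binary vectors");
        }
    }
}
```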

VectorDataType changes:

  • For the BINARY data type we can use the KnnByteVectorField and divide the dimension by 8 accordingly, so an 8-dimensional binary vector is stored as a 1-dimensional byte vector. Conceptually, we are already doing this in other places (see the packing sketch below).
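As an illustration of that packing, here is a minimal sketch; the class and helper names are hypothetical, and the KnnByteVectorField usage assumes a Lucene 9.x dependency:

```java
import org.apache.lucene.document.KnnByteVectorField;

// Sketch: pack a d-dimensional binary vector (0/1 values) into d/8 bytes,
// so e.g. a 16-dimensional binary vector becomes a 2-dimensional byte vector.
final class BinaryVectorPacking {

    static byte[] packBits(int[] bits) {
        byte[] packed = new byte[bits.length / 8];
        for (int i = 0; i < bits.length; i++) {
            if (bits[i] != 0) {
                packed[i / 8] |= (byte) (1 << (7 - (i % 8)));
            }
        }
        return packed;
    }

    // Hypothetical usage: index the packed bytes with Lucene's KnnByteVectorField.
    static KnnByteVectorField toField(String fieldName, int[] bits) {
        return new KnnByteVectorField(fieldName, packBits(bits));
    }
}
```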

PerFieldKnnVectorsFormat changes:

  • In order to calculate the hamming distance we need to implement a custom FlatVectorsScorer, which means a custom KnnVectorsFormat is also needed. So it would look something like BinaryVectorsFormat → Lucene99FlatVectorsFormat → BinaryFlatVectorsScorer (see the per-field wiring sketch after this list).
  • Lucene has some sample implementations of this in HnswBitVectorsFormat and FlatBitVectorsScorer, but since these are not in the backwards-compatible packages we probably shouldn't use them yet and should instead implement our own classes.
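For the per-field wiring referenced above, here is a sketch of a PerFieldKnnVectorsFormat subclass that routes binary fields to the custom format and everything else to the default Lucene99 HNSW format; the class name, the constructor injection, and the field-name check are illustrative stand-ins for the plugin's mapping-driven logic:

```java
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat;
import org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat;

// Sketch only: route binary vector fields to the custom format backed by the
// hamming scorer, and every other field to the default Lucene99 HNSW format.
final class KnnPerFieldFormat extends PerFieldKnnVectorsFormat {

    private final KnnVectorsFormat defaultFormat = new Lucene99HnswVectorsFormat();
    private final KnnVectorsFormat binaryFormat;

    // binaryFormat would be the BinaryVectorsFormat described above (hypothetical class).
    KnnPerFieldFormat(KnnVectorsFormat binaryFormat) {
        this.binaryFormat = binaryFormat;
    }

    @Override
    public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
        // In the plugin this decision would come from the field mapping
        // (data_type: binary, space_type: hamming); a naming rule stands in here.
        return field.endsWith("_binary") ? binaryFormat : defaultFormat;
    }
}
```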

@heemin32 @navneet1v @jmazanec15 do these changes sound reasonable to you? Let me know if you think I missed any big parts too.

@jed326 (Contributor) commented Nov 27, 2024

I think there's potentially some more discussion to be had with Lucene as well. It seems that now there are 2 ways to provide distance calculations:

  1. The existing VectorSimilarityFunction, which is actually encoded into the segment file itself
  2. Via a custom scorer through a custom KnnVectorsFormat.

This seems confusing because this means the format can completely ignore what is encoded into its own file, so at the very least I think we can probably start a discussion on how the future of VectorSimilarityFunction looks and how Lucene plans on either deprecating it or enhancing it.

@navneet1v (Collaborator)

@jed326 Thanks for all the details.
From a changes standpoint I am more or less aligned that we need all those changes. I just have a question on this:

> Lucene has some sample implementations of this in HnswBitVectorsFormat and FlatBitVectorsScorer, but since these are not in the backwards-compatible packages we probably shouldn't use them yet and should instead implement our own classes.

If a codec is not present in the backwards-compatible codecs package, that doesn't mean it shouldn't be used. Actually, it's the other way around.

Nevertheless, as per this comment: apache/lucene#13288 (comment), Lucene doesn't officially support bit vectors, so we have to create our own KnnVectorsFormat and scorer. But we can take some reference from Lucene's BitVectorsFormat.

Another reason for implementing it ourselves is that even if we make changes in Lucene, Lucene has already moved to 10.x and OpenSearch currently has no plans to upgrade to Lucene 10. In light of that, I think we should implement our own format.

> I think there's potentially some more discussion to be had with Lucene as well. It seems that now there are 2 ways to provide distance calculations:
>
>   1. The existing VectorSimilarityFunction, which is actually encoded into the segment file itself
>   2. Via a custom scorer through a custom KnnVectorsFormat.
>
> This seems confusing because this means the format can completely ignore what is encoded into its own file, so at the very least I think we can probably start a discussion on how the future of VectorSimilarityFunction looks and how Lucene plans on either deprecating it or enhancing it.

On this I completely agree that we should start a discussion with Lucene and understand the long-term plan/direction.

@jed326 (Contributor) commented Nov 27, 2024

Thanks @navneet1v. I opened a discussion on Lucene for this: apache/lucene#14025, please take a look when you get a chance!

github-project-automation bot moved this from Backlog (Hot) to ✅ Done in Vector Search RoadMap on Dec 14, 2024
@owenhalpert (Contributor) commented Jan 17, 2025

Benchmarking Results:

|                             | FP32           | Binary         |
|-----------------------------|----------------|----------------|
| Data node configuration     | 8 × r5.4xlarge | 2 × r5.2xlarge |
| # of CPUs                   | 128 vCPU       | 16 vCPU        |
| Indexing throughput (mean)  | 11805 docs/sec | 6428 docs/sec  |
| Query latency p90           | 8.02 ms        | 11.04 ms       |
| Query latency p99           | 8.79 ms        | 12.98 ms       |
| Query latency p99.9         | 13.57 ms       | 20.87 ms       |
| Storage for index           | 1751.8 GB      | 46.5 GB        |

The benchmarking was conducted on two clusters that were identical aside from their data nodes: the FP32 cluster used 8 r5.4xlarge data nodes, while the binary cluster used 2 r5.2xlarge data nodes.

In line with the Faiss results, the Lucene binary vector implementation delivered comparable query latencies while reducing storage by about 97% (1751.8 GB → 46.5 GB).

@jed326 (Contributor) commented Jan 17, 2025

Thanks @owenhalpert for gathering the benchmarking results!
