[FEATURE]Binary vector support with Lucene #1857

Closed
heemin32 opened this issue Jul 18, 2024 · 7 comments · Fixed by #2292

@heemin32 (Collaborator)

Similar to #1764, I would like to see binary vector support with the Lucene engine.

@jed326 (Contributor) commented Nov 12, 2024

I will pick this one up, thanks!

jed326 self-assigned this on Nov 12, 2024
@jed326 (Contributor) commented Nov 27, 2024

Issue

The main issue here is that the Lucene VectorSimilarityFunction (1) does not define a HAMMING distance calculation and (2) is implemented as a Java enum, so it is not possible to extend it with our own compare functions.

It looks like this issue has been discussed in a few places in Lucene.

Ultimately what Lucene has done, at least for now, is not to introduce HAMMING to VectorSimilarityFunction but instead to make it easier to extend the FlatVectorsFormat class to provide a custom scorer, and with that scorer we can do whatever score computations we would like.
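For reference, the computation such a custom scorer needs is just a popcount of the XOR of the packed bytes. A minimal sketch follows; the class and method names are illustrative, and the 1 / (1 + distance) mapping is an assumed convention for turning the distance into a higher-is-better score, not something Lucene prescribes:

```java
// Sketch of the hamming computation a custom scorer would perform over vectors
// packed into byte[] (8 binary dimensions per byte). Names are illustrative,
// not an existing Lucene API.
final class HammingDistance {

    static int distance(byte[] a, byte[] b) {
        int distance = 0;
        for (int i = 0; i < a.length; i++) {
            // XOR the packed bytes and count the differing bits.
            distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
        }
        return distance;
    }

    // Map the distance to a higher-is-better score; 1 / (1 + distance) is an
    // assumed convention here, chosen only for illustration.
    static float score(byte[] a, byte[] b) {
        return 1f / (1f + distance(a, b));
    }
}
```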

Changes Needed

Mapper-related changes:

  • Update LuceneHNSWMethod to support the binary data type and the hamming space
  • Since we won't be taking the scoring method from the VectorSimilarityFunction, we need to double-check the validations so that nothing other than hamming is supported (see the validation sketch after this list).
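A minimal sketch of what that validation could look like; this is an illustration only, with stand-in enums rather than the plugin's actual VectorDataType and SpaceType classes, and the real check would live alongside LuceneHNSWMethod's existing validation:

```java
// Stand-in enums for illustration; the plugin has its own VectorDataType and
// SpaceType with more values than shown here.
enum VectorDataType { FLOAT, BYTE, BINARY }
enum SpaceType { L2, INNER_PRODUCT, COSINESIMIL, HAMMING }

final class LuceneBinaryValidation {
    // With a custom scorer, hamming is the only space the binary type supports
    // on the Lucene engine, so reject every other combination up front.
    static void validate(VectorDataType dataType, SpaceType spaceType) {
        if (dataType == VectorDataType.BINARY && spaceType != SpaceType.HAMMING) {
            throw new IllegalArgumentException(
                "Lucene engine only supports the hamming space for binary vectors");
        }
    }
}
```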

VectorDataType changes:

  • For the BINARY data type we can use the KnnByteVectorField and divide the dimension by 8 accordingly, so an 8-dimensional binary vector is stored as a 1-dimensional byte vector. Conceptually, we are already doing this in other places (see the packing sketch below).
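As an illustration of that packing, here is a minimal sketch; the class and helper names are hypothetical, and the KnnByteVectorField usage assumes a Lucene 9.x dependency:

```java
import org.apache.lucene.document.KnnByteVectorField;

// Sketch: pack a d-dimensional binary vector (0/1 values) into d/8 bytes,
// so e.g. a 16-dimensional binary vector becomes a 2-dimensional byte vector.
final class BinaryVectorPacking {

    static byte[] packBits(int[] bits) {
        byte[] packed = new byte[bits.length / 8];
        for (int i = 0; i < bits.length; i++) {
            if (bits[i] != 0) {
                packed[i / 8] |= (byte) (1 << (7 - (i % 8)));
            }
        }
        return packed;
    }

    // Hypothetical usage: index the packed bytes with Lucene's KnnByteVectorField.
    static KnnByteVectorField toField(String fieldName, int[] bits) {
        return new KnnByteVectorField(fieldName, packBits(bits));
    }
}
```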

PerFieldKnnVectorsFormat changes:

  • In order to calculate the hamming distance we need to implement a custom FlatVectorsScorer, which means a custom KnnVectorsFormat is also needed. So it would look something like BinaryVectorsFormat → Lucene99FlatVectorsFormat → BinaryFlatVectorsScorer (see the per-field wiring sketch after this list).
  • Lucene has some sample implementations of this in HnswBitVectorsFormat and FlatBitVectorsScorer, but since these are not in the backwards-compatible packages we probably shouldn't use them yet and should instead implement our own classes.
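For the per-field wiring referenced above, here is a sketch of a PerFieldKnnVectorsFormat subclass that routes binary fields to the custom format and everything else to the default Lucene99 HNSW format; the class name, the constructor injection, and the field-name check are illustrative stand-ins for the plugin's mapping-driven logic:

```java
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat;
import org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat;

// Sketch only: route binary vector fields to the custom format backed by the
// hamming scorer, and every other field to the default Lucene99 HNSW format.
final class KnnPerFieldFormat extends PerFieldKnnVectorsFormat {

    private final KnnVectorsFormat defaultFormat = new Lucene99HnswVectorsFormat();
    private final KnnVectorsFormat binaryFormat;

    // binaryFormat would be the BinaryVectorsFormat described above (hypothetical class).
    KnnPerFieldFormat(KnnVectorsFormat binaryFormat) {
        this.binaryFormat = binaryFormat;
    }

    @Override
    public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
        // In the plugin this decision would come from the field mapping
        // (data_type: binary, space_type: hamming); a naming rule stands in here.
        return field.endsWith("_binary") ? binaryFormat : defaultFormat;
    }
}
```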

@heemin32 @navneet1v @jmazanec15 do these changes sound reasonable to you? Let me know if you think I missed any big parts too.

@jed326 (Contributor) commented Nov 27, 2024

I think there's potentially some more discussion to be had with Lucene as well. It seems that now there are 2 ways to provide distance calculations:

  1. The existing VectorSimilarityFunction, which is actually encoded into the segment file itself
  2. Via a custom scorer through a custom KnnVectorsFormat.

This seems confusing because this means the format can completely ignore what is encoded into its own file, so at the very least I think we can probably start a discussion on how the future of VectorSimilarityFunction looks and how Lucene plans on either deprecating it or enhancing it.

@navneet1v (Collaborator)

@jed326 Thanks for all the details.
From a changes standpoint I am more or less aligned that we need all those changes. I just have a question on this:

> Lucene has some sample implementations of this in HnswBitVectorsFormat and FlatBitVectorsScorer, but since these are not in the backwards-compatible packages we probably shouldn't use them yet and should instead implement our own classes.

If a codec is not present in the backwards-compatible codecs package, that doesn't mean it shouldn't be used. Actually, it's the other way around.

Nevertheless, as per this comment: apache/lucene#13288 (comment), Lucene doesn't officially support bit vectors, so we have to create our own KnnVectorsFormat and scorer. But we can take some reference from Lucene's BitVectorsFormat.

Another reason for implementing it ourselves is that even if we make changes in Lucene, Lucene has already moved to 10.x and OpenSearch currently has no plans to upgrade to Lucene 10. In light of that, I think we should implement our own format.

> I think there's potentially some more discussion to be had with Lucene as well. It seems that now there are 2 ways to provide distance calculations:
>
>   1. The existing VectorSimilarityFunction, which is actually encoded into the segment file itself
>   2. Via a custom scorer through a custom KnnVectorsFormat.
>
> This seems confusing because this means the format can completely ignore what is encoded into its own file, so at the very least I think we can probably start a discussion on how the future of VectorSimilarityFunction looks and how Lucene plans on either deprecating it or enhancing it.

On this I completely agree that we should start a discussion with Lucene and understand the long-term plan/direction.

@jed326 (Contributor) commented Nov 27, 2024

Thanks @navneet1v. I opened a discussion on Lucene for this: apache/lucene#14025, please take a look when you get a chance!

github-project-automation bot moved this from Backlog (Hot) to ✅ Done in Vector Search RoadMap on Dec 14, 2024
@owenhalpert (Contributor) commented Jan 17, 2025

Benchmarking Results:

|                             | FP32           | Binary         |
|-----------------------------|----------------|----------------|
| Data node configuration     | 8 × r5.4xlarge | 2 × r5.2xlarge |
| # of CPUs                   | 128 vCPU       | 16 vCPU        |
| Indexing throughput (mean)  | 11805 docs/sec | 6428 docs/sec  |
| Query latency p90           | 8.02 ms        | 11.04 ms       |
| Query latency p99           | 8.79 ms        | 12.98 ms       |
| Query latency p99.9         | 13.57 ms       | 20.87 ms       |
| Storage for index           | 1751.8 GB      | 46.5 GB        |

The benchmarking was conducted on two clusters that were identical aside from their data nodes: the FP32 cluster used 8 r5.4xlarge data nodes, while the binary cluster used 2 r5.2xlarge data nodes.

In line with the Faiss results, the Lucene binary vector implementation delivered comparable query latencies while reducing storage by about 97% (1751.8 GB → 46.5 GB).

@jed326 (Contributor) commented Jan 17, 2025

Thanks @owenhalpert for gathering the benchmarking results!
