Remove vector values copy() methods, moving IndexInput.clone() and temp storage into lower-level interfaces #13872

Closed
wants to merge 27 commits into from
0b76ac3
refactor float vector values random access
Oct 2, 2024
1c2977f
refactor byte vector values random access
Oct 6, 2024
debac32
make sure KnnVectorValues.iterator() always returns a new value
Oct 6, 2024
273e8ed
fix cloning/sharing of vector scorer resources
Oct 7, 2024
ce70f4c
renaming
Oct 7, 2024
b4febca
tidy
Oct 7, 2024
2fca27c
more renaming
Oct 7, 2024
2e51380
EMPTY
Oct 7, 2024
3b8d70f
CHANGES and MIGRATE entries
Oct 7, 2024
f5e0260
a little more renaming
Oct 7, 2024
9c68a6e
mopping up some more values->vectors
Oct 7, 2024
f035183
fix javadoc
Oct 7, 2024
23c7497
fix error introduced in refactoring (init lastSubIndex to -1 instead …
Oct 7, 2024
63a4d83
Add BaseKnnVectorsFormatTestCase.testRecall() and fix map ord to doc …
Oct 14, 2024
2099589
Add BaseKnnVectorsFormatTestCase.testRecall() and fix map ord to doc …
Oct 14, 2024
6141900
handle stray prints
Oct 17, 2024
61a0d79
test all similarities and more queries
Oct 17, 2024
5a6d709
fix Lucene90Hnsw that was aliasing vector values
Oct 17, 2024
bbe4d28
remove stray print
Oct 18, 2024
1a2c3bb
Merge remote-tracking branch 'origin/main' into knn-dictionary
Oct 18, 2024
568372f
fix initialization bug in SlowCompositeCodecReaderWrapper
Oct 18, 2024
da06288
simplifications from PR feedback
Oct 22, 2024
ed233ba
fix off-heap scorer by falling back to on-heap.
ChrisHegarty Oct 29, 2024
5c2cb2d
Merge branch 'main' into knn-dictionary
ChrisHegarty Oct 29, 2024
4284360
fix aliasing of vector scratch in quantized scorer
Nov 7, 2024
f1e0007
update flat vectors scorer to use only two vector dictionaries
ChrisHegarty Nov 8, 2024
ef13bad
reuse Floats and RandomVectorScorers
Nov 8, 2024
3 changes: 3 additions & 0 deletions lucene/CHANGES.txt
@@ -41,6 +41,9 @@ API Changes
* GITHUB#13957: Removed LeafSimScorer class, to save its overhead. Scorers now
compute scores directly from a SimScorer, postings and norms. (Adrien Grand)

* GITHUB#13831: Complete refactoring of random-access vector API, eliminating copy() method. Now random-access vectors
are accessed by calling Byte/FloatVectorValues.vectors().get(int).

New Features
---------------------
(No changes)
4 changes: 4 additions & 0 deletions lucene/MIGRATE.md
@@ -905,3 +905,7 @@ segments are rewritten either via `IndexWriter.forceMerge` or
### Vector values APIs switched to primarily random-access

`{Byte/Float}VectorValues` no longer inherit from `DocIdSetIterator`. Rather they extend a common class, `KnnVectorValues`, that provides a random access API (previously provided by `RandomAccessVectorValues`, now removed), and an `iterator()` method for retrieving `DocIndexIterator`: an iterator which is a DISI that also provides an `index()` method. Therefore, any iteration over vector values must now be performed using the values' `iterator()`. Random access works as before, but does not require casting to `RandomAccessVectorValues`.

## Migration from Lucene 10.0 to Lucene 10.1
Review comment (Contributor):

Should it be at the top? This file has most recent versions at the top.


The refactoring of the random-access vector API begun in 10.0 is completed in 10.1, where the `{Byte/Float}VectorValues.copy()` methods have been removed. It is no longer necessary to copy instances of `KnnVectorValues` in order to obtain unique vector sources that do not share underlying data structures. Instead, random-access vectors are accessed via `{Byte/Float}VectorValues.vectors().get(int)`. Each `Bytes`/`Floats` instance returned from `{Byte/Float}VectorValues.vectors()` encapsulates its own non-shareable storage, so callers that need two vectors at the same time should obtain two instances.
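
As a rough illustration of the accessor pattern described in this entry, the following self-contained sketch models it with hypothetical stand-in types. `Floats` and `FlatVectorValues` here are simplified assumptions written for this example, not the Lucene classes (the real `get(int)` can throw `IOException`, for instance). It shows that each call to `vectors()` yields an accessor with its own scratch buffer, so two accessors can expose two vectors at once.

```java
import java.util.Arrays;

/** Self-contained stand-in for the accessor pattern; these are NOT the Lucene classes. */
public class VectorsAccessorSketch {

  /** Hypothetical analogue of FloatVectorValues.Floats: random access by ordinal. */
  interface Floats {
    float[] get(int ord);
  }

  /** Hypothetical analogue of a FloatVectorValues backed by one flat array. */
  static final class FlatVectorValues {
    private final float[] data;
    private final int dimension;

    FlatVectorValues(float[] data, int dimension) {
      this.data = data;
      this.dimension = dimension;
    }

    int size() {
      return data.length / dimension;
    }

    /** Every call hands out a new accessor that owns its scratch buffer. */
    Floats vectors() {
      float[] scratch = new float[dimension];
      return ord -> {
        System.arraycopy(data, ord * dimension, scratch, 0, dimension);
        return scratch;
      };
    }
  }

  /** True if two accessors can expose two different vectors at the same time. */
  static boolean accessorsAreIndependent() {
    FlatVectorValues values = new FlatVectorValues(new float[] {1f, 2f, 3f, 4f}, 2);
    Floats first = values.vectors();
    Floats second = values.vectors();
    float[] v0 = first.get(0);
    float[] v1 = second.get(1); // does not overwrite v0
    return Arrays.equals(v0, new float[] {1f, 2f}) && Arrays.equals(v1, new float[] {3f, 4f});
  }

  public static void main(String[] args) {
    System.out.println(accessorsAreIndependent());
  }
}
```

Code that previously called `copy()` to get an independent view would, under this model, call `vectors()` once per view it needs and keep the returned accessor for as long as it reads vectors.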
@@ -17,7 +17,6 @@

package org.apache.lucene.analysis.synonym.word2vec;

import java.io.IOException;
import org.apache.lucene.index.FloatVectorValues;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.BytesRefHash;
@@ -44,26 +43,20 @@ public Word2VecModel(int dictionarySize, int vectorDimension) {
this.word2Vec = new BytesRefHash();
}

private Word2VecModel(
int dictionarySize,
int vectorDimension,
TermAndVector[] termsAndVectors,
BytesRefHash word2Vec) {
this.dictionarySize = dictionarySize;
this.vectorDimension = vectorDimension;
this.termsAndVectors = termsAndVectors;
this.word2Vec = word2Vec;
}

public void addTermAndVector(TermAndVector modelEntry) {
modelEntry = modelEntry.normalizeVector();
this.termsAndVectors[loadedCount++] = modelEntry;
this.word2Vec.add(modelEntry.term());
}

@Override
public float[] vectorValue(int targetOrd) {
return termsAndVectors[targetOrd].vector();
public Floats vectors() {
return new Floats() {
@Override
public float[] get(int targetOrd) {
return termsAndVectors[targetOrd].vector();
}
};
}

public float[] vectorValue(BytesRef term) {
@@ -86,10 +79,4 @@ public int dimension() {
public int size() {
return dictionarySize;
}

@Override
public Word2VecModel copy() throws IOException {
return new Word2VecModel(
this.dictionarySize, this.vectorDimension, this.termsAndVectors, this.word2Vec);
}
}
@@ -49,7 +49,7 @@ public final class Lucene90HnswGraphBuilder {
private final Lucene90NeighborArray scratch;

private final VectorSimilarityFunction similarityFunction;
private final FloatVectorValues vectorValues;
private final FloatVectorValues.Floats vectors;
private final SplittableRandom random;
private final Lucene90BoundsChecker bound;
final Lucene90OnHeapHnswGraph hnsw;
@@ -58,13 +58,13 @@ public final class Lucene90HnswGraphBuilder {

// we need two sources of vectors in order to perform diversity check comparisons without
// colliding
private final FloatVectorValues buildVectors;
private final FloatVectorValues.Floats buildVectors;

/**
* Reads all the vectors from vector values, builds a graph connecting them by their dense
* ordinals, using the given hyperparameter settings, and returns the resulting graph.
*
* @param vectors the vectors whose relations are represented by the graph - must provide a
* @param vectorValues the vectors whose relations are represented by the graph - must provide a
* different view over those vectors than the one used to add via addGraphNode.
* @param maxConn the number of connections to make when adding a new graph node; roughly speaking
* the graph fanout.
@@ -73,14 +73,14 @@ public final class Lucene90HnswGraphBuilder {
* to ensure repeatable construction.
*/
public Lucene90HnswGraphBuilder(
FloatVectorValues vectors,
FloatVectorValues vectorValues,
VectorSimilarityFunction similarityFunction,
int maxConn,
int beamWidth,
long seed)
throws IOException {
vectorValues = vectors.copy();
buildVectors = vectors.copy();
vectors = vectorValues.vectors();
buildVectors = vectorValues.vectors();
this.similarityFunction = Objects.requireNonNull(similarityFunction);
if (maxConn <= 0) {
throw new IllegalArgumentException("maxConn must be positive");
@@ -101,21 +101,18 @@ public Lucene90HnswGraphBuilder(
* enables efficient retrieval without extra data copying, while avoiding collision of the
* returned values.
*
* @param vectors the vectors for which to build a nearest neighbors graph. Must be an independet
* accessor for the vectors
* @param vectorValues the vectors for which to build a nearest neighbors graph. Must be an
* independent accessor for the vectors
*/
public Lucene90OnHeapHnswGraph build(FloatVectorValues vectors) throws IOException {
if (vectors == vectorValues) {
throw new IllegalArgumentException(
"Vectors to build must be independent of the source of vectors provided to HnswGraphBuilder()");
}
public Lucene90OnHeapHnswGraph build(FloatVectorValues vectorValues) throws IOException {
if (infoStream.isEnabled(HNSW_COMPONENT)) {
infoStream.message(HNSW_COMPONENT, "build graph from " + vectors.size() + " vectors");
infoStream.message(
HNSW_COMPONENT, "build graph from " + vectorValues.size() + " vectors");
}
long start = System.nanoTime(), t = start;
// start at node 1! node 0 is added implicitly, in the constructor
for (int node = 1; node < vectors.size(); node++) {
addGraphNode(vectors.vectorValue(node));
for (int node = 1; node < vectorValues.size(); node++) {
addGraphNode(vectors.get(node));
if (node % 10000 == 0) {
if (infoStream.isEnabled(HNSW_COMPONENT)) {
long now = System.nanoTime();
@@ -147,7 +144,7 @@ void addGraphNode(float[] value) throws IOException {
value,
beamWidth,
beamWidth,
vectorValues,
buildVectors,
similarityFunction,
hnsw,
null,
@@ -200,7 +197,7 @@ private void selectDiverse(Lucene90NeighborArray neighbors, Lucene90NeighborArra
int cNode = candidates.node()[i];
float cScore = candidates.score()[i];
assert cNode < hnsw.size();
if (diversityCheck(vectorValues.vectorValue(cNode), cScore, neighbors, buildVectors)) {
if (diversityCheck(vectors.get(cNode), cScore, neighbors, buildVectors)) {
neighbors.add(cNode, cScore);
}
}
@@ -222,20 +219,19 @@ private void popToScratch(NeighborQueue candidates) {
* @param score the score of the new candidate and node n, to be compared with scores of the
* candidate and n's neighbors
* @param neighbors the neighbors selected so far
* @param vectorValues source of values used for making comparisons between candidate and existing
* neighbors
* @param vectors used for making comparisons between candidate and existing neighbors
* @return whether the candidate is diverse given the existing neighbors
*/
private boolean diversityCheck(
float[] candidate,
float score,
Lucene90NeighborArray neighbors,
FloatVectorValues vectorValues)
FloatVectorValues.Floats vectors)
throws IOException {
bound.set(score);
for (int i = 0; i < neighbors.size(); i++) {
float neighborSimilarity =
similarityFunction.compare(candidate, vectorValues.vectorValue(neighbors.node()[i]));
similarityFunction.compare(candidate, vectors.get(neighbors.node()[i]));
if (bound.check(neighborSimilarity) == false) {
return false;
}
@@ -269,11 +265,10 @@ private int findNonDiverse(Lucene90NeighborArray neighbors) throws IOException {
// them, drop it
int neighborId = neighbors.node()[i];
bound.set(neighbors.score()[i]);
float[] neighborVector = vectorValues.vectorValue(neighborId);
float[] neighborVector = vectors.get(neighborId);
for (int j = maxConn; j > i; j--) {
float neighborSimilarity =
similarityFunction.compare(
neighborVector, buildVectors.vectorValue(neighbors.node()[j]));
similarityFunction.compare(neighborVector, buildVectors.get(neighbors.node()[j]));
if (bound.check(neighborSimilarity) == false) {
// node j is too similar to node i given its score relative to the base node
// replace it with the new node, which is at [maxConn]
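
The builder comment above says two sources of vectors are needed so that diversity-check comparisons do not collide. The sketch below tries to make that concrete with a hypothetical stand-in `Floats` interface and an `accessor` helper invented for this example (neither is the Lucene type). When both operands of a comparison come from one scratch-backed accessor, the second `get` overwrites the first result. With two accessors, each operand keeps its own buffer.

```java
/** Self-contained illustration; Floats here is a hypothetical stand-in, not the Lucene type. */
public class TwoAccessorSketch {

  interface Floats {
    float[] get(int ord);
  }

  /** Builds an accessor that copies the requested vector into its single scratch buffer. */
  static Floats accessor(float[][] stored) {
    float[] scratch = new float[stored[0].length];
    return ord -> {
      System.arraycopy(stored[ord], 0, scratch, 0, scratch.length);
      return scratch;
    };
  }

  static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** One accessor for both operands: the second get() overwrites the first result. */
  static float compareAliased(float[][] stored, int i, int j) {
    Floats only = accessor(stored);
    return dot(only.get(i), only.get(j));
  }

  /** Two accessors: each operand keeps its own buffer for the duration of the comparison. */
  static float compareIndependent(float[][] stored, int i, int j) {
    Floats vectors = accessor(stored);
    Floats buildVectors = accessor(stored);
    return dot(vectors.get(i), buildVectors.get(j));
  }

  public static void main(String[] args) {
    float[][] stored = {{1f, 0f}, {0f, 2f}};
    System.out.println(compareAliased(stored, 0, 1)); // 4.0: vector 1 compared with itself
    System.out.println(compareIndependent(stored, 0, 1)); // 0.0: the intended comparison
  }
}
```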
@@ -249,7 +249,7 @@ public void search(String field, float[] target, KnnCollector knnCollector, Bits
target,
knnCollector.k(),
knnCollector.k(),
vectorValues,
vectorValues.vectors(),
fieldEntry.similarityFunction,
getGraphValues(fieldEntry),
getAcceptOrds(acceptDocs, fieldEntry),
@@ -360,7 +360,6 @@ static class OffHeapFloatVectorValues extends FloatVectorValues {

final int byteSize;
int lastOrd = -1;
final float[] value;
final VectorSimilarityFunction similarityFunction;

OffHeapFloatVectorValues(
@@ -374,7 +373,6 @@ static class OffHeapFloatVectorValues extends FloatVectorValues {
this.similarityFunction = similarityFunction;

byteSize = Float.BYTES * dimension;
value = new float[dimension];
}

@Override
@@ -388,19 +386,21 @@ public int size() {
}

@Override
public OffHeapFloatVectorValues copy() {
return new OffHeapFloatVectorValues(dimension, ordToDoc, similarityFunction, dataIn.clone());
}

@Override
public float[] vectorValue(int targetOrd) throws IOException {
if (lastOrd == targetOrd) {
return value;
}
dataIn.seek((long) targetOrd * byteSize);
dataIn.readFloats(value, 0, value.length);
lastOrd = targetOrd;
return value;
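public Floats vectors() {
IndexInput input = dataIn.clone();
float[] value = new float[dimension];
return new Floats() {
// The cache key must be per accessor: each Floats owns its own value buffer, so a lastOrd
// shared through the enclosing class could return a buffer that was never filled.
private int lastOrd = -1;

@Override
public float[] get(int targetOrd) throws IOException {
if (lastOrd == targetOrd) {
return value;
}
input.seek((long) targetOrd * byteSize);
input.readFloats(value, 0, value.length);
lastOrd = targetOrd;
return value;
}
};
}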

@Override
@@ -418,12 +418,12 @@ public VectorScorer scorer(float[] target) {
if (size() == 0) {
return null;
}
OffHeapFloatVectorValues values = this.copy();
DocIndexIterator iterator = values.iterator();
FloatVectorValues.Floats vectors = vectors();
DocIndexIterator iterator = iterator();
return new VectorScorer() {
@Override
public float score() throws IOException {
return values.similarityFunction.compare(values.vectorValue(iterator.index()), target);
return similarityFunction.compare(vectors.get(iterator.index()), target);
}

@Override
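
A last-ordinal cache like the one in `get(int)` is only valid for the buffer it was recorded against. The following sketch uses hypothetical stand-in types (`Floats`, `Store`, and the `reads` counter are assumptions made for this example, not Lucene code). It keeps the cache key inside each accessor, so a second accessor asking for the same ordinal still fills its own buffer, and repeated reads through the same accessor are skipped.

```java
import java.util.Arrays;

/** Self-contained illustration; the types are hypothetical stand-ins, not the Lucene classes. */
public class CachedAccessorSketch {

  interface Floats {
    float[] get(int ord);
  }

  /** Hypothetical vector store that counts how often a vector is actually read. */
  static final class Store {
    private final float[][] stored;
    int reads = 0;

    Store(float[][] stored) {
      this.stored = stored;
    }

    Floats vectors() {
      float[] scratch = new float[stored[0].length];
      return new Floats() {
        // The cache key lives next to the buffer it describes, one per accessor.
        private int lastOrd = -1;

        @Override
        public float[] get(int ord) {
          if (ord != lastOrd) {
            System.arraycopy(stored[ord], 0, scratch, 0, scratch.length);
            reads++;
            lastOrd = ord;
          }
          return scratch;
        }
      };
    }
  }

  /** Reads ordinal 1 through two accessors, then once more through the first. */
  static boolean cacheIsPerAccessor() {
    Store store = new Store(new float[][] {{1f, 0f}, {0f, 2f}});
    Floats a = store.vectors();
    Floats b = store.vectors();
    a.get(1);
    float[] fromB = b.get(1); // b has its own lastOrd, so it really reads the vector
    a.get(1); // served from a's cache, no extra read
    return Arrays.equals(fromB, new float[] {0f, 2f}) && store.reads == 2;
  }

  public static void main(String[] args) {
    System.out.println(cacheIsPerAccessor());
  }
}
```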
@@ -62,7 +62,7 @@ public final class Lucene90OnHeapHnswGraph extends HnswGraph {
* @param topK the number of nodes to be returned
* @param numSeed the size of the queue maintained while searching, and controls the number of
* random entry points to sample
* @param vectors vector values
* @param vectors vectors to search whose ordinals are in the graph
* @param graphValues the graph values. May represent the entire graph, or a level in a
* hierarchical graph.
* @param acceptOrds {@link Bits} that represents the allowed document ordinals to match, or
@@ -74,7 +74,7 @@ public static NeighborQueue search(
float[] query,
int topK,
int numSeed,
FloatVectorValues vectors,
FloatVectorValues.Floats vectors,
VectorSimilarityFunction similarityFunction,
HnswGraph graphValues,
Bits acceptOrds,
@@ -101,7 +101,7 @@ public static NeighborQueue search(
break;
}
// explore the topK starting points of some random numSeed probes
float score = similarityFunction.compare(query, vectors.vectorValue(entryPoint));
float score = similarityFunction.compare(query, vectors.get(entryPoint));
candidates.add(entryPoint, score);
if (acceptOrds == null || acceptOrds.get(entryPoint)) {
results.add(entryPoint, score);
@@ -137,7 +137,7 @@ public static NeighborQueue search(
break;
}

float friendSimilarity = similarityFunction.compare(query, vectors.vectorValue(friendOrd));
float friendSimilarity = similarityFunction.compare(query, vectors.get(friendOrd));
if (results.size() < numSeed || bound.check(friendSimilarity) == false) {
candidates.add(friendOrd, friendSimilarity);
if (acceptOrds == null || acceptOrds.get(friendOrd)) {
@@ -401,11 +401,9 @@ static class OffHeapFloatVectorValues extends FloatVectorValues {

private final int dimension;
private final int size;
private final int[] ordToDoc;
private final IntUnaryOperator ordToDocOperator;
private final IndexInput dataIn;
private final int byteSize;
private final float[] value;
private final VectorSimilarityFunction similarityFunction;

OffHeapFloatVectorValues(
@@ -416,12 +414,10 @@ static class OffHeapFloatVectorValues extends FloatVectorValues {
IndexInput dataIn) {
this.dimension = dimension;
this.size = size;
this.ordToDoc = ordToDoc;
ordToDocOperator = ordToDoc == null ? IntUnaryOperator.identity() : (ord) -> ordToDoc[ord];
this.dataIn = dataIn;
this.similarityFunction = similarityFunction;
byteSize = Float.BYTES * dimension;
value = new float[dimension];
}

@Override
@@ -435,16 +431,17 @@ public int size() {
}

@Override
public OffHeapFloatVectorValues copy() {
return new OffHeapFloatVectorValues(
dimension, size, ordToDoc, similarityFunction, dataIn.clone());
}

@Override
public float[] vectorValue(int targetOrd) throws IOException {
dataIn.seek((long) targetOrd * byteSize);
dataIn.readFloats(value, 0, value.length);
return value;
public Floats vectors() throws IOException {
IndexInput input = dataIn.clone();
float[] value = new float[dimension];
return new Floats() {
@Override
public float[] get(int targetOrd) throws IOException {
input.seek((long) targetOrd * byteSize);
input.readFloats(value, 0, value.length);
return value;
}
};
}

@Override
@@ -458,16 +455,16 @@ public DocIndexIterator iterator() {
}

@Override
public VectorScorer scorer(float[] target) {
public VectorScorer scorer(float[] target) throws IOException {
if (size == 0) {
return null;
}
OffHeapFloatVectorValues values = this.copy();
DocIndexIterator iterator = values.iterator();
Floats vectors = vectors();
DocIndexIterator iterator = iterator();
return new VectorScorer() {
@Override
public float score() throws IOException {
return values.similarityFunction.compare(values.vectorValue(iterator.index()), target);
return similarityFunction.compare(vectors.get(iterator.index()), target);
}

@Override
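
The `scorer(float[] target)` hunks above replace a full `copy()` of the values object with a private accessor plus an iterator. The sketch below models that shape with hypothetical stand-in types (`Floats`, `OrdCursor`, and `Scorer` are assumptions made for this example, and a dot product stands in for the similarity function). It is meant only to show how the iterator's `index()` drives lookups through the scorer's own accessor.

```java
import java.util.Arrays;

/** Self-contained illustration; all types are hypothetical stand-ins, not the Lucene classes. */
public class ScorerSketch {

  interface Floats {
    float[] get(int ord);
  }

  /** Minimal cursor exposing the current vector ordinal, loosely modeled on DocIndexIterator. */
  static final class OrdCursor {
    private final int size;
    private int index = -1;

    OrdCursor(int size) {
      this.size = size;
    }

    boolean next() {
      return ++index < size;
    }

    int index() {
      return index;
    }
  }

  /** Scores the vector under the cursor against a fixed target using a dot product. */
  static final class Scorer {
    private final Floats vectors;
    private final OrdCursor cursor;
    private final float[] target;

    Scorer(float[][] stored, float[] target) {
      float[] scratch = new float[target.length];
      // The scorer obtains its own accessor instead of copying the whole values object.
      this.vectors = ord -> {
        System.arraycopy(stored[ord], 0, scratch, 0, scratch.length);
        return scratch;
      };
      this.cursor = new OrdCursor(stored.length);
      this.target = target;
    }

    OrdCursor iterator() {
      return cursor;
    }

    float score() {
      float[] v = vectors.get(cursor.index());
      float sum = 0f;
      for (int i = 0; i < v.length; i++) {
        sum += v[i] * target[i];
      }
      return sum;
    }
  }

  /** Advances the scorer's cursor over every ordinal and records each score. */
  static float[] scoreAll(float[][] stored, float[] target) {
    Scorer scorer = new Scorer(stored, target);
    float[] scores = new float[stored.length];
    while (scorer.iterator().next()) {
      scores[scorer.iterator().index()] = scorer.score();
    }
    return scores;
  }

  public static void main(String[] args) {
    float[] scores = scoreAll(new float[][] {{1f, 0f}, {0f, 2f}, {3f, 4f}}, new float[] {1f, 1f});
    System.out.println(Arrays.toString(scores)); // [1.0, 2.0, 7.0]
  }
}
```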