Add: USearch engine #451
Conversation
Thanks for adding this! Yeah, I'm definitely open to other datasets – I believe there are a few issues/PRs with similar questions (which reminds me that I really need to go through those – sorry for being so behind).
As far as I know, ann-benchmarks works with a specific dataset format (i.e., HDF5), so converting is required.
I also find those to be exciting datasets. I have used both of them when comparing the quality of cross-modal "semantic joins" with uni-modal embeddings in one of my recent posts. Converting to HDF5 shouldn't be an issue at all. @erikbern, can you please check the conflict with `main`?
@ashvardanian Please rebase your PR against `main`. Your implementation is supposed to go into `ann_benchmarks/algorithms/`.
With regard to new datasets, it would be great if you could run the benchmark on them and share plots. More diversity is appreciated, also with regard to the observed performance/quality trade-off. Thanks!
@maumueller Sounds good! Let's split the work into two halves: first merge USearch, then the new datasets. As for USearch, I am considering a few options for the `float` section:

```yaml
any:
- base_args: ['@metric']
  constructor: USearch
  disabled: false
  docker_tag: ann-benchmarks-usearch
  module: ann_benchmarks.algorithms.usearch
  name: usearch
  run_groups:
    M-12:
      arg_groups: [{M: 12, efConstruction: 500}]
      args: {}
      query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
    M-16:
      arg_groups: [{M: 16, efConstruction: 500}]
      args: {}
      query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
    M-24:
      arg_groups: [{M: 24, efConstruction: 500}]
      args: {}
      query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
    M-36:
      arg_groups: [{M: 36, efConstruction: 500}]
      args: {}
      query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
    M-4:
      arg_groups: [{M: 4, efConstruction: 500}]
      args: {}
      query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
    M-48:
      arg_groups: [{M: 48, efConstruction: 500}]
      args: {}
      query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
    M-64:
      arg_groups: [{M: 64, efConstruction: 500}]
      args: {}
      query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
    M-8:
      arg_groups: [{M: 8, efConstruction: 500}]
      args: {}
      query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
    M-96:
      arg_groups: [{M: 96, efConstruction: 500}]
      args: {}
      query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
```

The current variant is identical to HNSWlib's. We, however, also support half-precision (`f16`) and 8-bit (`f8`) versions of the same index.
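For reference, the wrapper behind this config might look roughly like the following. This is a minimal sketch assuming the PyPI `usearch` package and the standard ann-benchmarks algorithm interface (`fit` / `set_query_arguments` / `query`); the metric-name mapping and the `dtype` argument are illustrative assumptions, not necessarily the exact code in this PR.

```python
import numpy as np
from usearch.index import Index

class USearch:
    def __init__(self, metric, M, efConstruction, dtype="f32"):
        # ann-benchmarks passes "angular" or "euclidean"; USearch names them "cos" and "l2sq"
        self._metric = {"angular": "cos", "euclidean": "l2sq"}[metric]
        self._connectivity = M            # `M` in the YAML above
        self._expansion_add = efConstruction
        self._dtype = dtype               # hypothetical knob for f32/f16/f8 variants
        self._index = None

    def fit(self, X):
        # Build the HNSW index over the train set; keys are plain row numbers
        self._index = Index(
            ndim=X.shape[1],
            metric=self._metric,
            dtype=self._dtype,
            connectivity=self._connectivity,
            expansion_add=self._expansion_add,
        )
        self._index.add(np.arange(len(X)), np.asarray(X, dtype=np.float32))

    def set_query_arguments(self, ef):
        # The inner lists in `query_args` map to the search-time expansion factor
        self._index.expansion_search = ef

    def query(self, v, n):
        return self._index.search(np.asarray(v, dtype=np.float32), n).keys
```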
If you want to run them through the benchmark, they should probably be exposed through a parameter in the config. However, note that we run a rather strict time limit of 2 hours for building + querying. Afterwards, the container will just be killed, and you might not get the runs carried out that gave you the best performance.
@ashvardanian What is the status of this PR?
Happy to merge this if you want to rebase!
Hey, @erikbern and @maumueller! As of right now, I'm in the active development phase of USearch v3 and UCall v1, hoping to finish the first by the end of the month. Having USearch in the benchmark would make the most sense after that release. That said, even the current implementation of HNSW looks highly competitive, especially after 50 million vectors, but it seems to make more sense for the big-ann-benchmarks.
Hi @ashvardanian! This looks interesting, but it seems indeed more suitable for the big-ann-benchmarks.
@maumueller, I was also thinking about the big-ann-benchmarks. As for datasets, the Arxiv Titles and Abstracts encoded with E5, suggested by @kimihailv, are indeed a great option! I've recently used them to build a scientific RAG, so they are pretty representative of the embeddings actually used in production. I guess we can use the same HDF5 layout as the existing datasets:

```python
>>> f = h5py.File('/Users/av/Downloads/glove-25-angular.hdf5', 'r')
>>> f.keys()
<KeysViewHDF5 ['distances', 'neighbors', 'test', 'train']>
>>> f["train"].shape
(1183514, 25)
>>> f["distances"].shape
(10000, 100)
```
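For new datasets, producing that layout is straightforward. Below is a minimal, hedged sketch with `h5py` and NumPy, assuming angular distance and brute-force ground truth; the function name and the `distance` attribute convention are assumptions inferred from the files above.

```python
import h5py
import numpy as np

def write_ann_benchmarks_hdf5(path, train, test, k=100):
    """Write train/test embeddings plus brute-force top-k ground truth."""
    # Normalize so cosine (angular) distance reduces to 1 - dot product
    train = train / np.linalg.norm(train, axis=1, keepdims=True)
    test = test / np.linalg.norm(test, axis=1, keepdims=True)
    all_distances = 1.0 - test @ train.T              # shape: (n_test, n_train)
    neighbors = np.argsort(all_distances, axis=1)[:, :k]
    distances = np.take_along_axis(all_distances, neighbors, axis=1)
    with h5py.File(path, "w") as f:
        f.create_dataset("train", data=train)
        f.create_dataset("test", data=test)
        f.create_dataset("neighbors", data=neighbors)
        f.create_dataset("distances", data=distances)
        f.attrs["distance"] = "angular"               # assumed metadata convention
```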
@ashvardanian: the results look good. IIUC, the underlying implementation is HNSW; could you please compare the results with the nmslib/hnswlib and faiss/hnsw implementations?
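Assuming the repo's documented workflow, such a head-to-head could be run as follows; the algorithm names are illustrative and must match the `name:` fields in the respective configs:

```
python install.py --algorithm usearch               # build the Docker image
python run.py --dataset glove-100-angular --algorithm usearch
python run.py --dataset glove-100-angular --algorithm hnswlib
python plot.py --dataset glove-100-angular          # overlay recall/QPS curves
```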
Thanks for open-sourcing this benchmark, @erikbern!
We have had this fork for a while, and it feels like a good time to contribute.
A couple of things to note:
- When quantizing the vectors down to `f8`, recall drops to zero on most datasets included in the benchmark; it depends on the properties of the inserted vectors.
- The data in those datasets is different from embeddings produced with most modern encoder Transformers.

Addressing the first issue, would it make sense to extend the list of datasets with other embeddings? We have a few such datasets for ANN benchmarks on our HuggingFace profile.
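A rough way to quantify the quantization effect described above is to build the same USearch index in full precision and in 8-bit, then compare their top-10 results. This is a hedged sketch: `i8` is used here as the current spelling of the 8-bit type referred to as `f8` above, and the f32 index serves as the reference rather than exact search.

```python
import numpy as np
from usearch.index import Index

def recall_at_10(vectors: np.ndarray, queries: np.ndarray) -> float:
    """Overlap of top-10 results between f32 and 8-bit USearch indexes."""
    reference = Index(ndim=vectors.shape[1], metric="cos", dtype="f32")
    quantized = Index(ndim=vectors.shape[1], metric="cos", dtype="i8")
    keys = np.arange(len(vectors))
    reference.add(keys, vectors)
    quantized.add(keys, vectors)
    hits = 0
    for q in queries:
        truth = set(reference.search(q, 10).keys)
        found = set(quantized.search(q, 10).keys)
        hits += len(truth & found)
    return hits / (10 * len(queries))
```

On embeddings from modern encoder Transformers this overlap tends to stay high, which is consistent with the observation above that the drop depends on the distribution of the inserted vectors.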