Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Diskann Benchmarking Wrapper #260

Open
wants to merge 112 commits into
base: branch-25.02
Choose a base branch
from

Conversation

tarang-jain
Copy link
Contributor

@tarang-jain tarang-jain commented Jul 29, 2024

Brings DiskANN into cuvs-bench

  • Build and search in-memory DiskANN index
  • Build and search SSD DiskANN index
  • Build a cuvs Vamana index on GPU and serialize it in DiskANN format. Search on CPU using in-memory DiskANN search API.

@tarang-jain tarang-jain added feature request New feature or request non-breaking Introduces a non-breaking change labels Jul 29, 2024
@tarang-jain tarang-jain self-assigned this Jul 29, 2024
@github-actions github-actions bot added the Python label Aug 3, 2024
@@ -32,6 +32,12 @@ option(CUVS_ANN_BENCH_USE_CUVS_BRUTE_FORCE "Include cuVS brute force knn in benc
option(CUVS_ANN_BENCH_USE_CUVS_CAGRA_HNSWLIB "Include cuVS CAGRA with HNSW search in benchmark" ON)
option(CUVS_ANN_BENCH_USE_HNSWLIB "Include hnsw algorithm in benchmark" ON)
option(CUVS_ANN_BENCH_USE_GGNN "Include ggnn algorithm in benchmark" OFF)
option(CUVS_ANN_BENCH_USE_DISKANN "Include DISKANN search in benchmark" ON)
option(CUVS_ANN_BENCH_USE_CUVS_VAMANA "Include cuVS Vamana with DiskANN search in benchmark" ON)
if(CMAKE_SYSTEM_PROCESSOR MATCHES "(ARM|arm|aarch64)")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does MSFT DiskANN repo not support ARM?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they have mkl-devel as a dependency, which is not meant to be installed in aarch64.

cpp/bench/ann/src/cuvs/cuvs_cagra_diskann_wrapper.h Outdated Show resolved Hide resolved
template <typename T>
void diskann_ssd<T>::build(const T* dataset, size_t nrow)
{
diskann::build_disk_index<float>(base_file_.c_str(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this defined? Is it using the CPU DiskANN repo to build the ssd verson of the index?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is right. It is just to benchmark the CPU build

@cjnolet
Copy link
Member

cjnolet commented Dec 5, 2024

/ok to test

@tarang-jain
Copy link
Contributor Author

/ok to test

@cjnolet cjnolet changed the base branch from branch-24.12 to branch-25.02 December 10, 2024 20:47
@cjnolet
Copy link
Member

cjnolet commented Dec 12, 2024

/ok to test

Copy link
Contributor

@achirkin achirkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please avoid algorithm-specific changes to cpp/bench/ann/src/common/benchmark.hpp as per our offline discussion.
I'm not opposed to exposing dataset/index filepaths to algorithms in general though.
But maybe it would make sense to hide (filter out) them when dumping all build parameters to avoid cluttering the console/reports and exposing private user filepaths?

cpp/bench/ann/src/cuvs/cuvs_vamana_wrapper.h Show resolved Hide resolved
@tarang-jain
Copy link
Contributor Author

@achirkin are you suggesting that we expose the dataset / index filepaths to all algorithms, but only use them for diskann ssd indexes?

@@ -144,7 +150,8 @@ void bench_build(::benchmark::State& state,

const auto algo_property = parse_algo_property(algo->get_preference(), index.build_param);

const T* base_set = dataset->base_set(algo_property.dataset_memory_type);
const T* base_set = nullptr;
if (index.algo != "diskann_ssd") base_set = dataset->base_set(algo_property.dataset_memory_type);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@achirkin if we do not have this line, the entire dataset will be read into the base_set variable needlessly for DiskANN ssd based indexes. For these indexes, only the path to the dataset needs to be known and the DiskANN index building APIs will read the data from the path. In fact, it is important we do it this way because those SSD indexes are designed to let the data remain on disk if there are extremely large datasets.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just set the dataset_memory_type to kHostMmap by default for diskann index. It will lazily map the file content, so it won't read anything until you actually use the pointer.
It will still open the file and read the first two words anyway - independently of this change. That is because the benchmark reads and reports the dataset dimensionality.

Copy link
Contributor

@achirkin achirkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please try to always adhere to these two rules:

  • No algorithm-specific code in the common header files, such as benchmark.hpp
  • All configuration-specific code should go into conf.hpp, not into benchmark.hpp.

@@ -144,7 +150,8 @@ void bench_build(::benchmark::State& state,

const auto algo_property = parse_algo_property(algo->get_preference(), index.build_param);

const T* base_set = dataset->base_set(algo_property.dataset_memory_type);
const T* base_set = nullptr;
if (index.algo != "diskann_ssd") base_set = dataset->base_set(algo_property.dataset_memory_type);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just set the dataset_memory_type to kHostMmap by default for diskann index. It will lazily map the file content, so it won't read anything until you actually use the pointer.
It will still open the file and read the first two words anyway - independently of this change. That is because the benchmark reads and reports the dataset dimensionality.

Comment on lines +138 to +143
if (index.algo == "diskann_ssd") {
make_sure_parent_dir_exists(index.file);
index.build_param["dataset_file"] = dataset->base_filename();
index.build_param["path_to_index"] = index.file;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you shouldn't need path_to_index parameters, because the index member file is accessible from within the index in your algorithm-specific code.
If you need the dataset file path during index build, you can either ingest it as a build param in the conf.hpp, similar to how we inject some search parameters in parse_index(...), or create one more member in the configuration::index - all in conf.hpp; I think it's ok to have it available to all algorithm at that point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, the cuvs::bench::configuration::index object's attributes are not directly passed down to the specific algorithm's wrapper class. So we would need the path_to_index. The algorithm specific structs use only the parameters that are explicitly passed down from the configuration object.

Copy link
Member

@cjnolet cjnolet Jan 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My question is: CAN we pass the cuvs::bench::configuration::Index down into the wrappers? Or rather, any reason we shouldn't do it?

Comment on lines +234 to +237
if (index.algo != "diskann_ssd")
filename = index.file;
else
filename = index.file + "_disk.index";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same problem, the algorithm-specific code should go to the algorithm headers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just for checking if the index file exists. The diskann ssd index is saved with the suffix _disk.index along with the filename.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is algorithm-specific code in common benchmarking code, though. This should be moved into the diskann portions of the code. You have to understand that we create generalized abstractions for this very purpose- anything in common/benchmark.hpp should be applied to all index algorithms and any specific things should be propagated down to the respective index types. Polluting the common code like this moves up the abstraction tree, rather than down, and it results in hard to maintain spaghetti code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another reason to propagate this down to the individual wrappers. You could at least specify a method on the wrapper class to allow any algorithms to override and adjust the filename as needed. That would be more robust than one-offs like this.

@cjnolet
Copy link
Member

cjnolet commented Jan 30, 2025

/ok to test

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
benchmarking CMake cpp feature request New feature or request non-breaking Introduces a non-breaking change Python
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

6 participants