Switch on batching by default and leave data on host when possible for `UMAP` #6219

betatim · 2025-01-13T12:52:48Z

Leave data on the host by default and enable batching when using NN descent.

The `data_on_host` argument to `fit` and `fit_transform` now defaults to `"auto"`. When the brute force algorithm is used nothing changes, as `"auto"` resolves to `False`. When NN descent is used and the dataset is large enough (and isn't sparse), it resolves to `True`. We need this bit of complexity (going via `"auto"`) because the decision has to be conditional in order to match existing behaviour (e.g. switching back to brute force for small datasets).
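A rough sketch of the resolution logic described above, for readers following along. The function name, its parameters, and the algorithm strings are illustrative assumptions rather than the actual cuML code. The 50,000-sample cutoff is the figure mentioned later in this thread, and whether the boundary is inclusive is a guess.

```python
def resolve_data_on_host(data_on_host, algo, n_samples, is_sparse,
                         min_samples=50_000):
    """Resolve data_on_host="auto" to a bool (hypothetical helper)."""
    # An explicit True/False from the user is respected as-is.
    if data_on_host != "auto":
        return bool(data_on_host)
    # Brute force expects the data on the device, so nothing changes.
    if algo == "brute_force_knn":
        return False
    # NN descent: keep data on host only for large, dense inputs.
    # The inclusive comparison is an assumption.
    return bool(n_samples >= min_samples and not is_sparse)
```

With this shape, a user who never touches `data_on_host` gets the old behaviour for small or sparse inputs and the new host-side behaviour only when NN descent actually runs on a large dense dataset.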

Not all tests currently pass with `nnd_n_clusters > 1`; I am still investigating why and what to do about it.

Decisions needed:

Which version should the deprecation target? It is currently set to X+2, but that may be too soon.

copy-pr-bot · 2025-01-13T12:52:51Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

betatim · 2025-01-13T12:53:06Z

/ok to test

Leave data on the host by default and enable batching when using NN descent.

betatim · 2025-01-13T13:09:43Z

/ok to test

betatim · 2025-01-14T08:52:49Z

/ok to test

betatim · 2025-01-14T12:41:40Z

/ok to test

This stops tests from failing because of an expected warning.
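A minimal sketch of the idea, using only the standard library so it runs on its own. It assumes the test suite turns warnings into errors and that the expected warning is a `FutureWarning`; the function names and warning message are made up for illustration. In a pytest suite the same effect is usually achieved with the `@pytest.mark.filterwarnings("ignore:<message>:FutureWarning")` marker on the affected tests.

```python
import warnings


def fit_with_deprecated_arg():
    # Hypothetical stand-in for a call that emits an expected deprecation warning.
    warnings.warn("n_clusters is deprecated", FutureWarning)
    return "fitted"


def run_unfiltered():
    with warnings.catch_warnings():
        # Mimic a test suite configured to fail on any warning.
        warnings.simplefilter("error")
        return fit_with_deprecated_arg()


def run_filtered():
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        # Filters added later take precedence, so this one warning is ignored
        # while every other warning still raises.
        warnings.filterwarnings(
            "ignore",
            message="n_clusters is deprecated",
            category=FutureWarning,
        )
        return fit_with_deprecated_arg()
```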

betatim · 2025-01-14T13:49:29Z

/ok to test

betatim · 2025-01-15T17:19:43Z

/ok to test

beckernick · 2025-01-16T16:12:13Z

Is this impacted by #6216 ?

betatim · 2025-01-17T08:49:42Z

Probably, I've not run into that particular bug, too many other things to sort out till now. But from looking at the issue I assume this PR won't fix that.

betatim · 2025-01-17T16:33:09Z

Gave #6216 a try. It kind of reproduces here. With the exact code from the issue you get the same error. When you don't explicitly select nn-descent but leave it set to the new default, you don't get the "illegal memory access". Instead you are told that the data can't be on host for the brute force algorithm. This is because brute force is the algorithm selected for datasets below 50_000 samples (that part is not new).

If you increase the dataset size to 50_000+1 and leave the constructor arguments set to their defaults then you do get the same problem.

I think this is because the brute force algorithm assumes that the data is on the CUDA device, but it isn't. The fix might be to move the data when we are in this situation (automatically switching to brute force). But I guess moving the data might just fail in cases where the user was right to use data_on_host - as in the data is too big to fit. So maybe the right thing to do is to raise an exception telling the user that transform is currently not implemented.
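To make the second option concrete, here is a sketch of such a guard. The function name, the algorithm string, and the error message are hypothetical; nothing like this exists in the PR yet.

```python
def check_transform_supported(selected_algo, data_on_host):
    """Fail early instead of hitting an illegal memory access (hypothetical)."""
    # Brute force assumes the data lives on the CUDA device. If the data was
    # left on the host, refuse rather than silently copying it over, since the
    # copy may not fit in device memory.
    if selected_algo == "brute_force_knn" and data_on_host:
        raise NotImplementedError(
            "transform is not implemented for data_on_host=True "
            "when the brute force algorithm is selected"
        )
```

The trade-off is that users with small datasets see an error where a device copy would have worked, but users with data too large for the device get a clear message instead of a crash.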

github-actions bot added the Cython / Python Cython or Python issue label Jan 13, 2025

Switch to new default for UMAP

6bb9600

Leave data on the host by default and enable batching when using NN descent.

betatim force-pushed the new-umap-defaults branch from 4bc18a3 to 6bb9600 Compare January 13, 2025 13:06

Handle CPU/GPU interop

52b0bfa

Merge branch 'branch-25.02' into new-umap-defaults

70750de

Mark tests with warning filter

56c1e15

This stops tests from failing because of an expected warning.

Filter deprecation warning

ed127a9

Ignore deprecation warnings related to n_clusters

d2b96e2
