Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch on batching by default and leave data on host when possible for UMAP #6219

Draft
wants to merge 6 commits into
base: branch-25.02
Choose a base branch
from

Conversation

betatim
Copy link
Member

@betatim betatim commented Jan 13, 2025

Leave data on the host by default and enable batching when using NN descent.

The data_on_host argument to fit and fit_transform is now set to auto. When the brute force algorithm is used nothing changes, as auto will resolve to False. When NN descent is used and the dataset is large enough (and isn't sparse) then it will resolve to True. We need this bit of complexity (going via "auto") as we need a way to have this conditional decision making to match existing behaviour (e.g. switching back to brute force for small datasets).

Currently not all tests pass with nnd_n_clusters > 1, still investigating why that is/what to do about that

Decisions needed:

  • what version to use for the deprecation? Right now used X+2 but maybe that is too soon?

Copy link

copy-pr-bot bot commented Jan 13, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions bot added the Cython / Python Cython or Python issue label Jan 13, 2025
@betatim
Copy link
Member Author

betatim commented Jan 13, 2025

/ok to test

Leave data on the host by default and enable batching when using NN
descrent.
@betatim
Copy link
Member Author

betatim commented Jan 13, 2025

/ok to test

@betatim
Copy link
Member Author

betatim commented Jan 14, 2025

/ok to test

@betatim
Copy link
Member Author

betatim commented Jan 14, 2025

/ok to test

This stops tests from failing because of an expected warning.
@betatim
Copy link
Member Author

betatim commented Jan 14, 2025

/ok to test

@betatim
Copy link
Member Author

betatim commented Jan 15, 2025

/ok to test

@beckernick
Copy link
Member

Is this impacted by #6216 ?

@betatim
Copy link
Member Author

betatim commented Jan 17, 2025

Probably, I've not run into that particular bug, too many other things to sort out till now. But from looking at the issue I assume this PR won't fix that.

@betatim
Copy link
Member Author

betatim commented Jan 17, 2025

Gave #6216 a try. It kind of reproduces here. With the exact code from the issue you get the same error. When you don't explicitly select nn-descent but leave it set to the new default then you don't get the "illegal memory access". Instead are told that the data can't be on host for the brute force algorithm. This is because for datasets below 50_000 samples that is the selected algorithm (not new).

If you increase the dataset size to 50_000+1 and leave the constructor arguments set to their defaults then you do get the same problem.

I think this is because the brute force algorithm assumes that the data is on the CUDA device, but it isn't. The fix might be to move the data when we are in this situation (automatically switching to brute force). But I guess moving the data might just fail in cases where the user was right to use data_on_host - as in the data is too big to fit. So maybe the right thing to do is to raise an exception telling the user that transform is currently not implemented.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Cython / Python Cython or Python issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants