Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix cuml parameter setting issues in UMAP/DBSCAN #751

Merged
merged 2 commits into from
Oct 12, 2024

Conversation

rishic3
Copy link
Collaborator

@rishic3 rishic3 commented Oct 8, 2024

Fixed an issue where cuml parameters in UMAP were not being set due to a silent failure in _set_param, where these parameters could not be found in the _param_mapping. Mapping these params to themselves fixes this issue, as it is done in KNN.

DBSCAN had the same issue where cuml parameters were never set, but got away with this by using the Spark API to retrieve parameters and avoiding the use of cuml_params dict entirely.

Changes:

  • Updated both DBSCAN and UMAP to include identity mappings for cuml parameters in _param_mapping.
  • Updated DBSCAN to use the cuml_params dictionary for initialization to be consistent with the other algorithms.
  • Updated the _param_mapping description to avoid future confusion for cuml algorithms without a PySpark equivalent.
  • Added tests for DBSCAN and UMAP to check that cuml parameters are properly set using both the Spark param API and the cuml_params attribute, and test under non-default cuml parameter settings.

This PR also resolves Issue #749, as random_state is now properly set, making dataframe sampling deterministic.

Copy link
Collaborator

@lijinf2 lijinf2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Looks pretty good.
Just one minor comment.

python/src/spark_rapids_ml/params.py Show resolved Hide resolved
python/src/spark_rapids_ml/umap.py Show resolved Hide resolved
python/src/spark_rapids_ml/clustering.py Show resolved Hide resolved
@lijinf2
Copy link
Collaborator

lijinf2 commented Oct 8, 2024

build

@rishic3 rishic3 marked this pull request as ready for review October 9, 2024 16:37
@eordentlich
Copy link
Collaborator

@lijinf2 should @rishic3 add similar tests to this PR for other algo params?

@lijinf2
Copy link
Collaborator

lijinf2 commented Oct 10, 2024

Yeah, that's a good idea.
@rishic3 Can you add similar checking to other algorithms? Each algorithm should have a test_params and we can add "assert est.cuml_params == cuml_params" to the test_params.
Also, is it possible to fix this in case estimator code does not include the "cuml_param": "cuml_param" mappings?

@rishic3 rishic3 merged commit d1c18db into NVIDIA:branch-24.10 Oct 12, 2024
2 checks passed
@rishic3 rishic3 deleted the umap-test branch October 12, 2024 00:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants