Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a bug in copy() of LogisticRegression that does not infer the penalty cuml parameter #807

Merged
merged 15 commits into from
Dec 23, 2024

Conversation

lijinf2
Copy link
Collaborator

@lijinf2 lijinf2 commented Dec 12, 2024

No description provided.

@lijinf2 lijinf2 changed the title Fix a bug in copy() of LogisticRegression that ignores inferring the penalty cuml parameter Fix a bug in copy() of LogisticRegression that does not infer the penalty cuml parameter Dec 12, 2024
Copy link
Collaborator

@eordentlich eordentlich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you check if there is a more general solution that invokes code/logic that does this for initialization? In that case, it may resolve similar issues in the future and maybe with other classes?

@lijinf2
Copy link
Collaborator Author

lijinf2 commented Dec 13, 2024

Can you check if there is a more general solution that invokes code/logic that does this for initialization? In that case, it may resolve similar issues in the future and maybe with other classes?

Revised the PR to make the function independent of LogisticRegression and its parameters. Let me know what you think.

@lijinf2
Copy link
Collaborator Author

lijinf2 commented Dec 13, 2024

build

@@ -1163,6 +1163,16 @@ def _set_params(self, **kwargs: Any) -> "LogisticRegression":
self._set_cuml_reg_params()
return self

def copy(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to move this to base class in core and somehow leverage _set_params which already calls set_cuml_reg_params(). Or refactor _set_params a bit to enable this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. Revised to have this moved to base class in params.py.

@lijinf2
Copy link
Collaborator Author

lijinf2 commented Dec 16, 2024

build

@lijinf2
Copy link
Collaborator Author

lijinf2 commented Dec 16, 2024

@eordentlich The PR has been updated to address comments and includes test cases for all estimators. Please review and let me know if you have any further feedback.

Currently, the copy() function works for all initialization parameters except float32_inputs and num_workers. These need to be converted to Spark Params, and I will make that change if the overall PR looks acceptable

@@ -148,8 +154,57 @@ class _RandomForestCumlParams(
HasFeaturesCols,
HasLabelCol,
):

n_streams = Param(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these converted to Params to enable modification on copy? For spark.mllib estimators I think ok to do that only for mapped params. Or at least is a separate topic.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is to avoid changing function signature: def copy(self: P, extra: Optional["ParamMap"] = None) -> P, and ParamMap = Dict[pyspark.ml.param.Param, Any].

copy() only accepts Param type.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if I'm missing something - how come we need to declare RF cuml params as Params here and not in other algos?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the Spark copy() function is designed to work with parameters of the Param type.
Declaring the cuML parameters for RandomForest as Param enables copy() to function correctly for them. For other algorithms, their parameters should have already been declared as Param type (e.g. eps, min_samples of DBSCAN).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about, say, whiten or svd_solver in PCA? Those are undeclared and seems like adding those to the PCA test would cause an attribute failure.

Copy link
Collaborator Author

@lijinf2 lijinf2 Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. We will need to declare the cuml-only params (e.g. whiten and svd_solver) one by one as Param, in order to get them working properly.

),
],
)
def test_copy(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is used in other tests so seems better to move in a generic central place like utils?

python/src/spark_rapids_ml/params.py Show resolved Hide resolved
python/tests/test_logistic_regression.py Outdated Show resolved Hide resolved
python/tests/test_umap.py Outdated Show resolved Hide resolved
@@ -148,8 +154,57 @@ class _RandomForestCumlParams(
HasFeaturesCols,
HasLabelCol,
):

n_streams = Param(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if I'm missing something - how come we need to declare RF cuml params as Params here and not in other algos?

python/src/spark_rapids_ml/params.py Show resolved Hide resolved
@rishic3
Copy link
Collaborator

rishic3 commented Dec 17, 2024

Not directly related to this PR but should DBSCAN setters/getters for cuml Params be placed in _DBSCANCumlParams class rather than DBSCAN to align with UMAP/KNN?

@lijinf2
Copy link
Collaborator Author

lijinf2 commented Dec 17, 2024

Not directly related to this PR but should DBSCAN setters/getters for cuml Params be placed in _DBSCANCumlParams class rather than DBSCAN to align with UMAP/KNN?

@rishic3 "Thank you for sharing your thoughts! I have updated the PR accordingly. To ensure proper functionality in copy(), any cuML-only parameter must be declared as a pyspark.Param type. Please take another look when you have a moment.

Regarding the alignment of DBSCAN setters/getters with UMAP/KNN, it seems we may need to review other estimators as well. Perhaps we can create a ticket to track this. Similarly, we should track the broader issue of supporting all cuML-only parameters (e.g., 'whiten' in PCA)."

@rishic3
Copy link
Collaborator

rishic3 commented Dec 17, 2024

LGTM. Thanks @lijinf2!

@lijinf2
Copy link
Collaborator Author

lijinf2 commented Dec 17, 2024

build

Copy link
Collaborator

@eordentlich eordentlich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@eordentlich eordentlich merged commit 97938b9 into NVIDIA:branch-24.12 Dec 23, 2024
3 checks passed
@lijinf2 lijinf2 deleted the lr_copy_bug branch January 7, 2025 02:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants