Fix a bug in copy() of LogisticRegression that does not infer the penalty cuml parameter #807

lijinf2 · 2024-12-12T20:01:38Z

No description provided.

eordentlich

Can you check if there is a more general solution that invokes code/logic that does this for initialization? In that case, it may resolve similar issues in the future and maybe with other classes?


lijinf2 · 2024-12-13T00:02:44Z

> Can you check if there is a more general solution that invokes code/logic that does this for initialization? In that case, it may resolve similar issues in the future and maybe with other classes?

Revised the PR to make the function independent of LogisticRegression and its parameters. Let me know what you think.

lijinf2 · 2024-12-13T00:03:00Z

build

eordentlich · 2024-12-14T01:25:36Z

python/src/spark_rapids_ml/classification.py

@@ -1163,6 +1163,16 @@ def _set_params(self, **kwargs: Any) -> "LogisticRegression":
            self._set_cuml_reg_params()
        return self

+    def copy(


Is there a way to move this to the base class in core and somehow leverage _set_params, which already calls _set_cuml_reg_params()? Or refactor _set_params a bit to enable this?

Good suggestion. Revised to have this moved to base class in params.py.
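The idea in this thread can be sketched roughly as follows. This is a minimal, hypothetical illustration and not the code merged in params.py: `ToyEstimatorBase`, `ToyLogisticRegression`, `_sync_derived_params` and `naive_copy` are made-up names, the overrides are keyed by plain strings rather than Param objects, and the regParam/elasticNetParam to penalty mapping is an assumption made for the example.

```python
import copy as _copy
from typing import Any, Dict, Optional


class ToyEstimatorBase:
    """Hypothetical stand-in for a shared base class (not the spark-rapids-ml code)."""

    def __init__(self, **kwargs: Any) -> None:
        self.spark_params: Dict[str, Any] = {}
        self.cuml_params: Dict[str, Any] = {}
        self._set_params(**kwargs)

    def _set_params(self, **kwargs: Any) -> "ToyEstimatorBase":
        # Single entry point: store the values, then refresh anything derived from them.
        self.spark_params.update(kwargs)
        self._sync_derived_params()
        return self

    def _sync_derived_params(self) -> None:
        """Hook for subclasses whose backend params depend on several Spark params."""

    def naive_copy(self, extra: Optional[Dict[str, Any]] = None) -> "ToyEstimatorBase":
        # Reproduces the bug: overrides are stored, derived params are left stale.
        that = _copy.deepcopy(self)
        that.spark_params.update(extra or {})
        return that

    def copy(self, extra: Optional[Dict[str, Any]] = None) -> "ToyEstimatorBase":
        # General fix: route overrides through _set_params so the hook runs again.
        that = _copy.deepcopy(self)
        that._set_params(**(extra or {}))
        return that


class ToyLogisticRegression(ToyEstimatorBase):
    def _sync_derived_params(self) -> None:
        # Assumed mapping, for illustration only.
        reg = self.spark_params.get("regParam", 0.0)
        mix = self.spark_params.get("elasticNetParam", 0.0)
        if reg == 0.0:
            penalty = None
        elif mix == 0.0:
            penalty = "l2"
        elif mix == 1.0:
            penalty = "l1"
        else:
            penalty = "elasticnet"
        self.cuml_params["penalty"] = penalty


lr = ToyLogisticRegression(regParam=0.1, elasticNetParam=0.0)
stale = lr.naive_copy({"elasticNetParam": 1.0})
fresh = lr.copy({"elasticNetParam": 1.0})
print(lr.cuml_params["penalty"], stale.cuml_params["penalty"], fresh.cuml_params["penalty"])
```

Because the base class only calls a hook, estimators without derived parameters need no extra code, while any estimator that overrides the hook gets consistent behaviour from both construction and copy().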

lijinf2 · 2024-12-16T22:48:36Z

build

lijinf2 · 2024-12-16T22:53:00Z

@eordentlich The PR has been updated to address comments and includes test cases for all estimators. Please review and let me know if you have any further feedback.

Currently, the copy() function works for all initialization parameters except float32_inputs and num_workers. These need to be converted to Spark Params, and I will make that change if the overall PR looks acceptable.

eordentlich · 2024-12-17T04:10:01Z

python/src/spark_rapids_ml/tree.py

@@ -148,8 +154,57 @@ class _RandomForestCumlParams(
    HasFeaturesCols,
    HasLabelCol,
 ):
+
+    n_streams = Param(


Are these converted to Params to enable modification on copy? For spark.mllib estimators, I think it is OK to do that only for mapped params. Or at least it is a separate topic.

Yeah, this is to avoid changing function signature: def copy(self: P, extra: Optional["ParamMap"] = None) -> P, and ParamMap = Dict[pyspark.ml.param.Param, Any].

copy() only accepts Param type.

Sorry if I'm missing something - how come we need to declare RF cuml params as Params here and not in other algos?

Yes, the Spark copy() function is designed to work with parameters of the Param type.
Declaring the cuML parameters for RandomForest as Param enables copy() to function correctly for them. For other algorithms, their parameters should have already been declared as Param type (e.g. eps, min_samples of DBSCAN).

What about, say, whiten or svd_solver in PCA? Those are undeclared, and it seems like adding them to the PCA test would cause an attribute failure.

Right. We will need to declare the cuml-only params (e.g. whiten and svd_solver) one by one as Param, in order to get them working properly.
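The constraint discussed in this thread can be illustrated with a small stand-alone sketch. It deliberately does not import pyspark: `ToyParam` and `ToyParams` are hypothetical stand-ins for `pyspark.ml.param.Param` and `Params`, and `ToyRandomForest` and `ToyPCA` are made-up classes. The point is only that a map keyed by Param objects can name a parameter in copy() if, and only if, that parameter has been declared on the class.

```python
import copy as _copy
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class ToyParam:
    """Stand-in for pyspark.ml.param.Param: a hashable, named key."""

    name: str
    doc: str


class ToyParams:
    """Stand-in for pyspark.ml.param.Params, reduced to what copy() needs."""

    def __init__(self) -> None:
        self._param_map: Dict[ToyParam, Any] = {}

    def copy(self, extra: Optional[Dict[ToyParam, Any]] = None) -> "ToyParams":
        that = _copy.copy(self)
        that._param_map = dict(self._param_map)
        # Keys are Param objects, so only declared params can appear here.
        that._param_map.update(extra or {})
        return that

    def get(self, param: ToyParam) -> Any:
        return self._param_map[param]


class ToyRandomForest(ToyParams):
    # Declared as a class-level Param, so it can be used as a copy() key.
    n_streams = ToyParam("n_streams", "cuML-only parameter exposed as a Param")


class ToyPCA(ToyParams):
    k = ToyParam("k", "number of principal components")
    # "whiten" is deliberately left undeclared.


rf = ToyRandomForest()
rf_copy = rf.copy({ToyRandomForest.n_streams: 4})
print(rf_copy.get(ToyRandomForest.n_streams))

try:
    ToyPCA().copy({ToyPCA.whiten: True})  # no such attribute to use as a key
except AttributeError as err:
    print("cannot build the ParamMap:", err)
```

In this sketch the failure for the undeclared parameter happens before copy() is even called, which matches the "attribute failure" anticipated above for whiten and svd_solver.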

eordentlich · 2024-12-17T04:10:52Z

python/tests/test_logistic_regression.py

+        ),
+    ],
+)
+def test_copy(


This is used in other tests, so it seems better to move it to a generic central place like utils?

python/src/spark_rapids_ml/params.py

python/tests/test_logistic_regression.py

python/tests/test_umap.py

rishic3 · 2024-12-17T17:13:21Z

python/src/spark_rapids_ml/tree.py

@@ -148,8 +154,57 @@ class _RandomForestCumlParams(
    HasFeaturesCols,
    HasLabelCol,
 ):
+
+    n_streams = Param(


Sorry if I'm missing something - how come we need to declare RF cuml params as Params here and not in other algos?

python/src/spark_rapids_ml/params.py

rishic3 · 2024-12-17T18:01:15Z

Not directly related to this PR but should DBSCAN setters/getters for cuml Params be placed in _DBSCANCumlParams class rather than DBSCAN to align with UMAP/KNN?

lijinf2 · 2024-12-17T20:58:27Z

> Not directly related to this PR but should DBSCAN setters/getters for cuml Params be placed in _DBSCANCumlParams class rather than DBSCAN to align with UMAP/KNN?

@rishic3 Thank you for sharing your thoughts! I have updated the PR accordingly. To ensure proper functionality in copy(), any cuML-only parameter must be declared as a pyspark.ml.param.Param. Please take another look when you have a moment.

Regarding the alignment of DBSCAN setters/getters with UMAP/KNN, it seems we may need to review other estimators as well. Perhaps we can create a ticket to track this. Similarly, we should track the broader issue of supporting all cuML-only parameters (e.g., 'whiten' in PCA).

rishic3 · 2024-12-17T21:58:20Z

LGTM. Thanks @lijinf2!

python/tests/test_kmeans.py


lijinf2 · 2024-12-17T22:13:59Z

build

eordentlich

👍

lijinf2 changed the title ~~Fix a bug in copy() of LogisticRegression that ignores inferring the penalty cuml parameter~~ Fix a bug in copy() of LogisticRegression that does not infer the penalty cuml parameter Dec 12, 2024

eordentlich reviewed Dec 12, 2024


lijinf2 added 2 commits December 12, 2024 22:45

fix LogisticRegression copy ignores penalty

5f92db6

Signed-off-by: Jinfeng <[email protected]>

demo for general solution per comment

82423db

lijinf2 force-pushed the lr_copy_bug branch from fb831f4 to 82423db Compare December 13, 2024 00:01

eordentlich reviewed Dec 14, 2024


lijinf2 added 11 commits December 15, 2024 23:48

get copy and its support of verbose work for dbscan, lr

c70135b

move copy to base class

9dba6fe

remove clutter and checked nightly passed locally

47122f7

get copy and its support of verbose passed in kmeans

d014104

linear_regression

dbb7bac

get it work for knn, ann

f2982ce

get it works for full lr params except float32_inputs and num_workers

fd608f1

get it work for pca

fefdd47

get copy works for random_forest

ec3abda

support umap

f7377e7

clean

32e8123

eordentlich reviewed Dec 17, 2024


rishic3 reviewed Dec 17, 2024


move a test function to common and get mypy passed

8668b39

rishic3 reviewed Dec 17, 2024


python/tests/test_kmeans.py (outdated, resolved)

Update python/tests/test_kmeans.py

d4cdf33

Co-authored-by: Rishi C. <[email protected]>

eordentlich approved these changes Dec 23, 2024


eordentlich merged commit 97938b9 into NVIDIA:branch-24.12 Dec 23, 2024
3 checks passed

lijinf2 deleted the lr_copy_bug branch January 7, 2025 02:08
