-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix a bug in copy() of LogisticRegression that does not infer the penalty cuml parameter #807
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you check if there is a more general solution that invokes code/logic that does this for initialization? In that case, it may resolve similar issues in the future and maybe with other classes?
Signed-off-by: Jinfeng <[email protected]>
Revised the PR to make the function independent of LogisticRegression and its parameters. Let me know what you think. |
build |
@@ -1163,6 +1163,16 @@ def _set_params(self, **kwargs: Any) -> "LogisticRegression": | |||
self._set_cuml_reg_params() | |||
return self | |||
|
|||
def copy( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to move this to base class in core and somehow leverage _set_params which already calls set_cuml_reg_params(). Or refactor _set_params a bit to enable this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion. Revised to have this moved to base class in params.py.
build |
@eordentlich The PR has been updated to address comments and includes test cases for all estimators. Please review and let me know if you have any further feedback. Currently, the |
@@ -148,8 +154,57 @@ class _RandomForestCumlParams( | |||
HasFeaturesCols, | |||
HasLabelCol, | |||
): | |||
|
|||
n_streams = Param( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these converted to Params to enable modification on copy? For spark.mllib estimators I think ok to do that only for mapped params. Or at least is a separate topic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is to avoid changing function signature: def copy(self: P, extra: Optional["ParamMap"] = None) -> P, and ParamMap = Dict[pyspark.ml.param.Param, Any].
copy() only accepts Param type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry if I'm missing something - how come we need to declare RF cuml params as Params here and not in other algos?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the Spark copy() function is designed to work with parameters of the Param type.
Declaring the cuML parameters for RandomForest as Param enables copy() to function correctly for them. For other algorithms, their parameters should have already been declared as Param type (e.g. eps, min_samples of DBSCAN).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about, say, whiten
or svd_solver
in PCA? Those are undeclared and seems like adding those to the PCA test would cause an attribute failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. We will need to declare the cuml-only params (e.g. whiten
and svd_solver
) one by one as Param, in order to get them working properly.
), | ||
], | ||
) | ||
def test_copy( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is used in other tests so seems better to move in a generic central place like utils?
@@ -148,8 +154,57 @@ class _RandomForestCumlParams( | |||
HasFeaturesCols, | |||
HasLabelCol, | |||
): | |||
|
|||
n_streams = Param( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry if I'm missing something - how come we need to declare RF cuml params as Params here and not in other algos?
Not directly related to this PR but should DBSCAN setters/getters for cuml Params be placed in _DBSCANCumlParams class rather than DBSCAN to align with UMAP/KNN? |
@rishic3 "Thank you for sharing your thoughts! I have updated the PR accordingly. To ensure proper functionality in copy(), any cuML-only parameter must be declared as a pyspark.Param type. Please take another look when you have a moment. Regarding the alignment of DBSCAN setters/getters with UMAP/KNN, it seems we may need to review other estimators as well. Perhaps we can create a ticket to track this. Similarly, we should track the broader issue of supporting all cuML-only parameters (e.g., 'whiten' in PCA)." |
LGTM. Thanks @lijinf2! |
Co-authored-by: Rishi C. <[email protected]>
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
No description provided.