[LogisticRegression] Match Spark CPU behaviors when dataset has one label #531

Merged
merged 7 commits into from
Dec 28, 2023

Conversation

lijinf2
Collaborator

@lijinf2 lijinf2 commented Dec 8, 2023

No description provided.

Signed-off-by: Jinfeng <[email protected]>
@lijinf2
Collaborator Author

lijinf2 commented Dec 8, 2023

build

if len(logistic_regression.classes_) == 1:
    if init_parameters["fit_intercept"] is True:
        model["coef_"] = [[0.0] * logistic_regression.n_cols]
        model["intercept_"] = [float("inf")]
Collaborator

Would the sign of this depend on the label value?

Collaborator Author

Revised to support -inf for label 0.
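
A minimal sketch of why the sign follows the label, assuming `fit_intercept` is true: with all-zero coefficients the predicted probability is just the sigmoid of the intercept, so `+inf` gives probability 1 (label 1) and `-inf` gives probability 0 (label 0). The helpers `sigmoid` and `single_label_model` are hypothetical names for illustration and are not the PR's code; only the `coef_` and `intercept_` keys come from the diff above.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function written so that +inf and -inf do not overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def single_label_model(label: float, n_cols: int) -> dict:
    """Hypothetical helper: degenerate model when every row has the same label.

    Assumes fit_intercept=True and a label of exactly 0.0 or 1.0.
    """
    if label not in (0.0, 1.0):
        raise ValueError("label must be 0.0 or 1.0 in this sketch")
    intercept = float("inf") if label == 1.0 else float("-inf")
    return {"coef_": [[0.0] * n_cols], "intercept_": [intercept]}


for label in (1.0, 0.0):
    model = single_label_model(label, n_cols=2)
    # With zero coefficients, the probability depends only on the intercept.
    prob = sigmoid(model["intercept_"][0])
    print(label, model["intercept_"], prob)
```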

if len(result["classes_"]) == 1:
    if self.getFitIntercept() is False:
        print(
            "WARNING: All labels belong to a single class and fitIntercept=false. It's a dangerous ground, so the algorithm may not converge."
Collaborator

Can we match Spark's warning?

Collaborator Author

Is there a way to capture the Spark Scala warning in Python? I tried caplog.set_level() with INFO, WARN, and CRITICAL but got empty log text.

Collaborator Author

Revised to use logger.warning
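
A rough sketch of what replacing `print` with a logger call could look like, using only the standard `logging` module. The logger name `single_label_sketch` and the function `warn_single_label` are assumptions made for illustration; the two message texts are copied from the diff in this thread, minus the `WARNING:` prefix, since the log level already carries that information.

```python
import logging

# Hypothetical logger name; the project's real logger may be obtained differently.
logger = logging.getLogger("single_label_sketch")


def warn_single_label(fit_intercept: bool) -> str:
    """Emit the single-label warning at WARNING level and return the message."""
    if fit_intercept:
        msg = (
            "All labels are the same value and fitIntercept=true, "
            "so the coefficients will be zeros. Training is not needed."
        )
    else:
        msg = (
            "All labels belong to a single class and fitIntercept=false. "
            "It's a dangerous ground, so the algorithm may not converge."
        )
    logger.warning(msg)
    return msg
```

Unlike `print`, a logger call respects the application's logging configuration and can be captured by test tooling.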

        )
    else:
        print(
            "WARNING: All labels are the same value and fitIntercept=true, so the coefficients will be zeros. Training is not needed."
Collaborator

Same here.

Collaborator Author

Revised

python/tests/test_logistic_regression.py (outdated)
    assert blor_model.intercept == 0.0
else:
    assert array_equal(blor_model.coefficients.toArray(), [0, 0], 0.0)
    assert blor_model.intercept == float("inf")
Collaborator

Maybe also check what happens when all labels are 0 instead of 1 (i.e. y).

Collaborator Author

added

class_val = logistic_regression.classes_[0]
assert (
    class_val == 1.0 or class_val == 0.0
), "class value must be either 1. or 0. when dataset has one label"
Collaborator

What does Spark do if the label has a single value that is not 1 or 0?

Collaborator Author

Revised.
If the label is < 0, Spark raises a Java runtime error.
If the label is > 1, Spark trains a multinomial classifier, while cuML trains a single-class classifier because it uses y.unique().
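
To illustrate the mismatch described above, here is a small pure-Python sketch. It assumes that Spark treats labels as class indices and infers the class count as the largest label plus one (an assumption, not something stated in this thread), while `y.unique()` only sees the values actually present. Both function names are hypothetical, and `ValueError` merely stands in for the Java runtime error mentioned above.

```python
def classes_via_unique(labels):
    """Stand-in for y.unique(): the distinct label values actually present."""
    return sorted(set(labels))


def num_classes_via_max_label(labels):
    """Assumed index-based inference: class count = largest label + 1."""
    if any(v < 0 or v != int(v) for v in labels):
        # Stand-in for the error raised on invalid (e.g. negative) labels.
        raise ValueError("labels must be non-negative integers")
    return int(max(labels)) + 1


labels = [2.0, 2.0, 2.0]  # one distinct label, but not 0 or 1
print(classes_via_unique(labels))         # [2.0] -> looks like a single class
print(num_classes_via_max_label(labels))  # 3 -> looks like a 3-class problem
```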

@lijinf2
Collaborator Author

lijinf2 commented Dec 13, 2023

build

blor_model = blor.fit(bdf)

if fit_intercept is False:
    if label == 1.0:
Collaborator

@eordentlich
Collaborator

Any update on this?

@eordentlich
Collaborator

You will probably need to patch the CI Docker image in this PR to get tests to pass, as rapidsai-nightly no longer has cuml 23.12. Switch to the rapidsai channel.

@lijinf2
Collaborator Author

lijinf2 commented Dec 27, 2023

build

@lijinf2
Collaborator Author

lijinf2 commented Dec 27, 2023

You will probably need to patch the CI Docker image in this PR to get tests to pass, as rapidsai-nightly no longer has cuml 23.12. Switch to the rapidsai channel.

Added the caplog check and a test case for invalid labels. Just updated the CI Docker image, and yes, CI seems to run now.
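
For readers unfamiliar with it, `caplog` is pytest's built-in fixture for capturing records emitted through Python's `logging` module. Below is a minimal sketch of such a test. `fit_single_label` and the logger name are hypothetical stand-ins for the real fit path, so this is not the test added in the PR; only `caplog.at_level` and `caplog.text` are actual pytest API.

```python
import logging

# Hypothetical logger name used only in this sketch.
logger = logging.getLogger("single_label_sketch")


def fit_single_label(fit_intercept: bool) -> None:
    """Hypothetical stand-in for the code path that warns on single-label data."""
    if fit_intercept:
        logger.warning(
            "All labels are the same value and fitIntercept=true, "
            "so the coefficients will be zeros. Training is not needed."
        )


def test_single_label_warning(caplog):
    # Capture WARNING-and-above records while the code under test runs.
    with caplog.at_level(logging.WARNING):
        fit_single_label(fit_intercept=True)
    # caplog.text holds the formatted text of all captured records.
    assert "coefficients will be zeros" in caplog.text
```

This only captures messages that go through Python logging; output written with `print` would need a different fixture such as `capsys`.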

@lijinf2
Collaborator Author

lijinf2 commented Dec 27, 2023

build

Collaborator

@eordentlich eordentlich left a comment

👍

@lijinf2 lijinf2 merged commit 215e623 into NVIDIA:branch-23.12 Dec 28, 2023
2 checks passed
@lijinf2 lijinf2 deleted the lr_onelabel branch March 6, 2024 05:01