[BUG] - cuML LabelEncoder is 200x slower with cuDF-Pandas vs cuDF #6232

Open
cdeotte opened this issue Jan 17, 2025 · 0 comments
Labels
? - Needs Triage (Need team to review and classify), bug (Something isn't working)

Comments


cdeotte commented Jan 17, 2025

Describe the bug
When label-encoding a column of strings, cuML's LabelEncoder is about 200x slower when the input column comes from a cuDF-Pandas dataframe than when it comes from a pure cuDF dataframe.

Steps/Code to reproduce bug

%load_ext cudf.pandas

from time import time
import random, string
import pandas as pd, numpy as np, cudf
from cuml.preprocessing import LabelEncoder

def generate_unique_strings(count, length):
    chars = string.ascii_letters + string.digits
    unique_strings = set()

    while len(unique_strings) < count:
        new_string = ''.join(random.choices(chars, k=length))
        unique_strings.add(new_string)

    return list(unique_strings)

unique_strings = generate_unique_strings(1000, 13)
strings = np.random.choice(unique_strings,1_000_000,replace=True)

df_cudf_pandas = pd.DataFrame(strings)
LE = LabelEncoder()
start = time()
df_cudf_pandas[0] = LE.fit_transform(df_cudf_pandas[0])
elapsed_cudf_pandas = time()-start
print(f"cuML Label Encoder with cuDF-Pandas took {elapsed_cudf_pandas:.5f} seconds")

df_cudf = cudf.DataFrame(strings)
LE = LabelEncoder()
start = time()
df_cudf[0] = LE.fit_transform(df_cudf[0])
elapsed_cudf = time()-start
print(f"cuML Label Encoder with pure cuDF took {elapsed_cudf:.5f} seconds")

slowdown = elapsed_cudf_pandas / elapsed_cudf
print(f"cuML Label Encoder is slowed down by a factor of {slowdown:.0f}x using cuDF-Pandas")
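
The script above times a single cold call with `time()`, so one-off costs on the first call could be mixed into either measurement. As a rough sketch (not part of the original report), a small helper can do a warm-up call and then keep the best of several repeats. The name `time_call` and its `warmup`/`repeats` parameters are hypothetical, and the usage shown in the comments assumes the `LabelEncoder`, `df_cudf_pandas`, and `df_cudf` objects from the script above.

```python
from time import perf_counter


def time_call(fn, *args, warmup=1, repeats=5):
    """Call fn(*args) repeatedly and return (best_seconds, all_timings).

    Hypothetical helper: warm-up calls are executed but not timed, so
    one-off setup costs do not dominate the reported number.
    """
    for _ in range(warmup):
        fn(*args)

    timings = []
    for _ in range(repeats):
        start = perf_counter()
        fn(*args)
        timings.append(perf_counter() - start)

    return min(timings), timings


# Intended usage with the objects from the reproduction script
# (assumes a working RAPIDS install, so it is left commented out):
#
#   best_proxy, _ = time_call(lambda: LabelEncoder().fit_transform(df_cudf_pandas[0]))
#   best_cudf, _ = time_call(lambda: LabelEncoder().fit_transform(df_cudf[0]))
#   print(f"steady-state slowdown: {best_proxy / best_cudf:.0f}x")
```

If the ratio stays near 200x after warm-up, the gap is unlikely to be a first-call artifact.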

Expected behavior
We would expect cuDF-Pandas to run at the same speed as pure cuDF, not 200x slower.
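
Until the two paths perform alike, one possible workaround is to hand cuML the underlying cuDF object instead of the cuDF-Pandas proxy. The sketch below assumes the proxy exposes an `as_gpu_object()` method, as the cudf.pandas documentation describes for recent RAPIDS releases. The helper name `unwrap_to_gpu` is hypothetical, and this issue does not confirm that unwrapping removes the slowdown.

```python
def unwrap_to_gpu(obj):
    """Return the underlying GPU object for a cudf.pandas proxy.

    Assumption: proxy objects provide `as_gpu_object()`. Anything that
    lacks the method (for example a plain cuDF Series) is returned
    unchanged, so the helper is safe to call on either kind of input.
    """
    as_gpu = getattr(obj, "as_gpu_object", None)
    return as_gpu() if callable(as_gpu) else obj


# Intended usage with the objects from the reproduction script
# (assumes a working RAPIDS install, so it is left commented out):
#
#   LE = LabelEncoder()
#   encoded = LE.fit_transform(unwrap_to_gpu(df_cudf_pandas[0]))
```

The result in this sketch would be a plain cuDF object rather than a proxy, which may matter if it is assigned back into the cuDF-Pandas dataframe.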

Environment details (please complete the following information):
RAPIDS 24.12

cdeotte added the "? - Needs Triage" and "bug" labels Jan 17, 2025