ENH: Change default dtype of str.get_dummies() to bool #60641

Open
wants to merge 2 commits into
base: main


@komo-fr komo-fr commented Jan 2, 2025

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Overview

This PR changes the default dtype of the result of str.get_dummies() from np.int64 to a boolean dtype (bool, boolean, or bool[pyarrow], depending on the input). This makes str.get_dummies() consistent with pd.get_dummies(). The implementation also adapts to the input dtype and returns the corresponding boolean type.


Background

Currently, pd.get_dummies() returns a boolean dtype by default, while str.get_dummies() returns an integer dtype (np.int64). This inconsistency may cause confusion for users.

This PR aligns the default dtype of str.get_dummies() with that of pd.get_dummies().


Changes Made

  • Changed the default dtype of the result from str.get_dummies() from np.int64 to boolean.
  • Implemented support for returning the corresponding boolean type when the input data is an ExtensionArray. For example, input of dtype string[python] yields boolean, and a categorical whose categories are pd.ArrowDtype(pa.string()) (displayed as string[pyarrow]) yields bool[pyarrow].

Behavior Before and After the Change

Before

>>> sr = pd.Series(["A", "B", "A"])
>>> sr.str.get_dummies()
   A  B
0  1  0
1  0  1
2  1  0

After

>>> sr = pd.Series(["A", "B", "A"])
>>> sr.str.get_dummies()
       A      B
0   True  False
1  False   True
2   True  False
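
Code that must behave the same before and after this change can cast the result explicitly instead of relying on the default. The following is a minimal sketch of that approach, not part of the PR itself; the variable names are illustrative only.

```python
import pandas as pd

sr = pd.Series(["A", "B", "A"])
dummies = sr.str.get_dummies()

# Cast explicitly so downstream code does not depend on the default dtype.
as_bool = dummies.astype(bool)     # boolean indicator columns
as_int = dummies.astype("int64")   # integer indicator columns (the old default)

print(as_bool.dtypes)
print(as_int.dtypes)
```

The cast is a no-op when the result already has the requested dtype, so the same code works regardless of which default is in effect.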

Additional Notes

The following code demonstrates the difference in behavior between pd.get_dummies() and str.get_dummies() before the change:

import numpy as np
import pandas as pd
import pyarrow as pa

sr_list = [pd.Series(["A", "B", "A"]),
           pd.Series(["A", "B", "A"], dtype=pd.StringDtype()),
           pd.Series(["A", "B", "A"], dtype=pd.StringDtype("pyarrow")),
           pd.Series(["A", "B", "A"], dtype=pd.StringDtype("pyarrow_numpy")),
           pd.Series(["A", "B", "A"], dtype=pd.ArrowDtype(pa.string())),
           pd.Series(["A", "B", "A"], dtype="category"),
           pd.Series(["A", "B", "A"], dtype=pd.CategoricalDtype(pd.Index(["A", "B"], dtype=pd.ArrowDtype(pa.string()))))
]

for i, sr in enumerate(sr_list):
    print(f"----- case {i}. {sr.dtype=} -----")
    print(f"pd.get_dummies: {pd.get_dummies(sr)['A'].dtype}")
    print(f"str.get_dummies: {sr.str.get_dummies()['A'].dtype}")

Output Before the Change:

(Note: In case 4, the outputs of pd.get_dummies() and str.get_dummies() already match.)

----- case 0. sr.dtype=dtype('O') -----
pd.get_dummies: bool
str.get_dummies: int64
----- case 1. sr.dtype=string[python] -----
pd.get_dummies: boolean
str.get_dummies: int64
----- case 2. sr.dtype=string[pyarrow] -----
pd.get_dummies: boolean
str.get_dummies: int64
----- case 3. sr.dtype=str -----
pd.get_dummies: bool
str.get_dummies: int64
----- case 4. sr.dtype=string[pyarrow] -----
pd.get_dummies: bool[pyarrow]
str.get_dummies: bool[pyarrow]
----- case 5. sr.dtype=CategoricalDtype(categories=['A', 'B'], ordered=False, categories_dtype=object) -----
pd.get_dummies: bool
str.get_dummies: int64
----- case 6. sr.dtype=CategoricalDtype(categories=['A', 'B'], ordered=False, categories_dtype=string[pyarrow]) -----
pd.get_dummies: bool[pyarrow]
str.get_dummies: int64

Output After the Change:

----- case 0. sr.dtype=dtype('O') -----
pd.get_dummies: bool
str.get_dummies: bool
----- case 1. sr.dtype=string[python] -----
pd.get_dummies: boolean
str.get_dummies: boolean
----- case 2. sr.dtype=string[pyarrow] -----
pd.get_dummies: boolean
str.get_dummies: boolean
----- case 3. sr.dtype=str -----
pd.get_dummies: bool
str.get_dummies: bool
----- case 4. sr.dtype=string[pyarrow] -----
pd.get_dummies: bool[pyarrow]
str.get_dummies: bool[pyarrow]
----- case 5. sr.dtype=CategoricalDtype(categories=['A', 'B'], ordered=False, categories_dtype=object) -----
pd.get_dummies: bool
str.get_dummies: bool
----- case 6. sr.dtype=CategoricalDtype(categories=['A', 'B'], ordered=False, categories_dtype=string[pyarrow]) -----
pd.get_dummies: bool[pyarrow]
str.get_dummies: bool[pyarrow]

Comment on lines 45 to +48
 def test_get_dummies_index():
     # GH9980, GH8028
     idx = Index(["a|b", "a|c", "b|c"])
-    result = idx.str.get_dummies("|")
+    result = idx.str.get_dummies("|", dtype=np.int64)

When the input is a pd.Index, the result is wrapped into a MultiIndex only on the assumption that its dtype is not boolean:
https://github.com/pandas-dev/pandas/blob/main/pandas/core/strings/accessor.py#L381-L389

With this PR, the default behavior of str.get_dummies() changes to use a boolean dtype. To ensure the test cases remain consistent with the intended behavior, I modified them to explicitly specify the dtype.
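
The dependence on dtype can be illustrated with a small stand-in. This is only a sketch: wrap_index_dummies is a hypothetical helper, not the pandas implementation, and it assumes the linked branch returns boolean results unwrapped and builds a MultiIndex otherwise.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_bool_dtype


def wrap_index_dummies(values, names):
    """Hypothetical stand-in for how an Index result is wrapped."""
    # Assumption: boolean results are returned as a plain array.
    if is_bool_dtype(values):
        return values
    # Non-boolean results become a MultiIndex with one level per label.
    return pd.MultiIndex.from_tuples([tuple(row) for row in values], names=names)


names = ["a", "b", "c"]
int_dummies = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1]], dtype=np.int64)

as_multi = wrap_index_dummies(int_dummies, names)
as_array = wrap_index_dummies(int_dummies.astype(bool), names)

print(type(as_multi).__name__)
print(type(as_array).__name__)
```

Under that assumption, a boolean default would change the return type for Index input, which is why the tests pin dtype=np.int64.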

Comment on lines -404 to +412
-if len(labels) == 0:
-    return np.empty(shape=(0, 0), dtype=dtype), labels
-dummies = np.vstack(dummies_pa.to_numpy())
+_dtype = pandas_dtype(dtype)
+dummies_dtype: NpDtype
+if isinstance(_dtype, np.dtype):
+    dummies_dtype = _dtype
+else:
+    dummies_dtype = np.bool_
+if len(labels) == 0:
+    return np.empty(shape=(0, 0), dtype=dummies_dtype), labels
+dummies = np.vstack(dummies_pa.to_numpy())

@komo-fr komo-fr Jan 2, 2025


In the existing implementation, the following code would raise a TypeError: Cannot interpret 'BooleanDtype' as a data type due to the line return np.empty(shape=(0, 0), dtype=dtype):

# Empty Series
sr = pd.Series(dtype="string[pyarrow]")
sr.str.get_dummies(dtype=pd.BooleanDtype())

With this PR, the default dtype is changed to a boolean type, which makes similar issues more likely to occur. To address this, I modified the code to pass dummies_dtype to np.empty() instead of using dtype directly.

Related test: https://github.com/pandas-dev/pandas/blob/main/pandas/tests/strings/test_strings.py#L136
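
The fallback described above can be tried in isolation. The sketch below mirrors the idea with public APIs only; numpy_dummies_dtype is a hypothetical name introduced for illustration and is not a function in pandas.

```python
import numpy as np
import pandas as pd
from pandas.api.types import pandas_dtype


def numpy_dummies_dtype(dtype):
    """Return a NumPy dtype that is safe to pass to np.empty()."""
    resolved = pandas_dtype(dtype)
    if isinstance(resolved, np.dtype):
        return resolved
    # Extension dtypes such as BooleanDtype are not NumPy dtypes,
    # so fall back to NumPy bool for the intermediate array.
    return np.dtype(np.bool_)


# Passing the extension dtype straight to NumPy is what failed before.
try:
    np.empty(shape=(0, 0), dtype=pd.BooleanDtype())
except TypeError as exc:
    print(f"direct use fails: {exc}")

empty = np.empty(shape=(0, 0), dtype=numpy_dummies_dtype(pd.BooleanDtype()))
print(empty.shape, empty.dtype)
```

Only the intermediate NumPy array uses the fallback; the requested dtype can still be applied when the final DataFrame is built.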

Comment on lines 2529 to 2548
input_dtype = self._data.dtype
if dtype is None and not isinstance(input_dtype, ArrowDtype):
    from pandas.core.arrays.string_ import StringDtype

    if isinstance(input_dtype, CategoricalDtype):
        input_dtype = input_dtype.categories.dtype

    if isinstance(input_dtype, ArrowDtype):
        import pyarrow as pa

        dtype = ArrowDtype(pa.bool_())
    elif (
        isinstance(input_dtype, StringDtype)
        and input_dtype.na_value is not np.nan
    ):
        from pandas.core.dtypes.common import pandas_dtype

        dtype = pandas_dtype("boolean")
    else:
        dtype = np.bool_

I based this logic on the existing implementation of pd.get_dummies():
https://github.com/pandas-dev/pandas/blob/v2.2.3/pandas/core/reshape/encoding.py#L252-L269

I added the condition if dtype is None and not isinstance(input_dtype, ArrowDtype): to avoid errors when input_dtype is an ArrowDtype.
Without this exclusion, the default for ArrowDtype input would resolve to ArrowDtype(pa.bool_()), which fails in the same way as passing that dtype explicitly:

sr = pd.Series(["A", "B", "A"], dtype=pd.ArrowDtype(pa.string()))
sr.str.get_dummies(dtype=pd.ArrowDtype(pa.bool_()))

Output (this issue also exists in the implementation before this PR):

...
  File "/Users/komo_fr/P_Project/pandas_workspace/pandas-komo_fr/pandas/core/strings/accessor.py", line 2532, in get_dummies
    DataFrame(result, columns=name, dtype=dtype),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from list<item: bool> to bool using function cast_boolean

With this PR, the default dtype is changed to a boolean type, which makes similar issues more likely to occur.
Since I wasn’t sure how to fully resolve this problem and it could lead to a much larger PR, I chose to exclude ArrowDtype cases for now.
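
The default-dtype selection discussed in this thread can be sketched as a standalone function using only public pandas names. default_dummies_dtype is a hypothetical helper written for illustration; it follows the branch order of the snippet under review, including the choice to leave top-level ArrowDtype input untouched.

```python
import numpy as np
import pandas as pd


def default_dummies_dtype(input_dtype):
    """Pick a default boolean dtype for a given input dtype (illustrative)."""
    # Top-level ArrowDtype input is excluded, as in the PR: no default is chosen.
    if isinstance(input_dtype, pd.ArrowDtype):
        return None
    # For categoricals, decide based on the dtype of the categories.
    if isinstance(input_dtype, pd.CategoricalDtype):
        input_dtype = input_dtype.categories.dtype
    if isinstance(input_dtype, pd.ArrowDtype):
        import pyarrow as pa

        return pd.ArrowDtype(pa.bool_())
    # NA-backed string dtypes map to the nullable boolean dtype.
    if isinstance(input_dtype, pd.StringDtype) and input_dtype.na_value is not np.nan:
        return pd.BooleanDtype()
    return np.dtype(bool)


object_series = pd.Series(["A", "B", "A"], dtype=object)
category_series = pd.Series(["A", "B", "A"], dtype="category")

print(default_dummies_dtype(object_series.dtype))
print(default_dummies_dtype(pd.StringDtype("python")))
print(default_dummies_dtype(category_series.dtype))
```

Returning None for ArrowDtype input stands in for "keep the existing code path" and is a simplification made for this sketch.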
