ENH: Change default dtype of str.get_dummies() to bool #60641
 def test_get_dummies_index():
     # GH9980, GH8028
     idx = Index(["a|b", "a|c", "b|c"])
-    result = idx.str.get_dummies("|")
+    result = idx.str.get_dummies("|", dtype=np.int64)
Returning a MultiIndex when the input data is a pd.Index relies on the result dtype not being a boolean type:
https://github.com/pandas-dev/pandas/blob/main/pandas/core/strings/accessor.py#L381-L389
With this PR, the default dtype of str.get_dummies() becomes boolean. To keep these test cases consistent with the intended behavior, I modified them to specify the dtype explicitly.
-if len(labels) == 0:
-    return np.empty(shape=(0, 0), dtype=dtype), labels
-dummies = np.vstack(dummies_pa.to_numpy())
+_dtype = pandas_dtype(dtype)
+dummies_dtype: NpDtype
+if isinstance(_dtype, np.dtype):
+    dummies_dtype = _dtype
+else:
+    dummies_dtype = np.bool_
+if len(labels) == 0:
+    return np.empty(shape=(0, 0), dtype=dummies_dtype), labels
+dummies = np.vstack(dummies_pa.to_numpy())
In the existing implementation, the following code raises TypeError: Cannot interpret 'BooleanDtype' as a data type, because of the line return np.empty(shape=(0, 0), dtype=dtype):
# Empty Series
sr = pd.Series(dtype="string[pyarrow]")
sr.str.get_dummies(dtype=pd.BooleanDtype())
With this PR, the default dtype becomes a boolean type, which makes similar issues more likely to occur. To address this, I modified the code to pass dummies_dtype to np.empty() instead of using dtype directly.
Related test: https://github.com/pandas-dev/pandas/blob/main/pandas/tests/strings/test_strings.py#L136
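The fallback described in this comment can be sketched in isolation. The helper names resolve_numpy_dtype and empty_dummies are hypothetical and not part of pandas; the sketch only assumes that pandas_dtype() returns either a NumPy dtype or an extension dtype, and that falling back to NumPy bool for extension dtypes is acceptable for the intermediate array:

```python
import numpy as np
import pandas as pd
from pandas.api.types import pandas_dtype


def resolve_numpy_dtype(dtype):
    """Return a dtype that np.empty() accepts (hypothetical helper).

    Extension dtypes such as BooleanDtype are not NumPy dtypes, so they
    fall back to NumPy bool here, mirroring the diff above.
    """
    resolved = pandas_dtype(dtype)
    if isinstance(resolved, np.dtype):
        return resolved
    return np.dtype(np.bool_)


def empty_dummies(dtype):
    """Build the empty (0, 0) dummies array without passing an extension dtype to NumPy."""
    return np.empty(shape=(0, 0), dtype=resolve_numpy_dtype(dtype))


# Passing the extension dtype straight to NumPy fails, as reported above.
try:
    np.empty(shape=(0, 0), dtype=pd.BooleanDtype())
except TypeError as exc:
    print(f"TypeError: {exc}")

print(empty_dummies(pd.BooleanDtype()).dtype)
```

The requested extension dtype is then applied later, when the result is wrapped in a DataFrame, rather than at NumPy array creation.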
pandas/core/strings/accessor.py
Outdated
input_dtype = self._data.dtype
if dtype is None and not isinstance(input_dtype, ArrowDtype):
    from pandas.core.arrays.string_ import StringDtype

    if isinstance(input_dtype, CategoricalDtype):
        input_dtype = input_dtype.categories.dtype

    if isinstance(input_dtype, ArrowDtype):
        import pyarrow as pa

        dtype = ArrowDtype(pa.bool_())
    elif (
        isinstance(input_dtype, StringDtype)
        and input_dtype.na_value is not np.nan
    ):
        from pandas.core.dtypes.common import pandas_dtype

        dtype = pandas_dtype("boolean")
    else:
        dtype = np.bool_
I based this logic on the existing implementation of pd.get_dummies():
https://github.com/pandas-dev/pandas/blob/v2.2.3/pandas/core/reshape/encoding.py#L252-L269
I added the condition if dtype is None and not isinstance(input_dtype, ArrowDtype): to avoid errors when input_dtype is an ArrowDtype. Without this exclusion, the default would resolve to an Arrow boolean dtype, which fails in the same way as passing that dtype explicitly:
sr = pd.Series(["A", "B", "A"], dtype=pd.ArrowDtype(pa.string()))
sr.str.get_dummies(dtype=pd.ArrowDtype(pa.bool_()))
Output (this issue also exists in the implementation before this PR):
...
File "/Users/komo_fr/P_Project/pandas_workspace/pandas-komo_fr/pandas/core/strings/accessor.py", line 2532, in get_dummies
DataFrame(result, columns=name, dtype=dtype),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from list<item: bool> to bool using function cast_boolean
With this PR, the default dtype is changed to a boolean type, which makes similar issues more likely to occur.
Since I wasn’t sure how to fully resolve this problem and it could lead to a much larger PR, I chose to exclude ArrowDtype cases for now.
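The mapping from input dtype to boolean result dtype can be tried out as a standalone function. This is a minimal sketch using only public pandas names; infer_dummies_dtype is a hypothetical helper, not the code in the PR, and unlike the PR it does not skip Arrow-backed input, so it only illustrates the intended mapping:

```python
import numpy as np
import pandas as pd
from pandas.api.types import pandas_dtype


def infer_dummies_dtype(input_dtype):
    """Pick a boolean result dtype matching the input dtype (hypothetical helper)."""
    # For categorical data, look at the dtype of the categories instead.
    if isinstance(input_dtype, pd.CategoricalDtype):
        input_dtype = input_dtype.categories.dtype

    if isinstance(input_dtype, pd.ArrowDtype):
        import pyarrow as pa  # only needed for Arrow-backed input

        return pd.ArrowDtype(pa.bool_())
    if (
        isinstance(input_dtype, pd.StringDtype)
        and input_dtype.na_value is not np.nan
    ):
        # NA-aware string dtype: use the nullable boolean extension dtype.
        return pandas_dtype("boolean")
    # Everything else (object dtype, NaN-based strings) gets NumPy bool.
    return np.bool_


print(infer_dummies_dtype(np.dtype(object)))
print(infer_dummies_dtype(pd.StringDtype("python")))
```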
doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
Overview
This PR changes the default dtype of the result from str.get_dummies() from np.int64 to boolean (bool or boolean). This modification ensures consistent behavior between pd.get_dummies() and str.get_dummies(). Additionally, the implementation now adapts to the input data type to return the corresponding boolean type.
Background
Currently, pd.get_dummies() returns a boolean dtype by default, while str.get_dummies() returns an integer dtype (np.int64). This inconsistency may cause confusion for users.
- The default dtype of pd.get_dummies() was changed to bool.
- pd.get_dummies() was updated to return a corresponding boolean dtype (e.g., boolean[pyarrow]) when the input is an ExtensionArray.
This PR aligns the behavior of str.get_dummies() with these changes.
Changes Made
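- Changed the default dtype of str.get_dummies() from np.int64 to boolean.
- The result uses the corresponding boolean dtype when the input is an ExtensionArray. For example, when the input is of type string[pyarrow], the result will be of type boolean[pyarrow].
Behavior Before and After the Change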
Before
After
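To check which default an installed pandas version uses, a minimal sketch along these lines may help. The helper name compare_dummy_defaults is hypothetical, and no particular output is assumed, since the resulting dtypes depend on the pandas version:

```python
import pandas as pd


def compare_dummy_defaults(values):
    """Return the default results of both get_dummies variants (hypothetical helper)."""
    s = pd.Series(values)
    top_level = pd.get_dummies(s)    # top-level function
    accessor = s.str.get_dummies()   # string accessor, default sep="|"
    return top_level, accessor


top_level, accessor = compare_dummy_defaults(["a", "b", "a"])
# The indicator values agree; only the dtypes may differ between versions.
print(top_level.dtypes.unique())
print(accessor.dtypes.unique())
```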
Additional Notes
The following code demonstrates the difference in behavior between pd.get_dummies() and str.get_dummies() before the change:
Output Before the Change:
(Note: In case 4, the outputs of pd.get_dummies() and str.get_dummies() already match.)
Output After the Change: