ENH: Change default dtype of str.get_dummies() to bool #60641

komo-fr · 2025-01-02T08:11:29Z

closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Overview

This PR changes the default dtype of the result of str.get_dummies() from np.int64 to a boolean dtype (NumPy bool or the nullable boolean extension dtype, depending on the input). This makes str.get_dummies() consistent with pd.get_dummies(). The implementation also adapts to the input dtype and returns the corresponding boolean dtype.

Background

Currently, pd.get_dummies() returns a boolean dtype by default, while str.get_dummies() returns an integer dtype (np.int64). This inconsistency may cause confusion for users.

In PR ENH: change get_dummies default dtype to bool #48022, the default dtype of pd.get_dummies() was changed to bool.
In PR ENH: Make get_dummies return ea booleans for ea inputs #56291, pd.get_dummies() was updated to return a corresponding boolean dtype (e.g., bool[pyarrow]) when the input is an ExtensionArray.

This PR aligns the behavior of str.get_dummies() with these changes.

Changes Made

Changed the default dtype of the result of str.get_dummies() from np.int64 to a boolean dtype.
Implemented support for returning the corresponding boolean dtype when the input data is an ExtensionArray. For example, a Series with dtype pd.StringDtype("pyarrow") now yields boolean, and a categorical Series whose categories use pd.ArrowDtype(pa.string()) now yields bool[pyarrow].

Behavior Before and After the Change

Before

>>> sr = pd.Series(["A", "B", "A"])
>>> sr.str.get_dummies()
   A  B
0  1  0
1  0  1
2  1  0

After

>>> sr = pd.Series(["A", "B", "A"])
>>> sr.str.get_dummies()
       A      B
0   True  False
1  False   True
2   True  False
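For code that must behave the same before and after this change, one option is to cast the result explicitly instead of relying on the default. The following is a minimal sketch of that idea, not part of this PR's diff; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

sr = pd.Series(["A", "B", "A"])
dummies = sr.str.get_dummies()

# Cast explicitly so downstream code does not depend on the default dtype
# (int64 before this change, a boolean dtype after it).
as_int = dummies.astype(np.int64)
as_bool = dummies.astype(bool)

print(as_int.dtypes.tolist())
print(as_bool.dtypes.tolist())
```

Casting after the call works on either side of the change, at the cost of one extra copy of the indicator frame.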

Additional Notes

The following code demonstrates the difference in behavior between pd.get_dummies() and str.get_dummies() before the change:

import numpy as np
import pandas as pd
import pyarrow as pa

sr_list = [pd.Series(["A", "B", "A"]),
           pd.Series(["A", "B", "A"], dtype=pd.StringDtype()),
           pd.Series(["A", "B", "A"], dtype=pd.StringDtype("pyarrow")),
           pd.Series(["A", "B", "A"], dtype=pd.StringDtype("pyarrow_numpy")),
           pd.Series(["A", "B", "A"], dtype=pd.ArrowDtype(pa.string())),
           pd.Series(["A", "B", "A"], dtype="category"),
           pd.Series(["A", "B", "A"], dtype=pd.CategoricalDtype(pd.Index(["A", "B"], dtype=pd.ArrowDtype(pa.string()))))
]

for i, sr in enumerate(sr_list):
    print(f"----- case {i}. {sr.dtype=} -----")
    print(f"pd.get_dummies: {pd.get_dummies(sr)['A'].dtype}")
    print(f"str.get_dummies: {sr.str.get_dummies()['A'].dtype}")

Output Before the Change:

(Note: In case 4, the outputs of pd.get_dummies() and str.get_dummies() already match.)

----- case 0. sr.dtype=dtype('O') -----
pd.get_dummies: bool
str.get_dummies: int64
----- case 1. sr.dtype=string[python] -----
pd.get_dummies: boolean
str.get_dummies: int64
----- case 2. sr.dtype=string[pyarrow] -----
pd.get_dummies: boolean
str.get_dummies: int64
----- case 3. sr.dtype=str -----
pd.get_dummies: bool
str.get_dummies: int64
----- case 4. sr.dtype=string[pyarrow] -----
pd.get_dummies: bool[pyarrow]
str.get_dummies: bool[pyarrow]
----- case 5. sr.dtype=CategoricalDtype(categories=['A', 'B'], ordered=False, categories_dtype=object) -----
pd.get_dummies: bool
str.get_dummies: int64
----- case 6. sr.dtype=CategoricalDtype(categories=['A', 'B'], ordered=False, categories_dtype=string[pyarrow]) -----
pd.get_dummies: bool[pyarrow]
str.get_dummies: int64

Output After the Change:

----- case 0. sr.dtype=dtype('O') -----
pd.get_dummies: bool
str.get_dummies: bool
----- case 1. sr.dtype=string[python] -----
pd.get_dummies: boolean
str.get_dummies: boolean
----- case 2. sr.dtype=string[pyarrow] -----
pd.get_dummies: boolean
str.get_dummies: boolean
----- case 3. sr.dtype=str -----
pd.get_dummies: bool
str.get_dummies: bool
----- case 4. sr.dtype=string[pyarrow] -----
pd.get_dummies: bool[pyarrow]
str.get_dummies: bool[pyarrow]
----- case 5. sr.dtype=CategoricalDtype(categories=['A', 'B'], ordered=False, categories_dtype=object) -----
pd.get_dummies: bool
str.get_dummies: bool
----- case 6. sr.dtype=CategoricalDtype(categories=['A', 'B'], ordered=False, categories_dtype=string[pyarrow]) -----
pd.get_dummies: bool[pyarrow]
str.get_dummies: bool[pyarrow]
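
In the outputs above only the dtype differs between the two functions. As a hedged illustration (a sketch for the plain object-dtype case only, with illustrative variable names), the labels and indicator values can be compared independently of dtype:

```python
import numpy as np
import pandas as pd

sr = pd.Series(["A", "B", "A"])
from_top_level = pd.get_dummies(sr)
from_accessor = sr.str.get_dummies()

# Compare column labels and indicator values after normalizing both to bool,
# so the check does not depend on which default dtype each function uses.
same_labels = list(from_top_level.columns) == list(from_accessor.columns)
same_values = np.array_equal(
    from_top_level.to_numpy(dtype=bool),
    from_accessor.to_numpy(dtype=bool),
)
print(same_labels, same_values)
```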

komo-fr · 2025-01-02T08:12:16Z

pandas/tests/strings/test_get_dummies.py

 def test_get_dummies_index():
    # GH9980, GH8028
    idx = Index(["a|b", "a|c", "b|c"])
-    result = idx.str.get_dummies("|")
+    result = idx.str.get_dummies("|", dtype=np.int64)


When the input is a pd.Index, the code path that turns the output into a MultiIndex assumes that the result dtype is not boolean:
https://github.com/pandas-dev/pandas/blob/main/pandas/core/strings/accessor.py#L381-L389

With this PR, the default behavior of str.get_dummies() changes to use a boolean dtype. To ensure the test cases remain consistent with the intended behavior, I modified them to explicitly specify the dtype.

komo-fr · 2025-01-02T08:12:49Z

pandas/core/arrays/string_arrow.py

-        if len(labels) == 0:
-            return np.empty(shape=(0, 0), dtype=dtype), labels
-        dummies = np.vstack(dummies_pa.to_numpy())
        _dtype = pandas_dtype(dtype)
        dummies_dtype: NpDtype
        if isinstance(_dtype, np.dtype):
            dummies_dtype = _dtype
        else:
            dummies_dtype = np.bool_
+        if len(labels) == 0:
+            return np.empty(shape=(0, 0), dtype=dummies_dtype), labels
+        dummies = np.vstack(dummies_pa.to_numpy())


In the existing implementation, the following code would raise a TypeError: Cannot interpret 'BooleanDtype' as a data type due to the line return np.empty(shape=(0, 0), dtype=dtype):

# Empty Series
sr = pd.Series(dtype="string[pyarrow]")
sr.str.get_dummies(dtype=pd.BooleanDtype())

With this PR, the default dtype is changed to a boolean type, which makes similar issues more likely to occur. To address this, I modified the code to pass dummies_dtype to np.empty() instead of using dtype directly.
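
The failure and the fallback can be reproduced outside the accessor. This is a minimal sketch, assuming only public NumPy and pandas APIs; numpy_dummies_dtype is a hypothetical helper name that mirrors the dummies_dtype selection in the diff, not a function in pandas.

```python
import numpy as np
import pandas as pd
from pandas.api.types import pandas_dtype


def numpy_dummies_dtype(dtype):
    """Hypothetical helper: pick a NumPy dtype that np.empty() accepts."""
    resolved = pandas_dtype(dtype)
    if isinstance(resolved, np.dtype):
        return resolved
    # Extension dtypes such as BooleanDtype cannot be passed to np.empty(),
    # so fall back to NumPy bool, as the diff does.
    return np.dtype(np.bool_)


# Passing the extension dtype straight to NumPy fails.
try:
    np.empty(shape=(0, 0), dtype=pd.BooleanDtype())
    raised_type_error = False
except TypeError:
    raised_type_error = True

# Resolving the dtype first gives an empty array NumPy can build.
empty = np.empty(shape=(0, 0), dtype=numpy_dummies_dtype(pd.BooleanDtype()))
print(raised_type_error, empty.shape, empty.dtype)
```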

Related test: https://github.com/pandas-dev/pandas/blob/main/pandas/tests/strings/test_strings.py#L136

komo-fr · 2025-01-02T08:13:46Z

pandas/core/strings/accessor.py

+        input_dtype = self._data.dtype
+        if dtype is None and not isinstance(input_dtype, ArrowDtype):
+            from pandas.core.arrays.string_ import StringDtype
+
+            if isinstance(input_dtype, CategoricalDtype):
+                input_dtype = input_dtype.categories.dtype
+
+            if isinstance(input_dtype, ArrowDtype):
+                import pyarrow as pa
+
+                dtype = ArrowDtype(pa.bool_())
+            elif (
+                isinstance(input_dtype, StringDtype)
+                and input_dtype.na_value is not np.nan
+            ):
+                from pandas.core.dtypes.common import pandas_dtype
+
+                dtype = pandas_dtype("boolean")
+            else:
+                dtype = np.bool_


I based this logic on the existing implementation of pd.get_dummies():
https://github.com/pandas-dev/pandas/blob/v2.2.3/pandas/core/reshape/encoding.py#L252-L269

I added the condition if dtype is None and not isinstance(input_dtype, ArrowDtype): to avoid errors when input_dtype is an ArrowDtype.
The reason is that not excluding ArrowDtype would cause an error with the following code:

sr = pd.Series(["A", "B", "A"], dtype=pd.ArrowDtype(pa.string()))
sr.str.get_dummies(dtype=pd.ArrowDtype(pa.bool_()))

Output (this issue also exists in the implementation before this PR):

...
  File "/Users/komo_fr/P_Project/pandas_workspace/pandas-komo_fr/pandas/core/strings/accessor.py", line 2532, in get_dummies
    DataFrame(result, columns=name, dtype=dtype),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from list<item: bool> to bool using function cast_boolean

With this PR, the default dtype is changed to a boolean type, which makes similar issues more likely to occur.
Since I wasn’t sure how to fully resolve this problem and it could lead to a much larger PR, I chose to exclude ArrowDtype cases for now.
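
The branch order of the default-dtype selection can be sketched as a standalone function using only public pandas names. This is an illustration under assumptions, not the code in accessor.py: default_dummies_dtype is a hypothetical name, and the sketch omits the outer guard described above that leaves dtype unset for ArrowDtype inputs.

```python
import numpy as np
import pandas as pd


def default_dummies_dtype(input_dtype):
    """Hypothetical helper mirroring the branch order in the diff."""
    # For categoricals, decide based on the dtype of the categories.
    if isinstance(input_dtype, pd.CategoricalDtype):
        input_dtype = input_dtype.categories.dtype

    if isinstance(input_dtype, pd.ArrowDtype):
        import pyarrow as pa  # only needed for Arrow-backed inputs

        return pd.ArrowDtype(pa.bool_())
    if (
        isinstance(input_dtype, pd.StringDtype)
        and input_dtype.na_value is not np.nan
    ):
        # Nullable string dtype (missing values are pd.NA): nullable boolean.
        return pd.BooleanDtype()
    # Object dtype and NaN-based string dtypes: plain NumPy bool.
    return np.dtype(bool)


print(default_dummies_dtype(np.dtype(object)))
print(default_dummies_dtype(pd.StringDtype("python")))
print(default_dummies_dtype(pd.CategoricalDtype(["A", "B"])))
```

Resolving the categories' dtype first is what lets a categorical input with Arrow-backed categories reach the ArrowDtype branch, matching case 6 in the output after the change.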

change default dtype of str.get_dummies() to bool

21f2e70


ignore mypy assignment type check in str.get_dummies

1de77e5
