ENH: Implement cum* methods for PyArrow strings #60633

rhshadrach · 2024-12-31T16:07:14Z

closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Implements cumsum, cummin, and cummax for PyArrow-backed strings. I don't see a way to do this without passing through NumPy; if there are other ideas for an approach, I'm happy to give those a shot.
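For readers unfamiliar with what these accumulations mean for strings, here is a minimal pure-Python sketch of the semantics when no values are missing: cumsum is running concatenation, and cummin/cummax are the running lexicographic minimum and maximum. The helper names are hypothetical and this is only an illustration built on `itertools.accumulate`; it is not the code in this PR, which goes through NumPy.

```python
from itertools import accumulate
import operator


def str_cumsum(values):
    # Running concatenation: each element joins everything seen so far.
    return list(accumulate(values, operator.add))


def str_cummin(values):
    # Running lexicographic minimum.
    return list(accumulate(values, min))


def str_cummax(values):
    # Running lexicographic maximum.
    return list(accumulate(values, max))


print(str_cumsum(["x", "z", "y"]))  # ['x', 'xz', 'xzy']
print(str_cummin(["y", "z", "x"]))  # ['y', 'y', 'x']
print(str_cummax(["x", "z", "y"]))  # ['x', 'z', 'z']
```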

cc @WillAyd @jorisvandenbossche

WillAyd · 2024-12-31T16:14:04Z

> I don't see a way to do this without passing through NumPy; if there are other ideas for an approach, I'm happy to give those a shot.

The only other reasonable way I can think of to implement this would be to use nanoarrow, but that's a larger implementation. I think this is fine for now; it's just not very performant, but that can always be improved later.

pandas/tests/apply/test_str.py

WillAyd

lgtm. thanks for getting this started

probably worth adding a note for 3.0 as well

jorisvandenbossche · 2025-01-05T14:22:21Z

pandas/conftest.py

@@ -1317,6 +1317,22 @@ def nullable_string_dtype(request):
    return request.param


+@pytest.fixture(
+    params=[
+        pytest.param("str[pyarrow]", marks=td.skip_if_no("pyarrow")),


I was going to comment that I don't think this can work, although it is then strange that the tests are passing :) It seems this was not doing what I think you expected it to do; see #60661.

I would use the same approach of creating the dtype through StringDtype(..) like in some of the fixtures above

jorisvandenbossche

Thanks!

jorisvandenbossche · 2025-01-05T14:35:10Z

pandas/core/arrays/arrow/array.py

+                nulls = pc.is_null(pa_array)
+                idx = pc.index(nulls, True).as_py()
+                tail = pa.nulls(len(pa_array) - idx, type=pa_array.type)
+                pa_array = pa_array[:idx].combine_chunks()


Is the combine_chunks() call needed here? I would expect the conversion to NumPy (when calling the NumPy function) to do this automatically, and potentially more efficiently.
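For context, the four added lines in this hunk locate the first null, keep only the values before it, and build an all-null tail for the remainder. Judging by the test expectations in this PR, that appears to serve the skipna=False case. A pure-Python sketch of the same idea follows, with None standing in for pd.NA and a hypothetical helper name; it is an illustration, not the PyArrow code under review.

```python
from itertools import accumulate
import operator


def cumsum_propagate_na(values):
    # Index of the first missing value, or len(values) if there is none.
    idx = next((i for i, v in enumerate(values) if v is None), len(values))
    # Accumulate only the part before the first missing value ...
    head = list(accumulate(values[:idx], operator.add))
    # ... and pad the rest of the result with missing values.
    tail = [None] * (len(values) - idx)
    return head + tail


print(cumsum_propagate_na(["x", None, "y"]))  # ['x', None, None]
print(cumsum_propagate_na(["x", "z", "y"]))   # ['x', 'xz', 'xzy']
```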

pandas/tests/apply/test_str.py

jorisvandenbossche · 2025-01-05T14:41:01Z

pandas/tests/series/test_cumulative.py

+            (["x", "z", "y"], "cumsum", False, ["x", "xz", "xzy"]),
+            (["x", pd.NA, "y"], "cumsum", True, ["x", "x", "xy"]),
+            (["x", pd.NA, "y"], "cumsum", False, ["x", pd.NA, pd.NA]),
+            ([pd.NA, "x", "y"], "cumsum", True, ["", "x", "xy"]),


It seems that for numerical data, we actually (somewhat inconsistently?) propagate leading NAs:

In [7]: pd.Series([np.nan, 0.5, 2.5]).cumsum()
Out[7]:
0    NaN
1    0.5
2    3.0
dtype: float64

(i.e. the result doesn't have 0.0 for the first element)

Actually not related to "leading" NAs. It seems what is happening is that missing values are ignored to calculate the cumulative result, but then are propagated to the result elementwise. This is also shown in the docstring example of cumsum, so this seems intentional.

Thanks for catching this. Agreed we should match this behavior. I do find it odd, but that's (possibly) for another day!
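The behavior agreed on above (missing values are skipped when computing the running result, but stay missing at their own positions) can be sketched in pure Python. This uses None as a stand-in for pd.NA and a hypothetical helper name; it illustrates the semantics only and is not the implementation in this PR.

```python
def cumsum_skipna(values):
    result = []
    running = ""  # concatenation of the non-missing values seen so far
    for v in values:
        if v is None:
            # Missing input: ignored in the running total, but the output
            # at this position stays missing.
            result.append(None)
        else:
            running += v
            result.append(running)
    return result


print(cumsum_skipna([None, "x", "y"]))  # [None, 'x', 'xy']
print(cumsum_skipna(["x", None, "y"]))  # ['x', None, 'xy']
```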

ENH: Implement cum* methods for PyArrow strings

8a00df4

rhshadrach added the Enhancement, Strings (String extension data type and string data), Arrow (pyarrow functionality), and Transformations (e.g. cumsum, diff, rank) labels Dec 31, 2024

cleanup

170f2e2

rhshadrach added this to the 2.3 milestone Dec 31, 2024

rhshadrach added 4 commits December 31, 2024 11:16

Cleanup

4ccf0d4

fixup

be726f0

Fix extension tests

bf38cef

xfail test when there is no pyarrow

1f8e36e

rhshadrach commented Jan 1, 2025

mypy fixups

dd8fcbe

rhshadrach requested a review from WillAyd January 4, 2025 15:36

Merge branch 'main' into enh_cum_methods_for_pyarrow_strings

ed895b9

rhshadrach requested a review from jorisvandenbossche January 4, 2025 15:37

WillAyd approved these changes Jan 4, 2025

jorisvandenbossche reviewed Jan 5, 2025

rhshadrach marked this pull request as draft January 5, 2025 17:46

rhshadrach added 4 commits January 5, 2025 14:55

Change logic & whatsnew

46ff2c1

Change logic & whatsnew

f2b448d

Fix fixture

4d11a1d

Fixup

d3468cc

rhshadrach marked this pull request as ready for review January 5, 2025 21:21

rhshadrach requested a review from jorisvandenbossche January 5, 2025 21:21
