-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TST(string dtype): Resolve HDF5 xfails in test_put.py #60625
Conversation
@@ -3185,6 +3186,8 @@ def write_array( | |||
# both self._filters and EA | |||
|
|||
value = extract_array(obj, extract_numpy=True) | |||
if isinstance(value, BaseStringArray): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this logic live in extract_array
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The line from the docstring of extract_array
states:
Extract the ndarray or ExtensionArray from a Series or Index.
and the extract_numpy
argument:
Whether to extract the ndarray from a NumpyExtensionArray.
So I think no - I would expect that function to still return the ExtensionArray.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do the tests currently cover the "string" data type going through pytables? This seems like it might mangle the NA markers?
Not super familiar with pytables so not saying this is right or wrong - just want to double check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are certainly tests with dataframe roundtrip that contain string columns, not entirely sure if there are also tests where those columns have missing values, though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche - I'm not seeing this particular line hit with string dtype. I added a test, including an NA value.
@WillAyd - the current behavior is to write out the underlying objects, and then infer upon loading. So if we start with string an future.infer_string=False
, we get object. When that option is True, we get str.
pandas/tests/io/pytables/test_put.py
Outdated
if using_infer_string: | ||
msg = "Saving a MultiIndex with an extension dtype is not supported." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if using_infer_string: | |
msg = "Saving a MultiIndex with an extension dtype is not supported." | |
if using_infer_string: | |
# TODO(infer_string) make this work for string dtype | |
msg = "Saving a MultiIndex with an extension dtype is not supported." |
(just as a reminder, because ideally we still solve this)
if using_string_dtype() and is_string_array(values, skipna=True): | ||
if ( | ||
using_string_dtype() | ||
and isinstance(values, np.ndarray) | ||
and is_string_array(values, skipna=True) | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same code pattern happens above in SeriesFixed read()
method. Not entirely sure if the same change should be applied there as well, but would expect so (but maybe not covered by any test?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, added a test
|
||
|
||
def test_put_mixed_type(setup_path, performance_warning): | ||
df = DataFrame( | ||
np.random.default_rng(2).standard_normal((10, 4)), | ||
columns=Index(list("ABCD"), dtype=object), | ||
columns=Index(list("ABCD")), | ||
index=date_range("2000-01-01", periods=10, freq="B"), | ||
) | ||
df["obj1"] = "foo" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
df["obj1"] = "foo" | |
df["obj1"] = np.array(["foo"] * 10, dtype=object) | |
df["str"] = pd.Series(["a", None, "b", "c", "d"] * 2).array |
To explicitly test here with both object dtype and string dtype, and with strings with missing values (this is passing locally, so that should address Will's comment at https://github.com/pandas-dev/pandas/pull/60625/files#r1900476124
Co-authored-by: Joris Van den Bossche <[email protected]>
Closing in favor of #60663 |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.It would be more performant to implement a way of writing the string dtypes without converting to NumPy, but this approach gets us correct behavior for now.