Partitioning column dtypes conflict with Pyarrow's handling of Hive partitioning #367

Open · Labels: bug
troyraen (Contributor) opened this issue Oct 2, 2024 · 0 comments

Bug report

Expected behavior

A call to pandas.read_parquet works without having to explicitly specify the partitioning. I expect that to work because read_parquet uses partitioning='hive' by default and hipscat/hats appears to use Hive partitioning.

Actual behavior

That call throws an error.

Minimal reproducible examples

import pandas as pd

# assuming we're in the hipscat-import root directory
small_sky_object_catalog = "tests/hipscat_import/data/small_sky_object_catalog"
ignore_prefixes = [".", "_", "catalog_info.json", "partition_info.csv", "point_map.fits", "provenance_info.json"]

# simplest call that I hope will work
pd.read_parquet(small_sky_object_catalog, ignore_prefixes=ignore_prefixes)

The above throws:

ArrowTypeError: Unable to merge: Field Norder has incompatible types: uint8 vs dictionary<values=int32, indices=int32, ordered=0>

The simplest way to make that call work without throwing an error is to tell it to ignore the partitioning:

pd.read_parquet(small_sky_object_catalog, ignore_prefixes=ignore_prefixes, partitioning=None)

The above is fine when users do not need or want to filter the read. But when filters are wanted, as they likely will be for large catalogs, the calls are much more efficient when they include a filter on recognized partition columns.

The simplest call that results in pyarrow (which is used under the hood) actually recognizing the partitions is:

import pyarrow
import pyarrow.dataset

# NB: Npix will not be recognized as a partition even if it is included here. See issue #368 for details.
partitioning_fields = [
    pyarrow.field(name="Norder", type=pyarrow.uint8()), pyarrow.field(name="Dir", type=pyarrow.uint64())
]
partitioning = pyarrow.dataset.partitioning(schema=pyarrow.schema(partitioning_fields), flavor="hive")

pd.read_parquet(small_sky_object_catalog, ignore_prefixes=ignore_prefixes, partitioning=partitioning)
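
With the partitioning recognized, filters on the partition columns can prune whole directories before any row data is read. A small sketch (the filter values match small_sky's single partition, but are otherwise illustrative):

# prune to a single partition before reading any row data
pd.read_parquet(
    small_sky_object_catalog,
    ignore_prefixes=ignore_prefixes,
    partitioning=partitioning,
    filters=[("Norder", "==", 0), ("Dir", "==", 0)],
)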

The efficiency gain is hard to demonstrate with small_sky_object_catalog so I did a test with the PanSTARRS catalog that is in S3. The call using partitioning=None took about 55 times longer. Here is a screenshot of the results:

(Screenshot: timing comparison of the two calls, 2024-10-02.)

To reproduce that, use:

import pyarrow.fs

panstarrs_basedir = "stpubdata/panstarrs/ps1/public/hipscat/otmo"
fs = pyarrow.fs.S3FileSystem(region="us-east-1", anonymous=True)
filters = [("Norder", "==", 5), ("Dir", "==", 10000), ("Npix", "==", 10013)]

# plus ignore_prefixes and partitioning from above (filesystem= needs pandas >= 2.1)
pd.read_parquet(panstarrs_basedir, filesystem=fs, filters=filters,
                ignore_prefixes=ignore_prefixes, partitioning=partitioning)

Why is this happening

I think what's happening under the hood is:

  1. Pyarrow doesn't expect a Hive-partitioned dataset to contain the partitioning columns in the files themselves. Upon read, it reconstructs those columns by parsing the file paths.
  2. When it finds the partitioning columns in the files, it tries to merge them with the reconstructed columns but trips over the data types and throws the error. It uses dictionary types for the reconstructed columns (see Reading from Partitioned Datasets) but the files do not, as the schema check below illustrates.
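
A quick way to confirm the file side of the mismatch, assuming one of the catalog's leaf files (the exact Norder/Dir/Npix path is illustrative):

import pyarrow.parquet

# the partition columns are physically stored in the file with plain
# integer types; Norder is uint8, not a dictionary type
leaf_file = small_sky_object_catalog + "/Norder=0/Dir=0/Npix=11.parquet"
print(pyarrow.parquet.read_schema(leaf_file).field("Norder").type)  # uint8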

Possible solutions

  1. Use the expected dictionary data types for the partitioning columns when writing the files (sketched after this list).
  2. Don't store the partitioning columns in the files themselves.
  3. Ask Apache to support non-dictionary types if that's how they are defined in the files.
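
A rough sketch of option 1, using a toy table with made-up values in place of a real catalog pixel:

import pyarrow as pa
import pyarrow.parquet

table = pa.table({
    "Norder": pa.array([0, 0], type=pa.uint8()),
    "Dir": pa.array([0, 0], type=pa.uint64()),
    "Npix": pa.array([11, 11], type=pa.uint64()),
    "id": [700, 701],
})

# cast the partition columns to int32 and dictionary-encode them so the
# file schema matches what pyarrow reconstructs from the directory names
# (dictionary<values=int32, indices=int32>)
for name in ("Norder", "Dir", "Npix"):
    i = table.schema.get_field_index(name)
    encoded = table.column(name).cast(pa.int32()).dictionary_encode()
    table = table.set_column(i, name, encoded)

pyarrow.parquet.write_table(table, "Npix=11.parquet")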

Option 1 seems simplest, because I'm guessing option 2 would get significant pushback from folks who want the files to be able to stand alone. A drawback of either is that, after the data is loaded, users who want to perform operations that require numeric types (+, -, etc.) on those columns would have to convert them first. To me, that would still be preferable to the current situation: the user intervention would be both easier (just df.astype({'Norder': int}) rather than a full specification of the partitioning, as sketched below) and required far less often.
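
Hypothetically, once the files store dictionary-typed partition columns, pandas would surface them as categoricals, and the conversion back would be one line:

# convert the categorical partition columns back to integers for arithmetic
df = df.astype({"Norder": "uint8", "Dir": "uint64", "Npix": "uint64"})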


Before submitting
Please check the following:

  • I have described the situation in which the bug arose, including what code was executed, information about my environment, and any applicable data others will need to reproduce the problem.
  • I have included available evidence of the unexpected behavior (including error messages, screenshots, and/or plots) as well as a description of what I expected instead.
  • If I have a solution in mind, I have provided an explanation and/or pseudocode and/or task list.