Partitioning column dtypes conflict with Pyarrow's handling of Hive partitioning #367
Expected behavior

A call to pandas.read_parquet works without having to explicitly specify the partitioning. I expect that to work because it uses partitioning='hive' by default, and hipscat/hats seems to use Hive partitioning.

Actual behavior
That call throws an error.
Minimal reproducible examples
The above throws:
The simplest way to make that call work without throwing an error is to tell it to ignore the partitioning:
The above is fine when users do not need or want to add a filter to the read. But if filters are wanted, which is likely to be necessary for large catalogs, the calls will be much more efficient when they include filters on recognized partition columns.
The simplest call that results in pyarrow (which is used under the hood) actually recognizing the partitions is:
The efficiency gain is hard to demonstrate with small_sky_object_catalog, so I did a test with the PanSTARRS catalog that is in S3. The call using partitioning=None took about 55 times longer. Here is a screenshot of the results:

To reproduce that, use:
Why is this happening
I think what's happening under the hood is that the partition columns (at least Norder) are stored inside the parquet files with one dtype, pyarrow's Hive partitioning inference assigns them another (e.g. int32), and pyarrow errors when it tries to merge the two schemas.
Possible solutions
Option 1 seems simplest because I'm guessing option 2 would get significant pushback from folks who want the files to be able to stand alone. A drawback with either is that, after the data is loaded, users who want to perform operations requiring numeric types (+, -, etc.) on those columns would have to convert them first. To me, that would be preferable to the current situation because the user intervention would be both easier (just df.astype({'Norder': int}) rather than a full specification of the partitioning) and would be required far less often.

Before submitting
Please check the following: