Rename ancillary files like "catalog_info.json" to start with "_" so they will be ignored by default.
Goal
Allow users to make a simple call to pandas.read_parquet (or other standard python parquet readers) without having to specify the ignore_prefixes keyword argument.
Details
Currently, the simplest call that works seems to be:
import pandas as pd

# assuming we're in the hipscat-import root directory
small_sky_object_catalog = "tests/hipscat_import/data/small_sky_object_catalog"

pd.read_parquet(
    small_sky_object_catalog,
    partitioning=None,  # see issue #367 for why this is necessary
    ignore_prefixes=[
        ".",
        "_",
        "catalog_info.json",
        "partition_info.csv",
        "point_map.fits",
        "provenance_info.json",
    ],
)
It's cumbersome to have to specify the ignore_prefixes kwarg every time, but without it that call throws the error:
ArrowInvalid: Could not open Parquet input source 'tests/hipscat_import/data/small_sky_object_catalog/partition_info.csv': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
Filenames that start with "." or "_" are ignored by default, so renaming the ancillary files to start with "_" would allow the user to skip the ignore_prefixes kwarg.
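As a rough illustration of the rule described above, the following sketch lists the files in a catalog directory that a reader with the default ignore prefixes would still try to open as parquet. The helper name `files_breaking_default_read` is hypothetical, and the check is a simplification that looks only at file names; it is not the actual pyarrow logic.

```python
import tempfile
from pathlib import Path

# Prefixes that, per the issue text, are ignored by default.
DEFAULT_IGNORE_PREFIXES = (".", "_")


def files_breaking_default_read(catalog_dir):
    """Return names of files that are neither skipped by default nor parquet.

    Hypothetical helper: a simplified, name-only imitation of the default
    ignore rule, meant to show which ancillary files would cause trouble.
    """
    offenders = []
    for path in sorted(Path(catalog_dir).rglob("*")):
        if not path.is_file():
            continue
        if path.name.startswith(DEFAULT_IGNORE_PREFIXES):
            continue  # would be ignored by default
        if path.suffix != ".parquet":
            offenders.append(path.name)
    return offenders


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("catalog_info.json", "_metadata", "Npix=0.parquet"):
        (root / name).touch()
    print(files_breaking_default_read(root))  # ['catalog_info.json']
```

If "catalog_info.json" were renamed to "_catalog_info.json", the same check would return an empty list, which is the outcome this request is after.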
Before submitting
Please check the following:
I have described the purpose of the suggested change, specifying what I need the enhancement to accomplish, i.e. what problem it solves.
I have included any relevant links, screenshots, environment information, and data relevant to implementing the requested feature, as well as pseudocode for how I want to access the new functionality.
If I have ideas for how the new feature could be implemented, I have provided explanations and/or pseudocode and/or task lists for the steps.
The problem to be solved by a different directory structure:
The directory structure we are proposing follows:
<catalog_dir>/
    properties
    partition_info.csv
    <index.html, etc>
    dataset/
        _common_metadata
        _metadata
        Norder=K/
            Dir=J/
                Npix=M.parquet
In this way, the <catalog_dir>/dataset/ directory would be, by itself, a totally valid parquet dataset that can be read by many off-the-shelf parquet libraries.
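To make the proposed layout concrete, here is a minimal sketch that builds the path of one partition file under `<catalog_dir>/dataset/`. The function name `partition_path` and its parameters are assumptions for illustration only; how K, J, and M are derived is not covered here.

```python
from pathlib import Path


def partition_path(catalog_dir, norder, dir_, npix):
    """Build <catalog_dir>/dataset/Norder=K/Dir=J/Npix=M.parquet.

    Hypothetical helper that only mirrors the proposed directory layout.
    """
    return (
        Path(catalog_dir)
        / "dataset"
        / f"Norder={norder}"
        / f"Dir={dir_}"
        / f"Npix={npix}.parquet"
    )


print(partition_path("my_catalog", 1, 0, 44).as_posix())
# my_catalog/dataset/Norder=1/Dir=0/Npix=44.parquet
```

With this layout, the ancillary files sit one level above `dataset/`, so pointing a standard parquet reader at `<catalog_dir>/dataset/` should no longer need `ignore_prefixes`.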