Add "dataset" subdirectory for all parquet files #366

troyraen · 2024-10-02T20:36:28Z

Feature request

Request

Rename ancillary files like "catalog_info.json" to start with "_" so they will be ignored by default.

Goal

Allow users to make a simple call to pandas.read_parquet (or other standard python parquet readers) without having to specify the ignore_prefixes keyword argument.

Details

Currently, the simplest call that works seems to be:

import pandas as pd

# assuming we're in the hipscat-import root directory
small_sky_object_catalog = "tests/hipscat_import/data/small_sky_object_catalog"

pd.read_parquet(
    small_sky_object_catalog,
    partitioning=None,  # see issue #367 for why this is necessary
    ignore_prefixes=[
        ".",
        "_",
        "catalog_info.json",
        "partition_info.csv",
        "point_map.fits",
        "provenance_info.json",
    ],
)

It's cumbersome to have to specify the ignore_prefixes kwarg every time, but without it that call throws the error:

ArrowInvalid: Could not open Parquet input source 'tests/hipscat_import/data/small_sky_object_catalog/partition_info.csv': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

Filenames that start with "." or "_" are ignored by default, so renaming the ancillary files to start with "_" would allow the user to skip the ignore_prefixes kwarg.
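The effect of the proposed rename can be sketched with a simplified model of prefix-based file discovery. This is not pyarrow's actual implementation: the helper names `is_discovered` and `discovered_files` are hypothetical, the file lists are illustrative, and the rule modelled here (skip any path whose file or directory name starts with an ignored prefix) is an assumption about how the default `ignore_prefixes` of "." and "_" behaves.

```python
from pathlib import PurePosixPath

# Assumed default, as described in the issue: names starting with "." or "_" are skipped.
DEFAULT_IGNORE_PREFIXES = (".", "_")


def is_discovered(relative_path, ignore_prefixes=DEFAULT_IGNORE_PREFIXES):
    """Return True if no segment of the path starts with an ignored prefix."""
    parts = PurePosixPath(relative_path).parts
    return not any(part.startswith(tuple(ignore_prefixes)) for part in parts)


def discovered_files(paths, ignore_prefixes=DEFAULT_IGNORE_PREFIXES):
    """Keep only the paths a reader would try to open as parquet (simplified model)."""
    return [p for p in paths if is_discovered(p, ignore_prefixes)]


# Illustrative listing of a catalog directory as it is laid out today.
current_layout = [
    "_common_metadata",
    "_metadata",
    "catalog_info.json",
    "partition_info.csv",
    "point_map.fits",
    "provenance_info.json",
    "Norder=0/Dir=0/Npix=11.parquet",
]

# The same listing with the ancillary files renamed to start with "_".
renamed_layout = [
    "_common_metadata",
    "_metadata",
    "_catalog_info.json",
    "_partition_info.csv",
    "_point_map.fits",
    "_provenance_info.json",
    "Norder=0/Dir=0/Npix=11.parquet",
]

print(discovered_files(current_layout))  # ancillary files are still picked up
print(discovered_files(renamed_layout))  # only the parquet file remains
```

Under this model, the current layout leaves the four ancillary files in the list handed to the parquet reader, which is consistent with the ArrowInvalid error above, while the renamed layout leaves only the parquet file.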

Before submitting
Please check the following:

I have described the purpose of the suggested change, specifying what I need the enhancement to accomplish, i.e. what problem it solves.
I have included any relevant links, screenshots, environment information, and data relevant to implementing the requested feature, as well as pseudocode for how I want to access the new functionality.
If I have ideas for how the new feature could be implemented, I have provided explanations and/or pseudocode and/or task lists for the steps.


nevencaplar · 2024-10-11T17:35:07Z

The problem to be solved by a different directory structure:

The directory structure we are proposing follows:

<catalog_dir>/
- properties
- partition_info.csv
- <index.html, etc>
- dataset/
  - _common_metadata
  - _metadata
  - Norder=K/
    - Dir=J/
      - Npix=M.parquet

In this way, the <catalog_dir>/dataset/ directory would be, by itself, a totally valid parquet dataset that can be read by many off-the-shelf parquet libraries.

troyraen added the enhancement New feature or request label Oct 2, 2024

delucchi-cmu added this to HATS / LSDB Oct 2, 2024

nevencaplar added this to the HATS 0.4 milestone Oct 11, 2024

nevencaplar moved this to Todo in HATS / LSDB Oct 11, 2024

delucchi-cmu changed the title ~~Rename ancillary files to start with "_"?~~ Add "dataset" subdirectory for all parquet files Oct 11, 2024

delucchi-cmu mentioned this issue Oct 15, 2024

Insert dataset dir and use general ra/dec column names. #377

Merged

4 tasks

delucchi-cmu closed this as completed Oct 16, 2024

github-project-automation bot moved this from Todo to Done in HATS / LSDB Oct 16, 2024
