Add "dataset" subdirectory for all parquet files #366

Closed
3 tasks done
troyraen opened this issue Oct 2, 2024 · 1 comment
@troyraen
Contributor

troyraen commented Oct 2, 2024

Feature request

Request

Rename ancillary files like "catalog_info.json" to start with "_" so they will be ignored by default.

Goal

Allow users to make a simple call to pandas.read_parquet (or other standard python parquet readers) without having to specify the ignore_prefixes keyword argument.

Details

Currently, the simplest call that works seems to be:

import pandas as pd

# assuming we're in the hipscat-import root directory
small_sky_object_catalog = "tests/hipscat_import/data/small_sky_object_catalog"

pd.read_parquet(
    small_sky_object_catalog,
    partitioning=None,  # see issue #367 for why this is necessary
    ignore_prefixes=[
        ".",
        "_",
        "catalog_info.json",
        "partition_info.csv",
        "point_map.fits",
        "provenance_info.json",
    ],
)

It's cumbersome to have to specify the ignore_prefixes kwarg every time, but without it that call throws the error:

ArrowInvalid: Could not open Parquet input source 'tests/hipscat_import/data/small_sky_object_catalog/partition_info.csv': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

Filenames that start with "." or "_" are ignored by default, so renaming the ancillary files to start with "_" would allow the user to skip the ignore_prefixes kwarg.
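To make the effect of the rename concrete, here is a minimal sketch of the prefix rule described above. It is a simplified model and not pyarrow's actual implementation: the helper name `visible_to_reader` is hypothetical, it checks only the base name of each path, and the pixel file path is an illustrative example.

```python
from pathlib import PurePosixPath

# Default prefixes named in the issue; names starting with these are skipped.
DEFAULT_IGNORE_PREFIXES = (".", "_")


def visible_to_reader(path, ignore_prefixes=DEFAULT_IGNORE_PREFIXES):
    """Return True if a reader applying the prefix rule would try to open `path`.

    Simplified model (assumption): only the base name is checked.
    """
    name = PurePosixPath(path).name
    return not name.startswith(tuple(ignore_prefixes))


# Illustrative file listings: the layout today vs. after the proposed rename.
current = [
    "catalog_info.json",
    "partition_info.csv",
    "_metadata",
    "Norder=0/Dir=0/Npix=11.parquet",
]
renamed = [
    "_catalog_info.json",
    "_partition_info.csv",
    "_metadata",
    "Norder=0/Dir=0/Npix=11.parquet",
]

picked_up_now = [p for p in current if visible_to_reader(p)]
picked_up_after_rename = [p for p in renamed if visible_to_reader(p)]
```

In this model, `picked_up_now` still contains the JSON and CSV files, which is what triggers the "Parquet magic bytes not found" error, while `picked_up_after_rename` contains only the parquet file.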


Before submitting
Please check the following:

  • I have described the purpose of the suggested change, specifying what I need the enhancement to accomplish, i.e. what problem it solves.
  • I have included any relevant links, screenshots, environment information, and data relevant to implementing the requested feature, as well as pseudocode for how I want to access the new functionality.
  • If I have ideas for how the new feature could be implemented, I have provided explanations and/or pseudocode and/or task lists for the steps.
@troyraen troyraen added the enhancement New feature or request label Oct 2, 2024
@nevencaplar
Member

The problem is to be solved by a different directory structure.

The directory structure we are proposing follows:

  • <catalog_dir>/
    • properties
    • partition_info.csv
    • <index.html, etc>
    • dataset/
      • _common_metadata
      • _metadata
      • Norder=K/
        • Dir=J/
          • Npix=M.parquet

In this way, the <catalog_dir>/dataset/ directory would be, by itself, a totally valid parquet dataset that can be read by many off-the-shelf parquet libraries.
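The proposed layout can be checked with a small standard-library sketch. It builds the tree with empty placeholder files (not real parquet data) and lists what a prefix-filtering reader would treat as data files. The helper names `build_layout` and `data_files` are hypothetical, the values K=0, J=0, M=11 are illustrative, and the filter is a simplified model that looks only at each file's base name.

```python
import tempfile
from pathlib import Path


def build_layout(root):
    """Create empty placeholder files mirroring the proposed layout."""
    root = Path(root)
    leaf_dir = root / "dataset" / "Norder=0" / "Dir=0"  # illustrative K=0, J=0
    leaf_dir.mkdir(parents=True)
    for rel in (
        "properties",
        "partition_info.csv",
        "dataset/_common_metadata",
        "dataset/_metadata",
        "dataset/Norder=0/Dir=0/Npix=11.parquet",  # illustrative M=11
    ):
        (root / rel).touch()
    return root


def data_files(dataset_dir, ignore_prefixes=(".", "_")):
    """List files a prefix-filtering reader would treat as data (simplified)."""
    dataset_dir = Path(dataset_dir)
    return sorted(
        p.relative_to(dataset_dir).as_posix()
        for p in dataset_dir.rglob("*")
        if p.is_file() and not p.name.startswith(tuple(ignore_prefixes))
    )


with tempfile.TemporaryDirectory() as tmp:
    catalog_dir = build_layout(tmp)
    # Pointing at the catalog root still exposes the ancillary files.
    whole_catalog = data_files(catalog_dir)
    # Pointing at dataset/ leaves only the parquet pixel files.
    dataset_only = data_files(catalog_dir / "dataset")
```

Under these assumptions, `whole_catalog` still includes `properties` and `partition_info.csv`, whereas `dataset_only` contains nothing but `.parquet` files, so a reader pointed at `<catalog_dir>/dataset/` would not need an `ignore_prefixes` override.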

@nevencaplar nevencaplar added this to the HATS 0.4 milestone Oct 11, 2024
@nevencaplar nevencaplar moved this to Todo in HATS / LSDB Oct 11, 2024
@delucchi-cmu delucchi-cmu changed the title Rename ancillary files to start with "_"? Add "dataset" subdirectory for all parquet files Oct 11, 2024
@github-project-automation github-project-automation bot moved this from Todo to Done in HATS / LSDB Oct 16, 2024