
Could not read schema from 'https://data.lsdb.io/hats/gaia_dr3/gaia/' #489

Closed
OliviaLynn opened this issue Nov 1, 2024 · 3 comments
Labels: bug (Something isn't working)

Comments

@OliviaLynn (Member)

Bug report
I encounter unexpected output in the second code cell of the Topic: Manual catalog verification demo.
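
For reference, the failing cell boils down to the call below (a minimal sketch; the variable name and import path are inferred from the traceback further down):

```python
from hats.io.validation import is_valid_catalog

# Remote HATS catalog used in the demo notebook.
gaia_catalog_path = "https://data.lsdb.io/hats/gaia_dr3/gaia/"

# strict=True runs the full set of checks; verbose=True prints the
# partition count and sky-coverage summary shown in the docs.
is_valid_catalog(gaia_catalog_path, verbose=True, fail_fast=True, strict=True)
```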

I would expect to get (as in the doc):

Validating catalog at path https://data.lsdb.io/hats/gaia_dr3/gaia/ ...
Found 3933 partitions.
Approximate coverage is 100.00 % of the sky.
True

Instead, I get:

{
	"name": "ArrowInvalid",
	"message": "Error creating dataset. Could not read schema from 'https://data.lsdb.io/hats/gaia_dr3/gaia/'. Is this a 'parquet' file?: Could not open Parquet input source 'https://data.lsdb.io/hats/gaia_dr3/gaia/': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.",
	"stack": "---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[2], line 1
----> 1 is_valid_catalog(gaia_catalog_path, verbose=True, fail_fast=True, strict=True)

File ~/.local/lib/python3.11/site-packages/hats/io/validation.py:125, in is_valid_catalog(pointer, strict, fail_fast, verbose)
    113 ignore_prefixes = [
    114     \"_common_metadata\",
    115     \"_metadata\",
   (...)
    121     \"README\",
    122 ]
    124 # As a side effect, this confirms that we can load the directory as a valid dataset.
--> 125 (dataset_path, dataset) = read_parquet_dataset(
    126     pointer,
    127     ignore_prefixes=ignore_prefixes,
    128     exclude_invalid_files=False,
    129 )
    131 parquet_path_pixels = []
    132 for hats_file in dataset.files:

File ~/.local/lib/python3.11/site-packages/hats/io/file_io/file_io.py:190, in read_parquet_dataset(source, **kwargs)
    187     file_system = source.fs
    188     source = source.path
--> 190 dataset = pds.dataset(
    191     source,
    192     filesystem=file_system,
    193     format="parquet",
    194     **kwargs,
    195 )
    196 return (str(source), dataset)

File ~/.conda/envs/lsdb/lib/python3.11/site-packages/pyarrow/dataset.py:794, in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
    783 kwargs = dict(
    784     schema=schema,
    785     filesystem=filesystem,
   (...)
    790     selector_ignore_prefixes=ignore_prefixes
    791 )
    793 if _is_path_like(source):
--> 794     return _filesystem_dataset(source, **kwargs)
    795 elif isinstance(source, (tuple, list)):
    796     if all(_is_path_like(elem) or isinstance(elem, FileInfo) for elem in source):

File ~/.conda/envs/lsdb/lib/python3.11/site-packages/pyarrow/dataset.py:486, in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
    478 options = FileSystemFactoryOptions(
    479     partitioning=partitioning,
    480     partition_base_dir=partition_base_dir,
    481     exclude_invalid_files=exclude_invalid_files,
    482     selector_ignore_prefixes=selector_ignore_prefixes
    483 )
    484 factory = FileSystemDatasetFactory(fs, paths_or_selector, format, options)
--> 486 return factory.finish(schema)

File ~/.conda/envs/lsdb/lib/python3.11/site-packages/pyarrow/_dataset.pyx:3126, in pyarrow._dataset.DatasetFactory.finish()

File ~/.conda/envs/lsdb/lib/python3.11/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()

File ~/.conda/envs/lsdb/lib/python3.11/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

ArrowInvalid: Error creating dataset. Could not read schema from 'https://data.lsdb.io/hats/gaia_dr3/gaia/'. Is this a 'parquet' file?: Could not open Parquet input source 'https://data.lsdb.io/hats/gaia_dr3/gaia/': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file."
}

I am running this in a Python notebook on USDF. I'm using lsdb via pip install 'lsdb[full]'.

Maybe I've missed an installation step? The previous cell runs fine, as do all of the earlier tutorial notebooks.

OliviaLynn added the bug label Nov 1, 2024
@delucchi-cmu (Contributor)

Yes, this has been addressed in astronomy-commons/hats#404, but the fix has not been released yet. Does this still occur if you install hats from the current main branch?
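
One way to try that (a sketch; the repository URL is assumed from the astronomy-commons/hats reference above) is to install directly from GitHub with `pip install --upgrade git+https://github.com/astronomy-commons/hats.git`.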

@nevencaplar (Member)

@OliviaLynn Can you confirm if this is solved?

nevencaplar moved this to In Progress in HATS / LSDB Nov 8, 2024
@OliviaLynn (Member, Author) commented Nov 14, 2024

Apologies, I've had GitHub notifications off. I just checked, and it works!

github-project-automation (bot) moved this from In Progress to Done in HATS / LSDB Nov 14, 2024