Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow hints for upcasting parquet to arrow integer types #6892

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

gruuya
Copy link
Contributor

@gruuya gruuya commented Dec 17, 2024

Which issue does this PR close?

Closes #6891.

Rationale for this change

Allow casting integers so long as the precision isn't lost.

What changes are included in this PR?

Add possible matches for up-casting all integer DataTypes.

Are there any user-facing changes?

None.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Dec 17, 2024
@gruuya gruuya force-pushed the parquet-hint-integer-upcast branch from 4d39589 to d0385d4 Compare December 17, 2024 13:45
@tustvold
Copy link
Contributor

I'm pretty lukewarm on baking in schema coercion into the parquet reader at this level, the interpretation of how to map the parquet data types to arrow is gnarly enough without needing to also perform non-trivial schema adaption. I think it would be better to handle this at a higher level, e.g. as proposed by #6735

@gruuya
Copy link
Contributor Author

gruuya commented Dec 18, 2024

baking in schema coercion into the parquet reader at this level

Oh I think I understand. My initial impression was that ArrowReaderOptions::with_schema is the intended mechanism for overriding the Parquet level schema

/// Provide a schema to use when reading the parquet file. If provided it
/// takes precedence over the schema inferred from the file or the schema defined
/// in the file's metadata. If the schema is not compatible with the file's
/// schema an error will be returned when constructing the builder.
///
/// This option is only required if you want to cast columns to a different type.
/// For example, if you wanted to cast from an Int64 in the Parquet file to a Timestamp
/// in the Arrow schema.

So the nuance is that in the original issue the goal was to supply a hint to a Parquet with missing type hints, whereas in my case I want to override an already-hinted Parquet schema, thus somewhat abusing/overloading this mechanism? This would also suggest that the above description needs to be changed as it is misleading.

I guess then that #6735 could also provide a solution to this problem (albeit with more effort). Let me take a look at it.

@alamb
Copy link
Contributor

alamb commented Jan 7, 2025

There is a related discussion on:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parquet schema hint doesn't support integer types upcasting
3 participants