Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-45092: [C++][Parquet] Add GetReadRanges function to FileReader #45093

Merged
merged 4 commits into from
Jan 5, 2025

Conversation

zeroshade
Copy link
Member

@zeroshade zeroshade commented Dec 20, 2024

Rationale for this change

For some consumers, it is convenient to expose a way to retrieve the necessary byte ranges of a parquet file to read specific column chunks from row groups without having to go through a full ReadRangeCache. Ultimately, it's a fairly simple function since we already have all the infrastructure implemented to compute these ranges.

What changes are included in this PR?

A single function added to parquet::FileReader named GetReadRanges which computes and retrieves the coalesced read ranges for specified row groups and column indices.

@zeroshade zeroshade requested review from kou and lidavidm December 20, 2024 16:55
@zeroshade zeroshade requested a review from wgtmac as a code owner December 20, 2024 16:55
Copy link

⚠️ GitHub issue #45092 has been automatically assigned in GitHub to PR creator.

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also fix the lint errors?

I don't object this API but is there any conreate use-case of this API?

cpp/src/parquet/file_reader.cc Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Dec 21, 2024
@lidavidm
Copy link
Member

I pinged colleagues to give a review and make sure the API fits their use case. But essentially sometimes we want to accomplish pre-buffering (like ReadRangeCache) but without having to go through that API specifically. (Possibly it would work to implement a custom RandomAccessFile too?)

@wgtmac
Copy link
Member

wgtmac commented Dec 21, 2024

I think this is more like an utility function that users are capable of parsing parquet::FileMetaData to compute the same result on their end. In terms of the extensibility, if users want a third parameter of requested row ranges in addition to row group and column, do we want to support it as well?

FYI there was a discussion on the row range API: https://docs.google.com/document/d/1SeVcYudu6uD9rb9zRAnlLGgdauutaNZlAaS0gVzjkgM and a stale PR: #39731.

@mapleFU
Copy link
Member

mapleFU commented Dec 23, 2024

General looks ok to me ( but without page index, I think it might not prune many data)

@felipeblazing
Copy link

The idea is that we would like to be able to know ahead of time what bytes are going to be read from a parquet file without necessarily performing the computation associated with decoding the file at the same time. The high level use case for us is that sometimes we don't have the computational resources to perform decompression and decoding but we do have available I/O. We want to be able to continue to perform I/O when interacting with object stores and other slow sources of parquet data without having to commit to decompression / decoding until a later point in time.

This would allow us to move bytes from object stores to local memory where it can wait until we have compute resources for decompression / decoding.

@zeroshade
Copy link
Member Author

Lint issues fixed and variable renamed as suggested

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Dec 23, 2024
@mapleFU
Copy link
Member

mapleFU commented Dec 24, 2024

This patch LGTM now. Out of curiousity, why not call PreBuffer in SerializedFile, but call this new interface?

@@ -201,6 +201,32 @@ class PARQUET_EXPORT ParquetFileReader {
const ::arrow::io::IOContext& ctx,
const ::arrow::io::CacheOptions& options);

// Retrieve the list of byte ranges that would need to be read to retrieve
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, use /// for a doxygen comment

@lidavidm
Copy link
Member

This patch LGTM now. Out of curiousity, why not call PreBuffer in SerializedFile, but call this new interface?

The idea is that this doesn't actually trigger any I/O. I suppose you could rig a custom RandomAccessFile to do the same thing with PreBuffer but I think it would be significantly more complicated (as you do want to allow I/O to read the footer, and you'll have to resolve the I/O request at some point).

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Dec 26, 2024
@zeroshade zeroshade merged commit b4b46e3 into apache:main Jan 5, 2025
31 of 33 checks passed
@zeroshade zeroshade removed the awaiting merge Awaiting merge label Jan 5, 2025
@zeroshade zeroshade deleted the expose-read-ranges branch January 5, 2025 19:41
@hellishfire
Copy link
Contributor

I think this is more like an utility function that users are capable of parsing parquet::FileMetaData to compute the same result on their end. In terms of the extensibility, if users want a third parameter of requested row ranges in addition to row group and column, do we want to support it as well?

FYI there was a discussion on the row range API: https://docs.google.com/document/d/1SeVcYudu6uD9rb9zRAnlLGgdauutaNZlAaS0gVzjkgM and a stale PR: #39731.

On a side note, do you plan to pick up that stale row range api PR? You once mentioned something like that.

@wgtmac
Copy link
Member

wgtmac commented Jan 7, 2025

Sorry for the delay. I will take it over now. @hellishfire

Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit b4b46e3.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants