-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-45092: [C++][Parquet] Add GetReadRanges function to FileReader #45093
Conversation
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also fix the lint errors?
I don't object this API but is there any conreate use-case of this API?
I pinged colleagues to give a review and make sure the API fits their use case. But essentially sometimes we want to accomplish pre-buffering (like ReadRangeCache) but without having to go through that API specifically. (Possibly it would work to implement a custom RandomAccessFile too?) |
I think this is more like an utility function that users are capable of parsing FYI there was a discussion on the row range API: https://docs.google.com/document/d/1SeVcYudu6uD9rb9zRAnlLGgdauutaNZlAaS0gVzjkgM and a stale PR: #39731. |
General looks ok to me ( but without page index, I think it might not prune many data) |
The idea is that we would like to be able to know ahead of time what bytes are going to be read from a parquet file without necessarily performing the computation associated with decoding the file at the same time. The high level use case for us is that sometimes we don't have the computational resources to perform decompression and decoding but we do have available I/O. We want to be able to continue to perform I/O when interacting with object stores and other slow sources of parquet data without having to commit to decompression / decoding until a later point in time. This would allow us to move bytes from object stores to local memory where it can wait until we have compute resources for decompression / decoding. |
Lint issues fixed and variable renamed as suggested |
This patch LGTM now. Out of curiousity, why not call |
cpp/src/parquet/file_reader.h
Outdated
@@ -201,6 +201,32 @@ class PARQUET_EXPORT ParquetFileReader { | |||
const ::arrow::io::IOContext& ctx, | |||
const ::arrow::io::CacheOptions& options); | |||
|
|||
// Retrieve the list of byte ranges that would need to be read to retrieve |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, use ///
for a doxygen comment
The idea is that this doesn't actually trigger any I/O. I suppose you could rig a custom RandomAccessFile to do the same thing with PreBuffer but I think it would be significantly more complicated (as you do want to allow I/O to read the footer, and you'll have to resolve the I/O request at some point). |
On a side note, do you plan to pick up that stale row range api PR? You once mentioned something like that. |
Sorry for the delay. I will take it over now. @hellishfire |
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit b4b46e3. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them. |
Rationale for this change
For some consumers, it is convenient to expose a way to retrieve the necessary byte ranges of a parquet file to read specific column chunks from row groups without having to go through a full
ReadRangeCache
. Ultimately, it's a fairly simple function since we already have all the infrastructure implemented to compute these ranges.What changes are included in this PR?
A single function added to
parquet::FileReader
namedGetReadRanges
which computes and retrieves the coalesced read ranges for specified row groups and column indices.