Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation #37400
base: main
Are you sure you want to change the base?
GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation #37400
Changes from 4 commits
f1c6dc0
6ebd6da
70c9267
48350d8
d2a659e
41236d8
8afba81
96c6691
c131341
220b58e
ad96c48
b756241
f43505b
3497f4a
fecd0f0
29cc1c1
ffbb491
4d63428
f689716
8e9cb16
7fd47be
7c4ff4e
feccee9
90245e7
d924e36
0340193
b78eed0
23828e1
6fd57dc
86a8760
f8e724c
447badf
0c1065c
5225e08
a779982
4195406
ed267bd
478889d
2992072
4852261
add1afd
f627e30
bb8d4a5
ad0f1af
e1de5bc
430742a
00f176e
17f4951
de27ce4
259f15b
057b542
34a4c28
70e3508
c587568
2223423
22030db
e9c550a
23fb3fa
d892819
ef3291d
7aee7dd
c5b1fb1
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the point of making a copy here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyway underlying builder doesn't hold a reference here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The builder cannot outlive the FileWriter, so why not simply follow other places like this https://github.com/search?q=repo%3Aapache%2Farrow+%22const+WriterProperties*%22&type=code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't actually appending a new row-group just marking that a row-group is starting so filters should be reset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Parquet uses row-group level bloom filter, so this just setup a new row-group for filters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about putting this to the beginning of
class BloomFilterBuilder
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a recent discussion on the parquet mailing list about bloom filters and what good writers should do. My take-away was:
ndv
if necessary.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I think the best way is to buffering the hash values and making a decision later when hash value too much or buffer is too large. But personally I think we can first make a "static" config and enhance it later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have reviewed that PR and it could be a followup change. Writer implementation has the freedom to try smart things.
FYI, parquet-java also discards the bloom filter if dictionary encoding is applied to all data pages, though I don't think we should do the same thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it maybe better to make this a map. I expect the number of columns with a bloom filter to be relatively small compared to the number of overall columns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 This reuse some structure in PageIndex, however, I think
IndexLocaction
is just 9b, and even for parquet file with 10000 columns, the cost here is low(about 200kib) , so I think we can keep vector here?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move these forward declarations to parquet/type_fwd.h?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be
BloomFilterWriter
instead?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, BloomFilterBuilder is better because the class
BloomFilter
is more like a BloomFilterWriterThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other name is also LGTM, the
BloomFilterBuilder
seems like "container for Bloom Filter Writer in each columns"There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have a lot of Builders, so
BloomFilterBuilder
looks fine to me.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, do we really need a separate file for the new builder? If yes, should we make it an internal header? My rationale is that this kind of builder is used only internally by the writer and users do not have to deal with them at all. However, we have already exposed builders like FileMetaDataBuilder, RowGroupMetaDataBuilder, ColumnChunkMetaDataBuilder, and PageIndexBuilder (unfortunately it was added by myself).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, this could be a internal class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use a anonymous namespace here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is the anonymous namespace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently I changed to internal, since annoymous namespace for a parquet used structure is a bit hacking 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that this static factory? I don't think users can create one without calling this method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It's a static factory
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing it to
/// \brief API to create a BloomFilterBuilder.