Parquet File Metadata caching implementation #586
base: project-antalya-24.12.2
Conversation
ParquetFileMetaDataCache * ParquetFileMetaDataCache::instance(UInt64 max_cache_entries)
{
    static ParquetFileMetaDataCache instance(max_cache_entries);
As I understand, in CacheBase the constructor's first argument is max_size_in_bytes, not a limit on the number of entries.
https://github.com/ClickHouse/ClickHouse/blob/master/src/Common/CacheBase.h#L47
Ouch, I missed this in the first review. I suppose it makes sense to create an extra setting for the cache byte size then, wdyt?
I implemented it, but perhaps we could keep only max_size_bytes instead of max_entries.
Agree. From my point of view, a limit in bytes is more predictable than a limit in entries.
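
For illustration only, a minimal sketch of a byte-sized cache built on CacheBase, assuming the constructor and template parameters from master; the class layout, ParquetMetaDataWeightFunction, and the use of parquet::FileMetaData::size() as a weight estimate are assumptions, not the PR's actual code:

/// Minimal sketch, not the PR's code: size the cache in bytes through CacheBase's
/// first constructor argument plus a weight function. The names below and the use
/// of parquet::FileMetaData::size() as a weight estimate are illustrative.
#include <Common/CacheBase.h>
#include <base/types.h>
#include <parquet/metadata.h>

namespace DB
{

struct ParquetMetaDataWeightFunction
{
    /// Approximate the footprint of a cached entry; the serialized footer size
    /// stands in for the real heap usage of the deserialized metadata.
    size_t operator()(const parquet::FileMetaData & metadata) const
    {
        return metadata.size();
    }
};

class ParquetFileMetaDataCache
    : public CacheBase<String, parquet::FileMetaData, std::hash<String>, ParquetMetaDataWeightFunction>
{
public:
    using Base = CacheBase<String, parquet::FileMetaData, std::hash<String>, ParquetMetaDataWeightFunction>;

    static ParquetFileMetaDataCache * instance(UInt64 max_size_bytes)
    {
        static ParquetFileMetaDataCache cache(max_size_bytes);
        return &cache;
    }

private:
    explicit ParquetFileMetaDataCache(UInt64 max_size_bytes)
        : Base(max_size_bytes) /// first argument is max_size_in_bytes, not an entry count
    {
    }
};

}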
src/Core/ServerSettings.cpp
Outdated
DECLARE(UInt32, allow_feature_tier, 0, "0 - All feature tiers allowed (experimental, beta, production). 1 - Only beta and production feature tiers allowed. 2 - Only production feature tier allowed", 0) \
DECLARE(UInt64, input_format_parquet_metadata_cache_max_size, 10000*50000, "Maximum size of parquet file metadata cache", 0) \
I think it would be better to make this a constant, like the other default cache sizes.
https://github.com/ClickHouse/ClickHouse/blob/master/src/Core/Defines.h#L92
Everything else LGTM.
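
For illustration, a sketch following that Defines.h pattern; the constant name, the 500 MiB default, and the _MiB literal are assumptions, not necessarily what the PR ended up with:

/// Hypothetical sketch. In src/Core/Defines.h, next to the other cache defaults:
static constexpr auto DEFAULT_INPUT_FORMAT_PARQUET_METADATA_CACHE_MAX_SIZE = 500_MiB;

/// The server setting in src/Core/ServerSettings.cpp would then reference the constant
/// instead of an inline 10000*50000 expression:
DECLARE(UInt64, input_format_parquet_metadata_cache_max_size, DEFAULT_INPUT_FORMAT_PARQUET_METADATA_CACHE_MAX_SIZE, "Maximum size of parquet file metadata cache in bytes", 0) \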
@@ -528,7 +595,7 @@ void ParquetBlockInputFormat::initializeIfNeeded()
     if (is_stopped)
         return;

-    metadata = parquet::ReadMetaData(arrow_file);
+    metadata = getFileMetaData();
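
For context, a rough sketch of what the cached path behind getFileMetaData() could look like; the member names object_path and cache_max_size_bytes, the format_settings.parquet.use_metadata_cache flag, and the path-based key are assumptions for illustration, not the PR's actual implementation:

/// Rough sketch only. object_path, cache_max_size_bytes and the
/// format_settings.parquet.use_metadata_cache flag are assumed names.
std::shared_ptr<parquet::FileMetaData> ParquetBlockInputFormat::getFileMetaData()
{
    /// With the cache disabled, the footer is read from the file as before.
    if (!format_settings.parquet.use_metadata_cache)
        return parquet::ReadMetaData(arrow_file);

    /// The key is derived from the object path only, so an entry keeps serving hits
    /// even after the object is deleted or replaced on S3; it is the earlier
    /// S3::getObjectInfo call that notices the missing object.
    return ParquetFileMetaDataCache::instance(cache_max_size_bytes)
        ->getOrSet(object_path, [&] { return parquet::ReadMetaData(arrow_file); })
        .first;
}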
What will happen if the object metadata was put in the cache, but later that object was deleted from S3?
It'll fail on the S3::getObjectInfo call that is made prior to reaching this code.
SETTINGS input_format_parquet_use_metadata_cache = 1, input_format_parquet_filter_push_down = 1, log_comment = 'abc', remote_filesystem_read_prefetch = 0
Query id: 2d36d362-894b-4ad5-a075-6960bca9de88
Elapsed: 148.109 sec.
Received exception from server (version 24.12.2):
Code: 499. DB::Exception: Received from localhost:9000. DB::Exception: Failed to get object info: No response body.. HTTP response code: 404: while reading adir/test0.parquet: The table structure cannot be extracted from a Parquet format file. You can specify the structure manually. (S3_ERROR)
Would adding a new object to S3 change the query result? (i.e. how is the cache updated in this case?)
Would deleting an object from S3 that has metadata in the cache cause any issues (would there be an attempt to read the object based on what is in the cache)?
No. The cache is file specific, so new files would not affect existing cache entries.
Under normal circumstances, no. This cache does not eliminate all S3 calls, including some calls that are made before the Parquet processor is reached. I have not investigated these prior calls deeply, but they are probably related to something like "checking whether ClickHouse has access to the bucket and whether the file actually exists". We could consider the scenario where those calls have already passed and the object gets deleted in between, but that seems rare.
Just caused this issue on purpose using breakpoints + deleting the object on S3. The cache is not invalidated, though.
Force-pushed from 1d0a8f3 to c2301d4.
Clean copy of #541
More details and documentation are in the link.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Implements Parquet Metadata caching.
Documentation entry for user-facing changes