
Parquet File Metadata caching implementation #586

Open · wants to merge 7 commits into base: project-antalya-24.12.2
Conversation

@arthurpassos (Collaborator):

Clean copy of #541

More details and documentation are in the linked PR.

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Implements Parquet Metadata caching.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/




```cpp
ParquetFileMetaDataCache * ParquetFileMetaDataCache::instance(UInt64 max_cache_entries)
{
    /// Meyers singleton: constructed on the first call, so only the first
    /// caller's limit takes effect.
    static ParquetFileMetaDataCache instance(max_cache_entries);
    return &instance;
}
```
@ianton-ru commented on Jan 10, 2025:

As I understand it, the first argument of the CacheBase constructor is max_size_in_bytes, not a limit on the number of entries:
https://github.com/ClickHouse/ClickHouse/blob/master/src/Common/CacheBase.h#L47

@arthurpassos (Collaborator, Author):

Ouch, I missed this in the first review. I suppose it makes sense to create an extra setting for the cache byte size then, wdyt?

@arthurpassos (Collaborator, Author):

I implemented it, but perhaps we could keep only max_size_bytes instead of max_entries.

@ianton-ru:

Agree. From my point of view, a limit in bytes is more predictable than a limit in entries.
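
To make the agreed semantics concrete, here is a minimal, self-contained sketch of a cache bounded by total bytes rather than entry count, mirroring the CacheBase(max_size_in_bytes) convention. All names below are illustrative, not the PR's actual code:

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

/// Illustrative only: an LRU cache that evicts by total byte weight, which is
/// the semantics of CacheBase's first constructor argument (max_size_in_bytes).
template <typename Key, typename Value>
class ByteBoundedLRU
{
public:
    explicit ByteBoundedLRU(size_t max_size_in_bytes_) : max_size_in_bytes(max_size_in_bytes_) {}

    void set(const Key & key, Value value, size_t weight_in_bytes)
    {
        erase(key);
        lru.push_front(key);
        map.emplace(key, Entry{std::move(value), weight_in_bytes, lru.begin()});
        current_size += weight_in_bytes;
        /// Evict least-recently-used entries until back under the byte budget.
        while (current_size > max_size_in_bytes && !lru.empty())
            erase(lru.back());
    }

    Value * get(const Key & key)
    {
        auto it = map.find(key);
        if (it == map.end())
            return nullptr;
        lru.splice(lru.begin(), lru, it->second.lru_it); /// mark as most recently used
        return &it->second.value;
    }

private:
    struct Entry
    {
        Value value;
        size_t weight;
        typename std::list<Key>::iterator lru_it;
    };

    void erase(const Key & key)
    {
        auto it = map.find(key);
        if (it == map.end())
            return;
        current_size -= it->second.weight;
        lru.erase(it->second.lru_it);
        map.erase(it);
    }

    size_t max_size_in_bytes;
    size_t current_size = 0;
    std::list<Key> lru;
    std::unordered_map<Key, Entry> map;
};
```

A caller would then do something like cache.set(path, metadata, estimated_footer_bytes), charging each entry its approximate in-memory footprint; many small footers or a few large ones fit the same budget, which is what makes a byte limit more predictable than an entry limit.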



```cpp
DECLARE(UInt32, allow_feature_tier, 0, "0 - All feature tiers allowed (experimental, beta, production). 1 - Only beta and production feature tiers allowed. 2 - Only production feature tier allowed", 0) \
DECLARE(UInt64, input_format_parquet_metadata_cache_max_size, 10000*50000, "Maximum size of parquet file metadata cache", 0) \
```

@ianton-ru:

I think it's better to make this a const like the other default cache sizes:
https://github.com/ClickHouse/ClickHouse/blob/master/src/Core/Defines.h#L92

Everything else LGTM.
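
A sketch of the suggested change, following the Defines.h convention linked above (the constant name is hypothetical; the value shown is just the PR's inline default, 10000 * 50000 = 500,000,000 bytes, roughly 500 MB):

```cpp
/// Hypothetical addition to src/Core/Defines.h, next to the other default cache sizes:
static constexpr auto DEFAULT_PARQUET_METADATA_CACHE_MAX_SIZE = 10000 * 50000;

/// The setting declaration then references the named constant instead of an inline product:
DECLARE(UInt64, input_format_parquet_metadata_cache_max_size, DEFAULT_PARQUET_METADATA_CACHE_MAX_SIZE, "Maximum size of parquet file metadata cache", 0) \
```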

```diff
@@ -528,7 +595,7 @@ void ParquetBlockInputFormat::initializeIfNeeded()
     if (is_stopped)
         return;

-    metadata = parquet::ReadMetaData(arrow_file);
+    metadata = getFileMetaData();
```
A Member commented:

What will happen if the object's metadata was put in the cache, but later that object was deleted from S3?

@arthurpassos (Collaborator, Author):

It'll fail on the S3::getObjectInfo call that is made before reaching this code.
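
For orientation, a minimal self-contained sketch of what a cache-backed lookup like getFileMetaData() could do; the map, key, and loader callback are assumptions for illustration, with parquet::ReadMetaData(arrow_file) being the pre-existing call the diff replaces:

```cpp
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>

/// Toy stand-in for parquet::FileMetaData (schema, row-group offsets, statistics, ...).
struct FileMetaData {};
using MetadataPtr = std::shared_ptr<FileMetaData>;

/// Hypothetical cache-backed lookup: keyed by the object's path so distinct
/// files never collide, falling back to a storage read on a miss.
MetadataPtr getFileMetaData(
    std::unordered_map<std::string, MetadataPtr> & cache,
    const std::string & object_path,                        /// e.g. "bucket/adir/test0.parquet"
    const std::function<MetadataPtr()> & read_from_storage) /// would wrap parquet::ReadMetaData
{
    if (auto it = cache.find(object_path); it != cache.end())
        return it->second;                      /// hit: the footer is never re-fetched from S3

    MetadataPtr metadata = read_from_storage(); /// miss: one footer read, then cached
    cache.emplace(object_path, metadata);
    return metadata;
}
```

On this model, the remaining S3 calls the author mentions (object existence and access checks) happen before this function is reached, which is why a deleted object fails early rather than silently serving stale cached metadata.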

@arthurpassos (Collaborator, Author):

```
SETTINGS input_format_parquet_use_metadata_cache = 1, input_format_parquet_filter_push_down = 1, log_comment = 'abc', remote_filesystem_read_prefetch = 0

Query id: 2d36d362-894b-4ad5-a075-6960bca9de88

Elapsed: 148.109 sec.

Received exception from server (version 24.12.2):
Code: 499. DB::Exception: Received from localhost:9000. DB::Exception: Failed to get object info: No response body.. HTTP response code: 404: while reading adir/test0.parquet: The table structure cannot be extracted from a Parquet format file. You can specify the structure manually. (S3_ERROR)
```

@Enmk (Member) left a comment:

Would adding a new object to S3 change the query result (i.e., how is the cache updated in this case)?
Would deleting an object from S3 that has metadata in the cache cause any issues (would there be an attempt to read the object based on what is in the cache)?

@arthurpassos (Collaborator, Author):

> Would adding a new object to S3 change the query result?

No. The cache is file-specific, so new files would not affect existing cache entries.

> Would deleting an object from S3 that has metadata in the cache cause any issues?

Under normal circumstances, no. This cache does not eliminate all S3 calls; some calls are still made before reaching the Parquet processor. I have not investigated these prior calls deeply, but they are probably related to something like checking whether ClickHouse has access to the bucket and whether the file actually exists.

Now, we could consider the scenario where these calls have already passed and the object gets deleted in between. But that seems rare.

@arthurpassos (Collaborator, Author):

> Now, we could consider the scenario where these calls have already passed and the object gets deleted in between. But that seems rare.

I just caused this issue on purpose using breakpoints plus deleting the object on S3:

```
Query id: 4c03b02d-2336-4c56-857e-78a49ad55d15

Elapsed: 16.127 sec.

Received exception from server (version 24.12.2):
Code: 117. DB::Exception: Received from localhost:9000. DB::Exception: IOError: Code: 499. DB::Exception: The specified key does not exist.: while reading key: adir/test1.parquet, from bucket: storage-headers-test. (S3_ERROR) (version 24.12.2.1): (in file/uri storage-headers-test/adir/test1.parquet): While executing ParquetBlockInputFormat: While executing S3Source. (INCORRECT_DATA)
```

The cache is not invalidated, though.
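
A conceivable follow-up, sketched here purely as an assumption (nothing like this is in the PR), would be to evict the entry when the subsequent read reports the key is gone, so the next query repopulates the cache instead of keeping the stale footer:

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>

struct FileMetaData {};
using MetadataPtr = std::shared_ptr<FileMetaData>;

/// Hypothetical eviction-on-error wrapper around the actual object read.
template <typename ReadFn>
auto readWithEviction(
    std::unordered_map<std::string, MetadataPtr> & cache,
    const std::string & object_path,
    ReadFn && read)
{
    try
    {
        return read();
    }
    catch (const std::runtime_error &) /// stand-in for "key does not exist" errors
    {
        cache.erase(object_path); /// drop the stale footer for the deleted object
        throw;                    /// the current query still fails, but the cache heals
    }
}
```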

@Enmk force-pushed the metadata_cache_for_parquet_24_12_2 branch from 1d0a8f3 to c2301d4 on January 29, 2025 at 14:20.
@altinity-robot (Collaborator) commented on Jan 29, 2025:

This is an automated comment for commit c2301d4 with a description of existing statuses. It's updated for the latest CI run.


Failed checks:

| Check name | Description | Status |
|---|---|---|
| Integration tests | The integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests | ❌ failure |
| Sign aarch64 | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ❌ error |
| Sign release | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ❌ error |
| Stateful tests | Runs stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | ❌ failure |
| Stateless tests | Runs stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | ❌ failure |
| Stress test | Runs stateless functional tests concurrently from several clients to detect concurrency-related errors | ❌ failure |
Successful checks:

| Check name | Description | Status |
|---|---|---|
| Builds | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Compatibility check | Checks that clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help | ✅ success |
| Docker keeper image | The check to build and optionally push the mentioned image to docker hub | ✅ success |
| Docker server image | The check to build and optionally push the mentioned image to docker hub | ✅ success |
| Install packages | Checks that the built packages are installable in a clear environment | ✅ success |
| Ready for release | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
