DuckDB iceberg_scan post processor #3066

Closed
wants to merge 2 commits into from

rafaelbey
Contributor

@rafaelbey rafaelbey commented Sep 3, 2024

This PR proposes how we can support iceberg_scan through DuckDB.

The proposal has two parts: S3 auth, and iceberg_scan.

For S3 authentication, there is currently a new DuckDBS3AuthenticationStrategy which, when executed, is responsible for:

  • install and load iceberg extension
  • create or replace the S3 credential
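
As a rough illustration of the two steps above, here is a minimal sketch of the statements such a strategy could issue on a new connection. It is not the code in this PR: the function name, the default secret name `legend_s3`, and the choice of KEY_ID / SECRET / REGION parameters are assumptions made for the example. It only builds the SQL strings, so it runs without DuckDB installed.

```python
def s3_setup_statements(key_id, secret, region, secret_name="legend_s3"):
    """Build the statements for the two steps: load iceberg, then (re)create the S3 secret.

    All names here are hypothetical; the real strategy may use different
    parameters or a different credential source.
    """
    def quote(value):
        # Escape single quotes so the value is a valid SQL string literal.
        return "'" + value.replace("'", "''") + "'"

    return [
        "INSTALL iceberg",
        "LOAD iceberg",
        f"CREATE OR REPLACE SECRET {secret_name} ("
        f"TYPE S3, KEY_ID {quote(key_id)}, SECRET {quote(secret)}, REGION {quote(region)})",
    ]
```

Using CREATE OR REPLACE keeps the setup idempotent, so it can run every time a connection is handed out.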

The second part is the connection post processor. This post processor translates table names according to the following rules:

  • There is a root path, to which the post processor appends the schema and table names to build the iceberg_scan path
  • It expects that each schema and table are directories within the root path:
    • root path: s3://warehouse/wh/
    • Schema: nyc
    • Table: taxis
    • Original query: select * from nyc.taxis as "t"
    • Translation: select * from (select * from iceberg_scan('s3://warehouse/wh/nyc/taxis')) as "t"
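
The translation rule above can be sketched as follows. This is only an illustration of the path concatenation and the subquery wrapping; the function names are hypothetical, and the actual post processor rewrites the generated SQL rather than building strings like this.

```python
def iceberg_scan_path(root_path, schema, table):
    """Join root path, schema and table into the directory passed to iceberg_scan."""
    return root_path.rstrip("/") + "/" + schema + "/" + table


def translate_table(root_path, schema, table, alias):
    """Return the FROM-clause fragment that replaces `schema.table as "alias"`."""
    # Escape single quotes so the path stays a valid SQL string literal.
    path = iceberg_scan_path(root_path, schema, table).replace("'", "''")
    return f"(select * from iceberg_scan('{path}')) as \"{alias}\""
```

For the example above, `"select * from " + translate_table("s3://warehouse/wh/", "nyc", "taxis", "t")` yields the translated query.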

A few challenges/questions/todos:

  • This does NOT include the required grammar for this to work on Legend Engine.
    • Grammar / Composer for new S3 Auth strategy
    • Grammar / Composer for new Iceberg Post Processor
  • Legend Pure relation does not yet support a schema in the table name, hence the test cases hardcode the schema name in the root path
  • Legend Pure relation only supports INT and VARCHAR types
  • This assumes we can have a private dyna function for the iceberg_scan translation, disabling an existing check
  • DuckDB currently expects some of the Hadoop Iceberg hint files in order to find the right metadata files. This should improve in the next version, see duckdb/duckdb-iceberg#63 ("Addresses #29: Support missing version-hint.txt and provide additional options"). The test case setup reflects this, as it tries to mimic these expectations
  • To avoid the DuckDB hint file requirement, we could integrate properly with an Iceberg catalog, and leverage its APIs to gather the latest metadata file
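
To make the "latest metadata file" idea concrete, here is a small sketch that picks the newest metadata file from a directory listing when no hint file is available. It assumes the Hadoop-style naming `v<N>.metadata.json`; a real catalog integration would ask the catalog for the current metadata location instead, and catalogs generally name these files differently. The function name is hypothetical.

```python
import re

# Hadoop-style table metadata files: v1.metadata.json, v2.metadata.json, ...
_HADOOP_METADATA = re.compile(r"^v(\d+)\.metadata\.json$")


def latest_metadata_file(file_names):
    """Return the metadata file with the highest version number, or None if there is none."""
    versions = []
    for name in file_names:
        match = _HADOOP_METADATA.match(name)
        if match:
            # Compare versions numerically so v10 sorts after v2.
            versions.append((int(match.group(1)), name))
    if not versions:
        return None
    return max(versions)[1]
```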

github-actions bot commented Sep 3, 2024

Test Results

   837 files      837 suites   56m 11s ⏱️
 8 726 tests    8 659 ✔️   66 💤   1 ❌
13 220 runs    13 153 ✔️   66 💤   1 ❌

For more details on these failures, see this check.

Results for commit 3e49602.

@finos-admin
Member

This PR is stale because it has been open for 30 days with no activity. Please remove stale label or add any comment to keep this open. Otherwise this will be closed in 5 days.

@finos-admin
Member

This PR was closed because it has been inactive for 35 days. Please re-open if this PR is still relevant.

@finos-admin finos-admin closed this Oct 9, 2024
@rafaelbey rafaelbey deleted the iceberg_scan branch December 18, 2024 21:49