
Add iceberg catalog read support #98

Open
wants to merge 21 commits into main
Conversation

Tmonster

Piggybacking off of #95.

This PR adds support for attaching an Iceberg catalog and reading Iceberg tables from a data lake. It specifically supports the following SQL commands:

-- Create an ICEBERG secret to access your catalog
CREATE SECRET (
    TYPE ICEBERG,
    CLIENT_ID '<your-catalog-client-id>',
    CLIENT_SECRET '<your-catalog-client-secret>',
    ENDPOINT 'https://<your-catalog-host>/api/catalog',
    AWS_REGION 'us-east-1'
);

-- Attach your catalog

ATTACH 'my_catalog' AS my_datalake (TYPE ICEBERG);
-- Select your iceberg tables in the datalake via the catalog
SHOW ALL TABLES;
SELECT * FROM my_datalake.schema.table;

Some tests are included as well to make sure everything works. There is also a double check on the config, since it's possible some catalogs won't return configs; in that case the user is required to have a second key.
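A sketch of what that fallback might look like, assuming the second key is supplied as a separate S3-type secret (the secret type and parameter names here are illustrative, not taken from this PR):

```sql
-- Hypothetical fallback: if the catalog doesn't return a config,
-- storage credentials come from a separately created S3 secret.
CREATE SECRET (
    TYPE S3,
    KEY_ID '<your-aws-key-id>',
    SECRET '<your-aws-secret-key>',
    REGION 'us-east-1'
);
```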

Collaborator

@samansmink samansmink left a comment


Looks great! I've added a few comments

CMakeLists.txt Outdated
@@ -4,7 +4,7 @@ cmake_minimum_required(VERSION 2.8.12)
 set(TARGET_NAME iceberg)
 project(${TARGET_NAME})

-set(CMAKE_CXX_STANDARD 14)
+set(CMAKE_CXX_STANDARD 17)
Collaborator


Do we really need to bump this? I would prefer not to, to avoid any CI issues.

}
return result;

// throw std::runtime_error("No AWS credentials found for table");
Collaborator


nit: can be removed

}
}

// ICConnection &ICTransaction::GetConnection() {
Collaborator


nit: let's just remove all commented-out code. This has been copied from the uc_catalog extension, which has a prototype-quality codebase.

require httpfs

statement ok
CREATE SECRET (
Collaborator


I think this is fine for a first PR, but we should think about the UX a little here: right now there is only one Iceberg secret, which is then fetched when calling ATTACH '' AS my_datalake (TYPE ICEBERG);

However it's probably desirable to be able to do something like:

CREATE SECRET irc_secret_1 (TYPE ICEBERG, ENDPOINT 'http://127.0.0.1:8181', BEARER_TOKEN 'bla');
CREATE SECRET irc_secret_2 (TYPE ICEBERG, ENDPOINT 'http://some.other.thing.com', BEARER_TOKEN 'bla');
ATTACH 'irc_secret_1' AS irc1 (TYPE ICEBERG);
ATTACH 'irc_secret_2' AS irc2 (TYPE ICEBERG);

or perhaps

CREATE SECRET irc_secret_1 (TYPE ICEBERG, SCOPE 'http://127.0.0.1:8181', BEARER_TOKEN 'bla');
CREATE SECRET irc_secret_2 (TYPE ICEBERG, SCOPE 'http://some.other.thing.com', BEARER_TOKEN 'bla');
ATTACH 'http://127.0.0.1:8181' AS irc1 (TYPE ICEBERG);
ATTACH 'http://some.other.thing.com' AS irc2 (TYPE ICEBERG);

We can have some discussion on what is nicest; these are currently also open questions for other catalog extensions such as postgres and mysql, I think.

}

auto &ic_catalog = catalog.Cast<ICCatalog>();
// TODO: handle out-of-order columns using position property
Collaborator


This TODO is copy-pasted from uc_catalog: we should figure out whether it is also relevant to iceberg. If so, it can stay; otherwise we should remove it to avoid confusion.

auto &get = (LogicalGet &)*op;
bind_data = std::move(get.bind_data);

return parquet_scan_function;
Collaborator


I think this doesn't work with tables with deletes and potentially returns incorrect results. Can we add a test for this?

If it does indeed fail, we should take a look into how to solve this, or at least throw an error
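A sketch of such a test, in the sqllogictest format used elsewhere in this PR; the schema and table names are placeholders, assuming the test catalog contains a table with position or equality delete files:

```
require httpfs

statement ok
ATTACH 'my_catalog' AS my_datalake (TYPE ICEBERG);

# table_with_deletes is a placeholder for a table that has delete files;
# the count should reflect rows after deletes are applied, not the raw data files.
query I
SELECT count(*) FROM my_datalake.test_schema.table_with_deletes;
----
<expected-count-after-deletes>
```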

auto table_ref = iceberg_scan_function.bind_replace(context, bind_input);

// 1) Create a Binder and bind the parser-level TableRef -> BoundTableRef
auto binder = Binder::CreateBinder(context);
Collaborator


nit: formatting

throw NotImplementedException("BindUpdateConstraints");
}

struct MyIcebergFunctionData : public FunctionData {
Collaborator


this is dead code i think?

    src/common/utils.cpp
    src/common/schema.cpp
    src/common/iceberg.cpp
    src/iceberg_functions/iceberg_snapshots.cpp
    src/iceberg_functions/iceberg_scan.cpp
-   src/iceberg_functions/iceberg_metadata.cpp)
+   src/iceberg_functions/iceberg_metadata.cpp
+   src/storage/ic_catalog.cpp
Collaborator


I feel like this structure could be a bit clearer

What do you think about renaming IC to IRC (this code is meant for the Iceberg REST Catalog after all) and moving everything to src/storage/irc/, e.g. src/storage/irc/irc_catalog.cpp?

This would allow us to later on create another catalog which can handle static iceberg tables, similar to duckdb/duckdb-delta#110, which could then live in src/storage/static/static_catalog.cpp.
