Adjust dbt profiles and schema macro to support prod/CI targets #44

Merged
12 changes: 11 additions & 1 deletion dbt/README.md
@@ -48,14 +48,24 @@ Build the models to create views in our Athena warehouse:
dbt run
```

+By default, all `dbt` commands will run against the `dev` environment, which
+namespaces the resources it creates by prefixing target database names with
+your Unix `$USER` name (e.g. `jecochr-default` for the `default` database when
Member

@dfsnow dfsnow Jul 31, 2023

question (blocking): How does this work when the name of a branch includes an unexpected character (e.g. the `/` in `jeancochrane/28-data-catalog-add-production-profile-to-the-dbt-configuration`)?

Contributor Author

Good question! I think we can replace special characters (and uppercase letters, which are not allowed in dbt-athena) using Jinja template filters. I'll take a stab at that now.

Contributor Author

@dfsnow This spiraled a little bit, but I was able to make it work in 181e516! It required a new custom macro, but that led me to realize that we were starting to build a lot of logic into a collection of untested macros, so I took a minute to design a quick unit testing framework for our macros. Curious what you think!

Member

This is great! Nice work and definitely needed since this is getting more complex. I'm a dumb dumb and commented on a commit instead of the PR. Linking here: 181e516#r123228596
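
The sanitization discussed in this thread could look roughly like the following. This is a minimal Python sketch of the idea, not the macro added in 181e516 (which is not shown here); the function name `slugify_schema_prefix`, the allowed character set, and the choice of `-` as the replacement character are assumptions made for illustration.

```python
import re


def slugify_schema_prefix(raw: str, replacement: str = "-") -> str:
    """Turn a branch or user name into a lowercase, separator-safe prefix.

    Hypothetical helper: the allowed character set below is an assumption,
    not a rule taken from dbt-athena.
    """
    lowered = raw.lower()
    # Collapse every run of characters outside [a-z0-9_] into one replacement.
    slug = re.sub(r"[^a-z0-9_]+", replacement, lowered)
    # Drop replacements left dangling at either end.
    return slug.strip(replacement)


if __name__ == "__main__":
    branch = "jeancochrane/28-data-catalog-add-production-profile-to-the-dbt-configuration"
    print(slugify_schema_prefix(branch))
```

Lowercasing before substitution means uppercase letters are kept as their lowercase form rather than being replaced.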

+`dbt` is run on Jean's machine). To instead **run commands against prod**,
+use the `--target` flag:

+```
+dbt run --target prod
+```

Generate the documentation:

```
dbt docs generate
```

This will create a new file `target/index.html` representing the static
-docs website.
+docs site.

You can also serve the docs locally:

4 changes: 2 additions & 2 deletions dbt/dbt_project.yml
@@ -28,6 +28,6 @@ models:
 athena:
 +materialized: view
 default:
-+schema: dbt-test-default
++schema: default
 location:
-+schema: dbt-test-location
++schema: location
32 changes: 29 additions & 3 deletions dbt/macros/generate_schema_name.sql
@@ -1,15 +1,41 @@
--- Override the default schema naming to remove the dbt-added prefix.
+-- Override the default schema naming to remove the autogenerated prefix
+-- and replace it with our own namespacing on dev and CI.
 -- See: https://docs.getdbt.com/docs/build/custom-schemas
 {% macro generate_schema_name(custom_schema_name, node) -%}

+{#
+According to the dbt docs linked above, this is required to be set by
+the built-in macro that we are overriding, but we don't actually use it
+#}
 {%- set default_schema = target.schema -%}

+{%- if target.name == "dev" -%}
+{%- set schema_prefix = env_var("USER") -%}
+{%- elif target.name == "ci" -%}
+{%- set schema_prefix = env_var("GITHUB_BASE_REF") -%}
+{%- else -%}
+{%- set schema_prefix = "" -%}
+{%- endif -%}

 {%- if custom_schema_name is none -%}

-{{ default_schema }}
+{#
+The default schema name is not allowed, since we use subdirectory
+organization to map tables/views to their Athena database
+#}
+{{ exceptions.raise_compiler_error(
+"Missing schema definition for " ~ node.name ~ ". " ~
+"Its containing subdirectory is probably missing a `+schema` " ~
+"attribute under the `models` config in dbt_project.yml."
+) }}

 {%- else -%}

-{{ custom_schema_name | trim }}
+{%- set full_schema_name -%}
+{{ schema_prefix ~ "-" ~ custom_schema_name | trim }}
+{%- endset -%}

+{{ full_schema_name }}

 {%- endif -%}
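
Read as plain control flow, the macro above picks a prefix from the target name, refuses to fall back to the default schema, and joins prefix and schema with a hyphen. The Python sketch below mirrors that logic so it can be exercised outside dbt; `resolve_schema_name`, `MissingSchemaError`, and the `env` mapping are hypothetical names. One deliberate difference, flagged as an assumption about intent: the sketch omits the hyphen when the prefix is empty, whereas the Jinja expression as written would yield `-default` for any target other than `dev` or `ci`.

```python
from typing import Mapping, Optional


class MissingSchemaError(ValueError):
    """Raised when a model has no `+schema` configured (hypothetical name)."""


def resolve_schema_name(
    custom_schema_name: Optional[str],
    node_name: str,
    target_name: str,
    env: Mapping[str, str],
) -> str:
    # Choose the namespace prefix from the active target, as the macro does.
    if target_name == "dev":
        prefix = env["USER"]
    elif target_name == "ci":
        prefix = env["GITHUB_BASE_REF"]
    else:
        prefix = ""

    # The default schema is never used; a missing `+schema` is an error.
    if custom_schema_name is None:
        raise MissingSchemaError(
            f"Missing schema definition for {node_name}. "
            "Its containing subdirectory is probably missing a `+schema` "
            "attribute under the `models` config in dbt_project.yml."
        )

    schema = custom_schema_name.strip()
    # Assumption: no leading hyphen when there is no prefix (e.g. prod).
    return f"{prefix}-{schema}" if prefix else schema


if __name__ == "__main__":
    print(resolve_schema_name("default", "my_model", "dev", {"USER": "jecochr"}))
    print(resolve_schema_name("default", "my_model", "prod", {}))
```

On GitHub Actions, `GITHUB_BASE_REF` holds the pull request's target branch and `GITHUB_HEAD_REF` holds its source branch; the sketch keeps the variable name used in the diff.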

19 changes: 19 additions & 0 deletions dbt/profiles.yml
@@ -11,3 +11,22 @@ athena:
 # "database" here corresponds to a Glue data catalog
 database: awsdatacatalog
 threads: 5
+ci:
+type: athena
+s3_staging_dir: s3://ccao-dbt-athena-ci-us-east-1/results/
+s3_data_dir: s3://ccao-dbt-athena-ci-us-east-1/data/
+region_name: us-east-1
+schema: dbt-test
+database: awsdatacatalog
+# Prefix all generated data by schema, so that we can delete it when the
+# PR is merged
+s3_data_naming: schema_table
Comment on lines +21 to +23
Contributor Author

I am actually not totally sure if this works yet, since we're not yet using dbt to manage our CTAs, but the idea is that eventually, when dbt builds CTAs and stores them in S3, the `schema_table` config will tell it to store those files in the configured `s3_data_dir` bucket under the path `{s3_data_dir}/{schema}/{table}/` (docs). That way, when we're ready to have CD clean up the resources generated for the PR, we can use the schema name defined by the `generate_schema_name` macro to delete everything in `{s3_data_dir}/{schema}/` and leave resources that are being used by other PRs intact.
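
The cleanup idea described in this comment comes down to prefix arithmetic on S3 keys. A small Python sketch, assuming the `{s3_data_dir}/{schema}/{table}/` layout quoted above; the function names, the example schema, and the table name `vw_example` are hypothetical, and no AWS calls are made.

```python
def table_data_prefix(s3_data_dir: str, schema: str, table: str) -> str:
    """Where one table's files would live under the assumed layout."""
    return f"{s3_data_dir.rstrip('/')}/{schema}/{table}/"


def schema_cleanup_prefix(s3_data_dir: str, schema: str) -> str:
    """Prefix covering every table written for one schema (one PR)."""
    return f"{s3_data_dir.rstrip('/')}/{schema}/"


if __name__ == "__main__":
    data_dir = "s3://ccao-dbt-athena-ci-us-east-1/data/"
    schema = "my-branch-default"  # hypothetical CI schema name
    print(table_data_prefix(data_dir, schema, "vw_example"))
    print(schema_cleanup_prefix(data_dir, schema))
```

Keeping the trailing slash on the cleanup prefix matters: without it, a prefix for `my-branch-default` would also match a sibling such as `my-branch-default-2`.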

+threads: 5
+prod:
+type: athena
+s3_staging_dir: s3://ccao-athena-results-us-east-1/
+s3_data_dir: s3://ccao-athena-data-us-east-1/
+region_name: us-east-1
+schema: default
+database: awsdatacatalog
+threads: 5