Adjust dbt profiles and schema macro to support prod/CI targets #44

Merged
18 changes: 17 additions & 1 deletion dbt/README.md
@@ -48,14 +48,24 @@ Build the models to create views in our Athena warehouse:
dbt run
```

By default, all `dbt` commands will run against the `dev` environment, which
namespaces the resources it creates by prefixing target database names with
your Unix `$USER` name (e.g. `jecochr-default` for the `default` database when
`dbt` is run on Jean's machine). To instead **run commands against prod**,
use the `--target` flag:

```
dbt run --target prod
```

@dfsnow (Member) commented on Jul 31, 2023:

question (blocking): How does this work when the name of a branch includes an unexpected character (i.e. the `/` in `jeancochrane/28-data-catalog-add-production-profile-to-the-dbt-configuration`)?

The PR author (Contributor) replied:

Good question! I think we can replace special characters (and uppercase letters, which are not allowed in dbt-athena) using Jinja template filters. I'll take a stab at that now.

The PR author (Contributor) replied:

@dfsnow This spiraled a little bit, but I was able to make it work in 181e516! It required a new custom macro, but that led me to realize that we were starting to build a lot of logic into a collection of untested macros, so I took a minute to design a quick unit testing framework for our macros. Curious what you think!

@dfsnow (Member) replied:

This is great! Nice work and definitely needed since this is getting more complex. I'm a dumb dumb and commented on a commit instead of the PR. Linking here: 181e516#r123228596
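For context, the "Jinja template filters" idea floated in the thread above would have looked roughly like the snippet below. This is an illustrative sketch only; the PR ultimately implemented the regex-based `kebab_slugify` macro shown further down, which also strips characters other than slashes.

```
{# rough sketch of the filter-based idea, not the final implementation #}
{{ env_var("GITHUB_HEAD_REF") | lower | replace("/", "-") }}
```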

Generate the documentation:

```
dbt docs generate
```

This will create a new file `target/index.html` representing the static
docs site.

You can also serve the docs locally:

@@ -74,3 +84,9 @@ Run tests for only one model:
```
dbt test --select <model_name>
```

Run tests for dbt macros:

```
dbt run-operation test_all
```
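For illustration (not part of the README diff): a sketch of how the new macro tests and the `ci` target might be exercised locally, assuming you have AWS credentials for the CI buckets. Outside of GitHub Actions you would need to set `GITHUB_HEAD_REF` yourself, since the `ci` schema prefix is derived from it; the branch name below is the one discussed in review.

```
# Run the macro unit tests
dbt run-operation test_all

# Build models against the CI target, namespacing schemas by branch
GITHUB_HEAD_REF="jeancochrane/28-data-catalog-add-production-profile-to-the-dbt-configuration" \
    dbt run --target ci
```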
4 changes: 2 additions & 2 deletions dbt/dbt_project.yml
@@ -28,6 +28,6 @@ models:
   athena:
     +materialized: view
     default:
-      +schema: dbt-test-default
+      +schema: default
     location:
-      +schema: dbt-test-location
+      +schema: location
51 changes: 46 additions & 5 deletions dbt/macros/generate_schema_name.sql
@@ -1,16 +1,57 @@
--- Override the default schema naming to remove the dbt-added prefix.
+-- Override the default schema naming to remove the autogenerated prefix
+-- and replace it with our own namespacing on dev and CI.
 -- See: https://docs.getdbt.com/docs/build/custom-schemas
 {% macro generate_schema_name(custom_schema_name, node) -%}
+    {{ return(_generate_schema_name(
+        custom_schema_name,
+        node,
+        target,
+        env_var,
+        exceptions.raise_compiler_error
+    )) }}
+{%- endmacro %}
+
+-- Helper implementation of generate_schema_name where no arguments
+-- are read from the global variable context. Useful for unit testing.
+{% macro _generate_schema_name(
+    custom_schema_name, node, target, env_var_func, raise_error_func
+) %}
+    {#
+        According to the dbt docs linked above, this is required to be set by
+        the built-in macro that we are overriding, but we don't actually use it
+    #}
     {%- set default_schema = target.schema -%}
 
+    {%- if target.name == "dev" -%}
+        {%- set schema_prefix = "dev_" ~ env_var_func("USER") ~ "_" -%}
+    {%- elif target.name == "ci" -%}
+        {%- set github_head_ref = kebab_slugify(
+            env_var_func("GITHUB_HEAD_REF")
+        ) -%}
+        {%- set schema_prefix = "ci_" ~ github_head_ref ~ "_" -%}
+    {%- else -%}
+        {%- set schema_prefix = "" -%}
+    {%- endif -%}
+
     {%- if custom_schema_name is none -%}
 
-        {{ default_schema }}
+        {#
+            The default schema name is not allowed, since we use subdirectory
+            organization to map tables/views to their Athena database
+        #}
+        {{ return(raise_error_func(
+            "Missing schema definition for " ~ node.name ~ ". " ~
+            "Its containing subdirectory is probably missing a `+schema` " ~
+            "attribute under the `models` config in dbt_project.yml."
+        )) }}
 
     {%- else -%}
 
-        {{ custom_schema_name | trim }}
+        {%- set full_schema_name -%}
+            {{ schema_prefix ~ custom_schema_name | trim }}
+        {%- endset -%}
 
     {%- endif -%}
 
-{%- endmacro %}
+    {{ return(full_schema_name) }}
+
+{% endmacro %}
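To make the new behavior concrete, here is a sketch of the schema names the macro above should produce for a model configured with `+schema: default`, consistent with the unit tests added below. The Unix user and branch name are hypothetical examples.

```
-- target "dev",  USER=jecochr                  -> dev_jecochr_default
-- target "ci",   GITHUB_HEAD_REF=fix/issue-28  -> ci_fix-issue-28_default
-- target "prod"                                -> default
-- no +schema configured (custom_schema_name is none) -> compiler error
```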
16 changes: 16 additions & 0 deletions dbt/macros/slugify.sql
@@ -0,0 +1,16 @@
-- Variation of dbt_utils.slugify macro using kebab-case instead of snake_case
{% macro kebab_slugify(string) %}
{#- Lower case the string -#}
{% set string = string | lower %}

{#- Replace spaces, slashes, and underscores with hyphens -#}
{% set string = modules.re.sub('[ _/]+', '-', string) %}

{#- Only take letters, numbers, and hyphens -#}
{% set string = modules.re.sub('[^a-z0-9-]+', '', string) %}

{#- Prepends "_" if string begins with a number -#}
{% set string = modules.re.sub('^[0-9]', '_' + string[0], string) %}

{{ return(string) }}
{% endmacro %}
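As a worked example, applying the lowercasing and regex substitutions above to the branch name raised in the review thread gives (shown for illustration only):

```
-- kebab_slugify("jeancochrane/28-data-catalog-add-production-profile-to-the-dbt-configuration")
--   -> "jeancochrane-28-data-catalog-add-production-profile-to-the-dbt-configuration"
```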
6 changes: 6 additions & 0 deletions dbt/macros/tests/test_all.sql
@@ -0,0 +1,6 @@
-- dbt macro unit testing inspired by this blog post:
-- https://docs.getdbt.com/blog/unit-testing-dbt-packages
{% macro test_all() %}
{% do test_slugify() %}
{% do test_generate_schema_name() %}
{% endmacro %}
76 changes: 76 additions & 0 deletions dbt/macros/tests/test_generate_schema_name.sql
@@ -0,0 +1,76 @@
{% macro test_generate_schema_name() %}
{% do test_generate_schema_name_handles_dev_env() %}
{% do test_generate_schema_name_handles_ci_env() %}
{% do test_generate_schema_name_handles_prod_env() %}
{% do test_generate_schema_name_raises_for_default_schema_name() %}
{% endmacro %}

{% macro mock_env_var(var_name) %}
{% if var_name == "USER" %}
{{ return("testuser") }}
{% elif var_name == "GITHUB_HEAD_REF" %}
{{ return("testuser/feature-branch-1") }}
{% else %}
{{ return("") }}
{% endif %}
{% endmacro %}

{% macro mock_raise_compiler_error(_error) %}
{{ return("Compiler error raised") }}
{% endmacro %}

{% macro test_generate_schema_name_handles_dev_env() %}
{% do assert_equals(
"test_generate_schema_name_handles_dev_env",
_generate_schema_name(
"test",
{"name": "test"},
{"schema": "default", "name": "dev"},
mock_env_var,
exceptions.raise_compiler_error
),
"dev_testuser_test"
) %}
{% endmacro %}

{% macro test_generate_schema_name_handles_ci_env() %}
{% do assert_equals(
"test_generate_schema_name_handles_ci_env",
_generate_schema_name(
"test",
{"name": "test"},
{"schema": "default", "name": "ci"},
mock_env_var,
exceptions.raise_compiler_error
),
"ci_testuser-feature-branch-1_test"
) %}
{% endmacro %}

{% macro test_generate_schema_name_handles_prod_env() %}
{% do assert_equals(
"test_generate_schema_name_handles_prod_env",
_generate_schema_name(
"test",
{"name": "test"},
{"schema": "default", "name": "prod"},
mock_env_var,
exceptions.raise_compiler_error
),
"test"
) %}
{% endmacro %}

{% macro test_generate_schema_name_raises_for_default_schema_name() %}
{% do assert_equals(
"test_generate_schema_name_raises_for_default_schema_name",
_generate_schema_name(
None,
{"name": "test"},
{"schema": "default", "name": "prod"},
mock_env_var,
mock_raise_compiler_error
),
"Compiler error raised"
) %}
{% endmacro %}
60 changes: 60 additions & 0 deletions dbt/macros/tests/test_slugify.sql
@@ -0,0 +1,60 @@
{% macro test_slugify() %}
{% do test_kebab_slugify() %}
{% endmacro %}

{% macro test_kebab_slugify() %}
{% do test_kebab_slugify_lowercases_strings() %}
{% do test_kebab_slugify_replaces_spaces() %}
{% do test_kebab_slugify_replaces_slashes() %}
{% do test_kebab_slugify_replaces_underscores() %}
{% do test_kebab_slugify_removes_special_characters() %}
{% do test_kebab_slugify_handles_leading_numbers() %}
{% endmacro %}

{% macro test_kebab_slugify_lowercases_strings() %}
{{ assert_equals(
"test_kebab_slugify_lowercases_strings",
kebab_slugify("TEST"),
"test"
) }}
{% endmacro %}

{% macro test_kebab_slugify_replaces_spaces() %}
{{ assert_equals(
"test_kebab_slugify_replaces_spaces",
kebab_slugify("t e s t"),
"t-e-s-t"
) }}
{% endmacro %}

{% macro test_kebab_slugify_replaces_slashes() %}
{{ assert_equals(
"test_kebab_slugify_replaces_slashes",
kebab_slugify("t/e/s/t"),
"t-e-s-t"
) }}
{% endmacro %}

{% macro test_kebab_slugify_replaces_underscores() %}
{{ assert_equals(
"test_kebab_slugify_replaces_underscores",
kebab_slugify("t_e_s_t"),
"t-e-s-t"
) }}
{% endmacro %}

{% macro test_kebab_slugify_removes_special_characters() %}
{{ assert_equals(
"test_kebab_slugify_removes_special_characters",
kebab_slugify("t!@#e$%^s&*()t"),
"test"
) }}
{% endmacro %}

{% macro test_kebab_slugify_handles_leading_numbers() %}
{{ assert_equals(
"test_kebab_slugify_handles_leading_numbers",
kebab_slugify("123test"),
"_123test"
) }}
{% endmacro %}
9 changes: 9 additions & 0 deletions dbt/macros/tests/utils.sql
@@ -0,0 +1,9 @@
{% macro assert_equals(test_name, value, expected) %}
{% if value == expected %}
{% do log(test_name ~ " - PASS", info=True) %}
{% else %}
{% do exceptions.raise_compiler_error(
test_name ~ " - FAIL: " ~ value ~ " != " ~ expected
) %}
{% endif %}
{% endmacro %}
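Any new macro test can follow the same pattern as the test files above; a hypothetical example (the test name and input are illustrative, not part of this PR):

```
{% macro test_kebab_slugify_collapses_repeated_separators() %}
    {{ assert_equals(
        "test_kebab_slugify_collapses_repeated_separators",
        kebab_slugify("a__b//c"),
        "a-b-c"
    ) }}
{% endmacro %}
```

To actually run, it would also need to be called from `test_kebab_slugify()` (and thus `test_all`) so that `dbt run-operation test_all` picks it up.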
19 changes: 19 additions & 0 deletions dbt/profiles.yml
@@ -11,3 +11,22 @@ athena:
# "database" here corresponds to a Glue data catalog
database: awsdatacatalog
threads: 5
ci:
type: athena
s3_staging_dir: s3://ccao-dbt-athena-ci-us-east-1/results/
s3_data_dir: s3://ccao-dbt-athena-ci-us-east-1/data/
region_name: us-east-1
schema: dbt-test
database: awsdatacatalog
# Prefix all generated data by schema, so that we can delete it when the
# PR is merged
s3_data_naming: schema_table
The PR author (Contributor) commented on lines +21 to +23:

I am actually not totally sure if this works yet, since we're not yet using dbt to manage our CTAS tables, but the idea here is that eventually, when dbt builds CTAS tables and stores them in S3, the `schema_table` config will tell it to store those files in the configured `s3_data_dir` bucket under the path `{s3_data_dir}/{schema}/{table}/` (docs). That way, when we're ready to have CD clean up the resources generated for the PR, we can use the schema name defined by `generate_schema_name` to delete everything in `{s3_data_dir}/{schema}/` and leave resources used by other PRs intact.

      threads: 5
    prod:
      type: athena
      s3_staging_dir: s3://ccao-athena-results-us-east-1/
      s3_data_dir: s3://ccao-athena-data-us-east-1/
      region_name: us-east-1
      schema: default
      database: awsdatacatalog
      threads: 5
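To illustrate the cleanup idea described in the review comment above: with `s3_data_naming: schema_table`, CI data should land under a per-schema prefix that a future CD job could delete wholesale once the PR merges. The branch slug and cleanup command below are a hypothetical sketch, not something implemented in this PR.

```
# Data written by a CI run for one branch would land under paths like:
#   s3://ccao-dbt-athena-ci-us-east-1/data/ci_my-feature-branch_default/<table>/...
# A possible cleanup step after merge:
aws s3 rm s3://ccao-dbt-athena-ci-us-east-1/data/ci_my-feature-branch_default/ --recursive
```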