Adjust dbt profiles and schema macro to support prod/CI targets #44

jeancochrane · 2023-07-31T17:17:49Z

This PR adjusts our dbt profiles and our custom generate_schema_name macro to support production and CI targets for the data catalog.

The way this configuration works is:

We can choose which target environment we want to build or test using the --target flag that is accepted by all dbt commands. If no flag is provided, the default environment is dev.
When building a target in a specific environment, dbt builds it into an Athena database whose name carries an automatic prefix that namespaces the environment. We use prefix namespacing to distinguish between environments, rather than building databases in isolated data catalogs, because our dev and CI environments need to reuse prebuilt tables from the default catalog to save time when running build and test commands.
- The dev environment adds a prefix matching the caller's Unix username, e.g. the output database will be called dev-jecochr-default for targets built into the database named default when dbt run is run on my machine.
- The ci environment adds a prefix matching the name of the branch, e.g. ci-feature-branch-1-default, with special characters removed and all letters forcibly lowercased to match dbt-athena's requirements.
  - When we set up CI in [Data catalog] Define GitHub Actions workflows for building the dbt DAG and running tests #31, we will pull this value from the builtin $GITHUB_HEAD_REF env variable (docs), which holds the name of the pull request's source branch. We will also detect when the pull request has been closed or merged (docs) and use that as a trigger to clean up any resources that were created for the PR.
- The prod environment does not add any prefix, e.g. default.
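
For readers who want the prefixing rules in one place, here is a minimal Python sketch of the logic described above. It is not the macro itself (that lives in Jinja in dbt/macros/generate_schema_name.sql); the function signature, the `user` and `branch` parameters, and the exact slug rules are assumptions made for illustration, and the output format mirrors the examples in this description.

```python
import re


def generate_schema_name(target: str, schema: str, user: str = "", branch: str = "") -> str:
    """Return the namespaced database name for a dbt target (illustrative only)."""
    if target == "prod":
        # Production databases are not prefixed.
        return schema
    if target == "ci":
        # Assumption: lowercase the branch name and collapse every run of
        # characters other than a-z and 0-9 into a single hyphen.
        slug = re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")
        return f"ci-{slug}-{schema}"
    if target == "dev":
        # Dev databases are namespaced by the caller's Unix username.
        return f"dev-{user}-{schema}"
    raise ValueError(f"unknown target: {target!r}")
```

Calling `generate_schema_name("dev", "default", user="jecochr")` gives `dev-jecochr-default`, matching the dev example above.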

A quick test confirming this behavior on my machine:

(venv) jecochr@21CCAO-LAPTOP54:~/code/data-architecture/dbt$ dbt run --select vw_pin10_location_test
16:49:17  Running with dbt=1.5.4
16:49:18  Registered adapter: athena=1.5.1
16:49:18  Unable to do partial parsing because change detected to override macro. Starting full parse.
16:49:18  Found 10 models, 19 tests, 0 snapshots, 0 analyses, 456 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics, 0 groups
16:49:18
16:49:22  Concurrency: 5 threads (target='dev')
16:49:22
16:49:22  1 of 1 START sql view model jecochr-location.vw_pin10_location_test ............ [RUN]
16:49:25  1 of 1 OK created sql view model jecochr-location.vw_pin10_location_test ....... [OK -1 in 2.71s]
16:49:25
16:49:25  Finished running 1 view model in 0 hours 0 minutes and 6.22 seconds (6.22s).
16:49:25
16:49:25  Completed successfully
16:49:25
16:49:25  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

This PR also adds a new framework for unit testing macros. Macro tests can now be run via the following command:

$ dbt run-operation test_all

Sample output:

(venv) jecochr@21CCAO-LAPTOP54:~/code/data-architecture/dbt$ dbt run-operation test_all
20:51:23  Running with dbt=1.5.4
20:51:23  Registered adapter: athena=1.5.1
20:51:24  Found 10 models, 19 tests, 0 snapshots, 0 analyses, 475 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics, 0 groups
20:51:24  test_kebab_slugify_lowercases_strings - PASS
20:51:24  test_kebab_slugify_replaces_spaces - PASS
20:51:24  test_kebab_slugify_replaces_slashes - PASS
20:51:24  test_kebab_slugify_replaces_underscores - PASS
20:51:24  test_kebab_slugify_removes_special_characters - PASS
20:51:24  test_kebab_slugify_handles_leading_numbers - PASS
20:51:24  test_generate_schema_name_handles_dev_env - PASS
20:51:24  test_generate_schema_name_handles_ci_env - PASS
20:51:24  test_generate_schema_name_handles_prod_env - PASS
20:51:24  test_generate_schema_name_raises_for_default_schema_name - PASS

Closes #28.

dbt/dbt_project.yml

jeancochrane · 2023-07-31T17:25:14Z

dbt/profiles.yml

+      # Prefix all generated data by schema, so that we can delete it when the
+      # PR is merged
+      s3_data_naming: schema_table


I am actually not totally sure if this works yet, since we're not yet using dbt to manage our CTAS tables. The idea is that eventually, when dbt builds CTAS tables and stores them in S3, the schema_table config will tell it to store those files in the configured s3_data_dir bucket under the path {s3_data_dir}/{schema}/{table}/ (docs). That way, when we're ready to have CD clean up the resources generated for a PR, we can use the schema name defined by the generate_schema_name macro to delete everything in {s3_data_dir}/{schema}/ and leave resources that are being used by other PRs intact.
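
To make the cleanup idea concrete, here is a small Python sketch of how the objects belonging to one PR's schema could be selected by prefix. It only does the path matching and does not talk to S3; the function names and the bucket and schema names in the usage note are hypothetical, and as noted above none of this has been tested against dbt-managed CTAS output yet.

```python
from typing import Iterable, List


def schema_prefix(s3_data_dir: str, schema: str) -> str:
    """Build the S3 prefix that holds every table for one schema."""
    # The trailing slash matters: without it, a schema named "ci-a" would
    # also match objects that belong to a schema named "ci-a-b".
    return f"{s3_data_dir.rstrip('/')}/{schema}/"


def objects_to_delete(uris: Iterable[str], s3_data_dir: str, schema: str) -> List[str]:
    """Select only the object URIs stored under the given schema's prefix."""
    prefix = schema_prefix(s3_data_dir, schema)
    return [uri for uri in uris if uri.startswith(prefix)]
```

For example, with a made-up data dir of `s3://example-bucket/data/` and schema `ci-a`, an object under `s3://example-bucket/data/ci-a/some_table/` is selected, while one under `s3://example-bucket/data/ci-a-b/some_table/` is left alone.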

dbt/profiles.yml

jeancochrane · 2023-07-31T17:47:10Z

Requesting both @dfsnow and @wrridgeway since this PR makes some major architectural choices about how we plan to structure dbt environments, but I only need a detailed review from one of you!

dfsnow · 2023-07-31T17:53:22Z

dbt/README.md

@@ -48,14 +48,24 @@ Build the models to create views in our Athena warehouse:
 dbt run
 ```

+By default, all `dbt` commands will run against the `dev` environment, which
+namespaces the resources it creates by prefixing target database names with
+your Unix `$USER` name (e.g. `jecochr-default` for the `default` database when


question (blocking): How does this work when the name of a branch includes an unexpected character (i.e. the / in jeancochrane/28-data-catalog-add-production-profile-to-the-dbt-configuration)?

Good question! I think we can replace special characters (and uppercase letters, which are not allowed in dbt-athena) using Jinja template filters. I'll take a stab at that now.

@dfsnow This spiraled a little bit, but I was able to make it work in 181e516! It required a new custom macro, but that led me to realize that we were starting to build a lot of logic into a collection of untested macros, so I took a minute to design a quick unit testing framework for our macros. Curious what you think!

This is great! Nice work and definitely needed since this is getting more complex. I'm a dumb dumb and commented on a commit instead of the PR. Linking here: 181e516#r123228596

dbt/macros/generate_schema_name.sql

dfsnow · 2023-07-31T18:48:04Z

Overall, I think this looks great. My one note is that it might be worthwhile to prefix the prefix with something that identifies a dev table so that we can easily grep them using the CLI (I suggest dev-).

So jecochr-location.vw_pin10_location_test would become dev-jecochr-location.vw_pin10_location_test.

Adjust dbt profiles and schema macro to support prod/CI targets

1fd82dd

Adjust the dbt README to document the use of the --target flag

902f61d

jeancochrane force-pushed the jeancochrane/28-data-catalog-add-production-profile-to-the-dbt-configuration branch from 6c0fa37 to 902f61d Compare July 31, 2023 17:36

jeancochrane marked this pull request as ready for review July 31, 2023 17:47

jeancochrane requested a review from a team as a code owner July 31, 2023 17:47

jeancochrane requested review from dfsnow and wrridgeway July 31, 2023 17:47

dfsnow reviewed Jul 31, 2023

dbt/macros/generate_schema_name.sql (outdated)

jeancochrane added 2 commits July 31, 2023 13:48

Switch from GITHUB_BASE_REF to GITHUB_HEAD_REF for dbt schema generation macro

62dc2b3

Slugify branch names in dbt CI environment and add macro unit tests

181e516

jeancochrane requested a review from dfsnow July 31, 2023 21:02

Adjust dbt generate_schema_name logic to separate keywords with underscores

2f57842

See comment thread here for reasoning: 181e516#r123228596

dfsnow approved these changes Jul 31, 2023

jeancochrane merged commit f0e4c72 into data-catalog Aug 1, 2023

jeancochrane deleted the jeancochrane/28-data-catalog-add-production-profile-to-the-dbt-configuration branch August 1, 2023 15:06

jeancochrane mentioned this pull request Aug 1, 2023

[Data catalog] Add production profile to the dbt configuration #28

Closed

jeancochrane mentioned this pull request Aug 9, 2023

Rename dbt models to remove the _test suffix #59

Merged
