Adjust dbt profiles and schema macro to support prod/CI targets #44

@jeancochrane jeancochrane commented Jul 31, 2023

This PR adjusts our dbt profiles and our custom generate_schema_name macro to support production and CI targets for the data catalog.

The way this configuration works is:

  • We can choose which target environment we want to build or test using the --target flag that is accepted by all dbt commands. If no flag is provided, the default environment is dev.
  • When building a target in a specific environment, dbt builds it into an Athena database whose name carries an automatic prefix that namespaces the environment. We use prefix namespacing instead of building databases in isolated data catalogs because our dev and CI environments need prebuilt tables from the default catalog to save time when running build and test commands.
    • The dev environment adds a prefix matching the caller's Unix username, e.g. targets built into the database named default are written to dev-jecochr-default when dbt run is run on my machine.
    • The ci environment adds a prefix matching the name of the branch, e.g. ci-feature-branch-1-default, with special characters removed and all letters lowercased to meet dbt-athena's requirements.
    • The prod environment does not add any prefix, e.g. default.
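
The naming rules above can be sketched in Python for illustration. This is not the Jinja macro from this PR: the function names mirror the macros exercised by the tests, but the signatures, the exact slugification rules, and the error raised for a missing schema name are assumptions.

```python
import re


def kebab_slugify(value: str) -> str:
    """Lowercase a string and reduce it to hyphen-separated alphanumerics."""
    value = value.lower()
    # Treat whitespace, slashes, and underscores as word separators.
    value = re.sub(r"[\s/_]+", "-", value)
    # Drop anything that is not a lowercase letter, digit, or hyphen.
    value = re.sub(r"[^a-z0-9-]", "", value)
    # Collapse repeated hyphens and trim them from both ends.
    return re.sub(r"-{2,}", "-", value).strip("-")


def generate_schema_name(target, schema_name, user=None, branch=None):
    """Return the namespaced database name for a target environment."""
    if schema_name is None:
        # Assumption: every model must declare an explicit schema name.
        raise ValueError("a custom schema name is required")
    if target == "prod":
        return schema_name
    if target == "ci":
        return f"ci-{kebab_slugify(branch)}-{schema_name}"
    if target == "dev":
        return f"dev-{user}-{schema_name}"
    raise ValueError(f"unknown target: {target}")


print(generate_schema_name("dev", "default", user="jecochr"))
print(generate_schema_name("ci", "default", branch="Feature/Branch_1"))
print(generate_schema_name("prod", "default"))
```

Under these assumptions the three calls yield dev-jecochr-default, ci-feature-branch-1-default, and default, matching the examples in the list above.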

A quick test confirming this behavior on my machine:

(venv) jecochr@21CCAO-LAPTOP54:~/code/data-architecture/dbt$ dbt run --select vw_pin10_location_test
16:49:17  Running with dbt=1.5.4
16:49:18  Registered adapter: athena=1.5.1
16:49:18  Unable to do partial parsing because change detected to override macro. Starting full parse.
16:49:18  Found 10 models, 19 tests, 0 snapshots, 0 analyses, 456 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics, 0 groups
16:49:18
16:49:22  Concurrency: 5 threads (target='dev')
16:49:22
16:49:22  1 of 1 START sql view model jecochr-location.vw_pin10_location_test ............ [RUN]
16:49:25  1 of 1 OK created sql view model jecochr-location.vw_pin10_location_test ....... [OK -1 in 2.71s]
16:49:25
16:49:25  Finished running 1 view model in 0 hours 0 minutes and 6.22 seconds (6.22s).
16:49:25
16:49:25  Completed successfully
16:49:25
16:49:25  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

This PR also adds a new framework for unit testing macros. Macro tests can now be run via the following command:

$ dbt run-operation test_all

Sample output:

(venv) jecochr@21CCAO-LAPTOP54:~/code/data-architecture/dbt$ dbt run-operation test_all
20:51:23  Running with dbt=1.5.4
20:51:23  Registered adapter: athena=1.5.1
20:51:24  Found 10 models, 19 tests, 0 snapshots, 0 analyses, 475 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics, 0 groups
20:51:24  test_kebab_slugify_lowercases_strings - PASS
20:51:24  test_kebab_slugify_replaces_spaces - PASS
20:51:24  test_kebab_slugify_replaces_slashes - PASS
20:51:24  test_kebab_slugify_replaces_underscores - PASS
20:51:24  test_kebab_slugify_removes_special_characters - PASS
20:51:24  test_kebab_slugify_handles_leading_numbers - PASS
20:51:24  test_generate_schema_name_handles_dev_env - PASS
20:51:24  test_generate_schema_name_handles_ci_env - PASS
20:51:24  test_generate_schema_name_handles_prod_env - PASS
20:51:24  test_generate_schema_name_raises_for_default_schema_name - PASS
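
The framework itself is written as dbt macros and invoked through dbt run-operation. The Python analogue below only illustrates the pattern the output above suggests: one entry point runs each test and prints its name with a status. Every name here (assert_equals, run_all, the check_ functions) is hypothetical and not taken from the PR.

```python
def assert_equals(actual, expected):
    """Raise AssertionError with a readable message when values differ."""
    if actual != expected:
        raise AssertionError(f"expected {expected!r}, got {actual!r}")


def check_lowercases_strings():
    assert_equals("ABC".lower(), "abc")


def check_replaces_spaces():
    assert_equals("a b".replace(" ", "-"), "a-b")


def run_all(checks):
    """Run each check, print one status line per check, return the failures."""
    failures = []
    for check in checks:
        try:
            check()
            print(f"{check.__name__} - PASS")
        except AssertionError as exc:
            print(f"{check.__name__} - FAIL: {exc}")
            failures.append(check.__name__)
    return failures


failed = run_all([check_lowercases_strings, check_replaces_spaces])
```

Returning the list of failed names lets a caller, such as a CI step, decide whether to exit with an error.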

Closes #28.

dbt/dbt_project.yml
Comment on lines +21 to +23
# Prefix all generated data by schema, so that we can delete it when the
# PR is merged
s3_data_naming: schema_table
Contributor Author

I'm not actually sure this works yet, since we're not yet using dbt to manage our CTAs. The idea is that once dbt builds CTAs and stores them in S3, the schema_table config will tell it to store those files in the configured s3_data_dir bucket under the path {s3_data_dir}/{schema}/{table}/ (docs). That way, when we're ready to have CD clean up the resources generated for the PR, we can use the schema name defined by the generate_schema_name macro to delete everything in {s3_data_dir}/{schema}/ and leave resources used by other PRs intact.
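
A rough Python sketch of that cleanup idea, assuming the path layout described above. The bucket, schema, and table names are made up, and a real cleanup job would list and delete objects through an S3 client rather than filtering an in-memory list.

```python
def table_location(s3_data_dir: str, schema: str, table: str) -> str:
    """Build the {s3_data_dir}/{schema}/{table}/ path described above."""
    return f"{s3_data_dir.rstrip('/')}/{schema}/{table}/"


def schema_prefix(s3_data_dir: str, schema: str) -> str:
    """Build the prefix shared by every table in one schema."""
    # The trailing slash keeps "ci-branch-a-default" from also matching
    # a sibling schema such as "ci-branch-a-default-2".
    return f"{s3_data_dir.rstrip('/')}/{schema}/"


def paths_to_delete(paths, s3_data_dir, schema):
    """Select only the paths that live under the given schema's prefix."""
    prefix = schema_prefix(s3_data_dir, schema)
    return [path for path in paths if path.startswith(prefix)]


# Hypothetical bucket and schema names, for illustration only.
data_dir = "s3://example-bucket/data"
paths = [
    table_location(data_dir, "ci-branch-a-default", "t1"),
    table_location(data_dir, "ci-branch-a-default-2", "t1"),
    table_location(data_dir, "ci-branch-b-default", "t1"),
]
print(paths_to_delete(paths, data_dir, "ci-branch-a-default"))
```

Matching on the full schema prefix, trailing slash included, is what would leave the data of other PRs untouched.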

@jeancochrane jeancochrane force-pushed the jeancochrane/28-data-catalog-add-production-profile-to-the-dbt-configuration branch from 6c0fa37 to 902f61d Compare July 31, 2023 17:36
@jeancochrane jeancochrane marked this pull request as ready for review July 31, 2023 17:47
@jeancochrane jeancochrane requested a review from a team as a code owner July 31, 2023 17:47
@jeancochrane
Contributor Author

Requesting both @dfsnow and @wrridgeway since this PR makes some major architectural choices about how we plan to structure dbt environments, but I only need a detailed review from one of you!

@@ -48,14 +48,24 @@ Build the models to create views in our Athena warehouse:
dbt run
```

By default, all `dbt` commands will run against the `dev` environment, which
namespaces the resources it creates by prefixing target database names with
your Unix `$USER` name (e.g. `jecochr-default` for the `default` database when
Member

@dfsnow dfsnow Jul 31, 2023

question (blocking): How does this work when the name of a branch includes an unexpected character (e.g. the / in jeancochrane/28-data-catalog-add-production-profile-to-the-dbt-configuration)?

Contributor Author

Good question! I think we can replace special characters (and uppercase letters, which are not allowed in dbt-athena) using Jinja template filters. I'll take a stab at that now.

Contributor Author

@dfsnow This spiraled a little bit, but I was able to make it work in 181e516! It required a new custom macro, but that led me to realize that we were starting to build a lot of logic into a collection of untested macros, so I took a minute to design a quick unit testing framework for our macros. Curious what you think!

Member

This is great! Nice work and definitely needed since this is getting more complex. I'm a dumb dumb and commented on a commit instead of the PR. Linking here: 181e516#r123228596

@dfsnow
Member

dfsnow commented Jul 31, 2023

Overall, I think this looks great. My one note is that it might be worthwhile to prefix the prefix with something that identifies a dev table so that we can easily grep them using the CLI (I suggest dev-).

So jecochr-location.vw_pin10_location_test would become dev-jecochr-location.vw_pin10_location_test.

@jeancochrane jeancochrane requested a review from dfsnow July 31, 2023 21:02
@jeancochrane jeancochrane merged commit f0e4c72 into data-catalog Aug 1, 2023
@jeancochrane jeancochrane deleted the jeancochrane/28-data-catalog-add-production-profile-to-the-dbt-configuration branch August 1, 2023 15:06