[Data catalog] Define GitHub Actions workflows for building the dbt DAG and running tests #31

Closed
jeancochrane opened this issue Jul 26, 2023 · 1 comment

jeancochrane commented Jul 26, 2023

Overview

We want to use GitHub Actions to run the dbt run and dbt test commands that build and test our DAG, respectively. To do this, we'll need to add GitHub Actions workflow definitions to this repository.

Workflow definitions

Our workflow definition should support two different types of flows:

  1. Rebuild all models that have changed since the last cached run, and run their tests
  2. Run all the tests, regardless of the result of the last cached run

These two flows serve different purposes:

  1. The first flow will provide continuous integration (CI) for dbt models in this repository, building and testing models when we make code changes to them.
  2. The second flow will provide a test interface that we can call from the GitHub Actions workflow API to run data integrity checks after pulling fresh source data from the system of record each night.
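As a rough illustration of how a nightly job might call the second flow through the workflow API, the sketch below builds (but does not send) a request to GitHub's REST endpoint for creating a workflow dispatch event. The owner, repository, workflow file name, and ref are placeholders supplied by the caller, not values taken from this repository, and the sketch assumes the target workflow declares a workflow_dispatch trigger.

```python
import json
import urllib.request


def build_dispatch_request(owner, repo, workflow_file, ref, token):
    """Build a POST request that asks GitHub to run a workflow.

    Every argument is a caller-supplied placeholder; nothing here is
    specific to this repository.
    """
    url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/actions/workflows/{workflow_file}/dispatches"
    )
    # The dispatch endpoint expects the git ref to run the workflow on.
    body = json.dumps({"ref": ref}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
```

A nightly job could pass the returned object to urllib.request.urlopen once the source data pull has finished. The token would need permission to trigger workflows on the repository.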

Caching is important in CI to help speed up development cycles, but it's unnecessary in the context of our nightly data integrity checks, where we want to validate all of the data on each run.
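To make the difference between the two flows concrete, here is a minimal sketch of the dbt invocations each one might issue, expressed as argument lists. It assumes the "cache" is a directory of dbt artifacts from an earlier run that dbt can compare against using its state:modified selector and --state flag; the function name, the flow labels, and the default state_dir path are hypothetical.

```python
def dbt_commands(flow, state_dir="cache/master"):
    """Return the dbt invocations for a flow as argv lists.

    flow is "ci" (changed models only) or "nightly" (all tests).
    state_dir is a hypothetical path to artifacts from a previous run.
    """
    if flow == "ci":
        # Limit both the build and the tests to models that differ
        # from the manifest stored in state_dir.
        selector = ["--select", "state:modified", "--state", state_dir]
        return [["dbt", "run", *selector], ["dbt", "test", *selector]]
    if flow == "nightly":
        # No selector: run every test, ignoring any cached state.
        return [["dbt", "test"]]
    raise ValueError(f"unknown flow: {flow!r}")
```

Each list could be handed to subprocess.run in order. Whether CI should also cover models downstream of a change (state:modified+) is a design choice this sketch leaves open.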

Cache behavior

The CI workflow (1 above) should exhibit the following cache behavior:

  • On every PR:
    • Run the build and tests for models that have changed since the last commit to master OR since the last successful workflow run for this PR
      • In other words: The first workflow run for any PR should use the cache from the master branch, and subsequent runs should use the cache from the most recent successful run on the PR branch
      • These builds and tests should run in a separate development environment, ideally one that is created exclusively for the PR and not shared by other PRs; we should use the same environment scheme set up in [Data catalog] Add production profile to the dbt configuration #28
      • The master branch cache should never be updated by this flow
  • On commits to master:
    • Run builds and tests for models that have changed since the last commit to master
      • These builds and tests should run against the prod Athena environment
      • The master branch cache should be updated when this flow succeeds
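The cache rules above can be sketched as a small decision function. This is only an illustration of the intended behavior, not a proposed implementation: the event labels, cache key names, and environment names are all hypothetical.

```python
def cache_plan(event, pr_number=None):
    """Return (restore_keys, save_key, environment) for a workflow run.

    event is "pull_request" or "push_to_master". Cache keys and
    environment names are made up for illustration.
    """
    master_key = "dbt-cache-master"
    if event == "pull_request":
        if pr_number is None:
            raise ValueError("pull_request runs need a PR number")
        pr_key = f"dbt-cache-pr-{pr_number}"
        # Prefer this PR's own cache; fall back to master when the PR
        # has no successful run yet. Saving only under pr_key means the
        # master cache is never updated by this flow.
        return [pr_key, master_key], pr_key, f"dev-pr-{pr_number}"
    if event == "push_to_master":
        # Compare against, and then refresh, the master cache.
        return [master_key], master_key, "prod"
    raise ValueError(f"unknown event: {event!r}")
```

In both cases the cache under save_key should only be written after the build and tests succeed, so a failed run never replaces a known-good cache.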

AWS credentials

To run dbt commands against Athena from a GitHub Actions workflow, we'll need to inject valid AWS credentials into the workflow. Credentials should be stored as encrypted secrets, and their permissions should be restricted as much as possible to limit the damage if they are ever exposed. See the dbt-athena docs for a list of the permissions the adapter requires.
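One way a workflow step might fail fast when secrets are missing is sketched below. It assumes the credentials reach the job as the standard AWS SDK environment variables, which this issue does not specify; the function name is hypothetical. The check reports variable names only and never prints their values.

```python
import os

# Standard environment variable names read by the AWS SDKs.
REQUIRED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
)


def missing_aws_vars(environ=None):
    """Return the names of required AWS variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

A step could call missing_aws_vars() before invoking dbt and exit with an error listing the missing names, which gives a clearer failure than an authentication error from Athena partway through a run.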

Incremental testing

Since our CI tests will only be useful if they can distinguish between expected and unexpected data integrity issues, this issue depends on #32.

jeancochrane commented

Closed by #50.
