Define GitHub actions workflows for building and testing dbt #50

Merged
59 commits
88494fd
Define GitHub workflows for building and testing dbt models
jeancochrane Aug 2, 2023
997f3e4
Configure AWS credentials in GitHub Actions build and test workflows
jeancochrane Aug 2, 2023
bc0e000
Centralize dbt env vars in GitHub Actions workflows
jeancochrane Aug 2, 2023
ff4fd81
Tweaks to dbt GitHub Actions workflow definition ahead of testing
jeancochrane Aug 3, 2023
297f7ec
Fix linting problems with dbt GitHub actions and workflows
jeancochrane Aug 3, 2023
0396849
Empty commit to trigger CI build
jeancochrane Aug 3, 2023
f833271
Rename local GitHub actions to match expected dir/action.yaml naming …
jeancochrane Aug 3, 2023
3fb81f6
Add permissions to interact with GitHub OIDC to dbt actions
jeancochrane Aug 3, 2023
5b37bf8
Try new format for build and test cache key on CI
jeancochrane Aug 3, 2023
e8ec9d0
Strip extraneous dollar sign from dbt workflow cache key
jeancochrane Aug 3, 2023
f4ba9de
Try different format for dbt directory paths in workflow env vars
jeancochrane Aug 3, 2023
8530109
Rename dbt workflow env vars to avoid collisions with dbt internal en…
jeancochrane Aug 3, 2023
b26cefc
Make sure STATE_ARGS env var is never empty in build_and_test_dbt wor…
jeancochrane Aug 3, 2023
3ff6d98
Try new format for reading dbt commands from env vars in GitHub workflow
jeancochrane Aug 3, 2023
0607be7
Add step to build_and_test_dbt workflow to test dbt installation
jeancochrane Aug 3, 2023
98a42ef
Try a different quoting scheme for RUN_CMD and TEST_CMD in build_and_…
jeancochrane Aug 3, 2023
03e4dc1
Define build/test commands directly instead of via env vars in dbt wo…
jeancochrane Aug 3, 2023
b99d0da
Log all conditional branches in build_and_test_dbt_models workflow
jeancochrane Aug 3, 2023
829a456
See if removing hyphens from database names appeases dbt-athena on CI
jeancochrane Aug 3, 2023
cb0299b
Merge data-catalog into jeancochrane/31-data-catalog-define-github-ac…
jeancochrane Aug 3, 2023
31de4e0
Temporarily enable dbt debugging to try to figure out AWS permissions
jeancochrane Aug 3, 2023
7b0037a
Try reverting dbt schema naming back to kebab_slugify
jeancochrane Aug 4, 2023
9bf6cac
Remove --debug flag from dbt run call in build_and_test_dbt workflow
jeancochrane Aug 4, 2023
76db394
Bump error thresholds for four dbt tests
jeancochrane Aug 4, 2023
4d552aa
Add step to cleanup resources to build_and_test_dbt workflow
jeancochrane Aug 4, 2023
27f556c
Clean up cleanup_dbt_resources.sh script for use in CI
jeancochrane Aug 7, 2023
15a86d2
Bump allowed errors in dbt tests due to data problems
jeancochrane Aug 7, 2023
7ab4b69
Update build_and_test_dbt workflow to run when PRs are closed
jeancochrane Aug 7, 2023
9304b2f
Try apt-get instead of apt for installing jq in build_and_test_dbt wo…
jeancochrane Aug 7, 2023
b291800
Temporarily disable PR event restriction on dbt cleanup install step …
jeancochrane Aug 7, 2023
b4ec438
Try sudo apt-get for installing jq in build_and_test_dbt workflow
jeancochrane Aug 7, 2023
52fc11a
Remove installation step for dbt cleanup in build_and_test_dbt workflow
jeancochrane Aug 7, 2023
640387f
Enforce jq as a requirement for cleanup_dbt_resources.sh script
jeancochrane Aug 7, 2023
dfe8785
Fix path to cleanup_dbt_resources.sh on CI
jeancochrane Aug 7, 2023
6d685a0
Revert "Remove installation step for dbt cleanup in build_and_test_db…
jeancochrane Aug 7, 2023
1e5331d
Temporarily disable PR event restriction on cleanup in build_and_test…
jeancochrane Aug 7, 2023
620877b
Revert "Enforce jq as a requirement for cleanup_dbt_resources.sh script"
jeancochrane Aug 7, 2023
f2d0508
Revert "Temporarily disable PR event restriction on cleanup in build_…
jeancochrane Aug 7, 2023
cf5995b
Revert "Temporarily disable PR event restriction on dbt cleanup insta…
jeancochrane Aug 7, 2023
d34c83b
Temporarily run test_dbt_models workflow on PRs so we can dispatch it…
jeancochrane Aug 7, 2023
d2909ea
Give more verbose names to dbt workflow jobs
jeancochrane Aug 7, 2023
7538f9e
Revert "Temporarily run test_dbt_models workflow on PRs so we can dis…
jeancochrane Aug 7, 2023
c3a7a62
Try adding push to test_dbt_models workflow definition to test dispatch
jeancochrane Aug 7, 2023
e10540f
Revert "Try adding push to test_dbt_models workflow definition to tes…
jeancochrane Aug 7, 2023
b5cec8d
Add docstring to cleanup_dbt_resources.sh
jeancochrane Aug 7, 2023
68f914c
Run `dbt run` with --defer on CI to inherit built resources
jeancochrane Aug 7, 2023
1d4ec5a
Don't use build cache in test_dbt_models workflow
jeancochrane Aug 7, 2023
8af875c
Try adding push to test_dbt_models workflow definition to test it again
jeancochrane Aug 7, 2023
87353b2
Temporarily add --debug flag to dbt call in test_dbt_models
jeancochrane Aug 7, 2023
37ba0bb
Change `push` to `pull_request` for testing test_dbt_models workflow
jeancochrane Aug 7, 2023
b36fec7
Remove --debug flag from test_dbt_models workflow
jeancochrane Aug 7, 2023
5bdf9a2
Change test_dbt_models workflow to only run on dispatch
jeancochrane Aug 7, 2023
cef322a
Cache dbt and Python requirements in install_dbt_requirements action
jeancochrane Aug 8, 2023
e51c677
Use sed to strip comment lines in load_environment_variables composit…
jeancochrane Aug 8, 2023
f9b5d86
Kebab case dbt build and test workflow names
jeancochrane Aug 8, 2023
704ed13
Factor out cleanup-dbt-resources into its own workflow
jeancochrane Aug 8, 2023
1738374
Set GITHUB_HEAD_REF var in test_dbt_models workflow
jeancochrane Aug 8, 2023
672a7d4
Bump threshold for vw_pin_appeal dbt test failures
jeancochrane Aug 8, 2023
60c2bb1
Factor out composite GitHub action for configure_dbt_environment to s…
jeancochrane Aug 8, 2023
27 changes: 27 additions & 0 deletions .github/actions/install_dbt_requirements/action.yaml
@@ -0,0 +1,27 @@
name: Install dbt dependencies
description: Installs Python and dbt requirements for a workflow
inputs:
  dbt_project_dir:
    description: Path to the directory containing the dbt project.
    required: false
    default: ./dbt
  requirements_file_path:
    description: Path to Python requirements file.
    required: false
    default: ./dbt/requirements.txt
Comment on lines +3 to +11
Contributor Author

It's not strictly necessary for us to factor out these variables, since we don't currently have plans to reuse this action outside of our current dbt project, but the GitHub Actions docs recommend using variables instead of hardcoded paths, so I'm doing so here:

We strongly recommend that actions use variables to access the filesystem rather than using hardcoded file paths.

Member

I actually didn't know you could re-use in-repo actions. I sillily used a composable workflow in PTAXSIM instead. Will fix: ccao-data/ptaxsim#15. Thanks for teaching me something today!

runs:
  using: composite
  steps:
    - name: Setup python
      uses: actions/setup-python@v4
      with:
        python-version: 3.x

    - name: Install python requirements
      run: python -m pip install -r ${{ inputs.requirements_file_path }}
      shell: bash

    - name: Install dbt requirements
      run: dbt deps
      working-directory: ${{ inputs.dbt_project_dir }}
      shell: bash
15 changes: 15 additions & 0 deletions .github/actions/load_environment_variables/action.yaml
Contributor Author

Similarly, both workflows share a common set of environment variables that are loaded using this action.
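
As a rough illustration of what this action does, the sketch below mimics its `sed "" <files> >> "$GITHUB_ENV"` step locally. The temporary directory, the second file `extra.env`, and the `load_env_files` helper are hypothetical and exist only for this example; on a real runner `GITHUB_ENV` is provided by GitHub Actions.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: append every KEY=VALUE file passed as an argument
# to the file named by $GITHUB_ENV, as the composite action's step does.
load_env_files() {
  sed "" "$@" >> "$GITHUB_ENV"
}

# Stand-ins for the runner-provided env file and .github/variables/
workdir=$(mktemp -d)
export GITHUB_ENV="$workdir/github_env"
mkdir "$workdir/variables"
printf 'CACHE_NAME=dbt-cache\nMANIFEST_DIR=dbt/target\nPROJECT_DIR=dbt\n' \
  > "$workdir/variables/dbt.env"
printf 'EXAMPLE_VAR=example\n' > "$workdir/variables/extra.env"

# The glob expands to every file in the directory, like the default input
load_env_files "$workdir"/variables/*
cat "$GITHUB_ENV"
```

Each `KEY=VALUE` line written to that file becomes an environment variable for later steps in the same job, which is how `PROJECT_DIR` and `MANIFEST_DIR` reach the workflows.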

@@ -0,0 +1,15 @@
name: Load environment variables
description: Configures environment variables for a workflow
inputs:
  env_var_file_path:
    description: |
      File path to variable file or directory.
      Defaults to ./.github/variables/* if none specified
      and runs against each file in that directory.
    required: false
    default: ./.github/variables/*
runs:
  using: composite
  steps:
    - run: sed "" ${{ inputs.env_var_file_path }} >> "$GITHUB_ENV"
      shell: bash
35 changes: 35 additions & 0 deletions .github/scripts/cleanup_dbt_resources.sh
@@ -0,0 +1,35 @@
#!/usr/bin/env bash
# Clean up dbt resources created by a CI run or by local development.
#
# Takes one argument representing the target environment to clean up,
# one of `dev` or `ci`. E.g.:
#
# ./cleanup_dbt_resources.sh dev
#
# Assumes that jq is installed and available on the caller's path.
set -euo pipefail

if [[ "$#" -eq 0 ]]; then
    echo "Missing first argument representing dbt target"
    exit 1
fi

if [[ "$1" == "prod" ]]; then
    echo "Target cannot be 'prod'"
    exit 1
fi

# List the schema of every model as JSON lines, one object per model
schemas_json=$(dbt --quiet list --resource-type model --target "$1" \
    --output json --output-keys schema) || { echo "Error in dbt call"; exit 1; }
# Deduplicate the JSON lines, then extract the schema names
schemas=$(echo "$schemas_json" | sort | uniq | jq ' .schema') || {
    echo "Error in schema parsing"
    exit 1
}

echo "Deleting the following schemas from Athena:"
echo
echo "$schemas"

# Exit code 255 tells xargs to stop at the first failed deletion
echo "$schemas" | xargs -I {} bash -c 'aws glue delete-database --name {} || exit 255'

echo
echo "Done!"
3 changes: 3 additions & 0 deletions .github/variables/dbt.env
@@ -0,0 +1,3 @@
CACHE_NAME=dbt-cache
MANIFEST_DIR=dbt/target
PROJECT_DIR=dbt
122 changes: 122 additions & 0 deletions .github/workflows/build_and_test_dbt.yaml
Contributor Author

This workflow represents the build and test that should run on every PR and push to our main branch.

@@ -0,0 +1,122 @@
name: Build and test dbt

on:
  pull_request:
    branches: [master, data-catalog]
Contributor Author

There are a number of references to data-catalog here that cover for the fact that we don't expect this workflow to be pulled into master immediately, and that we want to treat data-catalog as a CI environment until we're ready to merge data-catalog into master (#55). Part of that merge issue will be stripping out the references to data-catalog in these workflows and ensuring that they are ready to be run against master.

    # Specifying event types manually allows us to run this flow when the
    # PR is closed so that we can clean up staging dbt resources
    types:
      - opened
      - synchronize
      - closed
      - reopened
Contributor Author

Note that these event types represent the default event types plus closed. Docs:

By default, a workflow only runs when a pull_request event's activity type is opened, synchronize, or reopened.
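
To make the effect of the extra `closed` type concrete, here is a minimal sketch of which parts of this workflow run for a given event. The `plan_for_event` helper and its output labels are hypothetical, and the sketch assumes the event already passed the branch filter.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper mirroring the workflow's trigger list and the
# condition on its cleanup steps.
# Usage: plan_for_event <event-name> [action]
plan_for_event() {
  local event=$1 action=${2:-}
  if [[ $event != 'pull_request' ]]; then
    # Pushes to the listed branches run build and test, never cleanup
    echo "build-and-test"
    return 0
  fi
  case $action in
    opened|synchronize|reopened) echo "build-and-test" ;;
    # The build and test steps are unconditional, so they also run on close
    closed) echo "build-and-test cleanup" ;;
    # Any other activity type is not listed under `types`
    *) echo "skipped" ;;
  esac
}

plan_for_event pull_request synchronize
plan_for_event pull_request closed
```

Note that under these assumptions a closed PR still runs the build and test steps before cleanup, since only the cleanup steps carry an `if:` condition.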

  push:
    branches: [master, data-catalog]

jobs:
  build-and-test-dbt:
    runs-on: ubuntu-latest
    # These permissions are needed to interact with GitHub's OIDC Token endpoint
    # so that we can authenticate with AWS
    permissions:
      id-token: write
      contents: read
Comment on lines +12 to +16
Contributor Author

More details on these permissions settings here.

    steps:
      - name: Checkout
        uses: actions/checkout@v3

      - name: Install dbt requirements
        uses: ./.github/actions/install_dbt_requirements

      - name: Load environment variables
        uses: ./.github/actions/load_environment_variables

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: ${{ secrets.AWS_IAM_ROLE_TO_ASSUME_ARN }}
          aws-region: us-east-1

      - name: Set environment for branch
        run: |
          if [[ $GITHUB_REF_NAME == 'master' ]]; then
            echo "On master branch"
            {
              echo "TARGET=prod";
              echo "CACHE_KEY=master";
            } >> "$GITHUB_ENV"
          elif [[ $GITHUB_REF_NAME == 'data-catalog' ]]; then
            echo "On data catalog branch"
            {
              echo "TARGET=ci";
              echo "CACHE_KEY=data-catalog";
              echo "GITHUB_HEAD_REF=data-catalog";
            } >> "$GITHUB_ENV"
          else
            echo "On pull request branch"
            {
              echo "TARGET=ci";
              echo "CACHE_KEY=$GITHUB_HEAD_REF";
            } >> "$GITHUB_ENV"
          fi
        shell: bash

      - name: Cache dbt manifest
        id: cache
        uses: actions/cache@v3
        with:
          path: ${{ env.MANIFEST_DIR }}
          key: ${{ env.CACHE_NAME }}-${{ env.CACHE_KEY }}
          restore-keys: |
            ${{ env.CACHE_NAME }}-data-catalog
            ${{ env.CACHE_NAME }}-master
Comment on lines +42 to +44
Contributor Author

The restore-keys setting here means that we will fall back to the data-catalog or master branch caches if the PR branch cache does not exist (which we only ever expect to happen on the first pushes to a PR, before the PR has had a successful run).
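
The fallback order described above can be sketched as a small lookup: try the exact key first, then each restore key in turn as a prefix. The `resolve_cache` helper and the `dbt-cache-my-feature` key are hypothetical, and the sketch ignores that the real `actions/cache` picks the most recently created cache among several prefix matches.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper that imitates the lookup order of key and restore-keys.
# Usage: resolve_cache <primary-key> <restore-keys-csv> [existing-keys...]
# Prints "<matched-key> exact", "<matched-key> fallback", or "none miss".
resolve_cache() {
  local primary=$1 restore_csv=$2
  shift 2
  local existing prefix
  # 1. An exact match on the primary key always wins
  for existing in "$@"; do
    if [[ $existing == "$primary" ]]; then
      echo "$existing exact"
      return 0
    fi
  done
  # 2. Otherwise try each restore key, in order, as a prefix
  local -a restore_keys
  IFS=',' read -r -a restore_keys <<< "$restore_csv"
  for prefix in "${restore_keys[@]}"; do
    for existing in "$@"; do
      if [[ $existing == "$prefix"* ]]; then
        echo "$existing fallback"
        return 0
      fi
    done
  done
  echo "none miss"
}

restore="dbt-cache-data-catalog,dbt-cache-master"
# First push to a PR branch: only the long-lived branch caches exist
resolve_cache "dbt-cache-my-feature" "$restore" dbt-cache-master dbt-cache-data-catalog
# Later push: the PR branch now has its own cache
resolve_cache "dbt-cache-my-feature" "$restore" dbt-cache-master dbt-cache-my-feature
```

One detail worth checking against the `actions/cache` documentation: as far as it describes, the `cache-hit` output is `'true'` only for an exact match on `key`, so a restore-keys fallback would restore the files but still report no cache hit.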


      - if: ${{ steps.cache.outputs.cache-hit == 'true' }}
        name: Set command args to build/test modified resources
        run: echo "MODIFIED_RESOURCES_ONLY=true" >> "$GITHUB_ENV"
        shell: bash

      - if: ${{ steps.cache.outputs.cache-hit != 'true' }}
        name: Set command args to build/test all resources
        run: echo "MODIFIED_RESOURCES_ONLY=false" >> "$GITHUB_ENV"
        shell: bash

      - name: Test dbt macros
        run: dbt run-operation test_all
        working-directory: ${{ env.PROJECT_DIR }}
        shell: bash

      - name: Build models
        run: |
          if [[ $MODIFIED_RESOURCES_ONLY == 'true' ]]; then
            echo "Running build on modified resources only"
            dbt run --target "$TARGET" -s state:modified --defer --state target/
Contributor Author

A couple of things are going on here:

  1. --target is ensuring that we're running against the correct environment (CI or prod)
  2. -s state:modified is only selecting resources that have changed since the last run
  3. --defer is instructing dbt to reuse resources that are marked as created in the state file, even if they don't exist in the current environment
    1. For example, if we are on a pull request branch where we have restored the cache from the data-catalog branch, dbt will reuse resources created in the data-catalog environment without needing to recreate them in the PR environment (docs)
  4. --state instructs dbt on where to load the state file, which we keep cached
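
The way these flags combine can be sketched without invoking dbt at all. The `dbt_run_command` helper below is hypothetical; it only prints the command line that the "Build models" step would choose for a given target and cache state.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print the `dbt run` command line the workflow
# would use, without running dbt.
# Usage: dbt_run_command <target> <modified-only: true|false> [state-dir]
dbt_run_command() {
  local target=$1 modified_only=$2 state_dir=${3:-target/}
  local -a cmd=(dbt run --target "$target")
  if [[ $modified_only == 'true' ]]; then
    # Select only changed resources, defer unchanged ones to the
    # environment recorded in the state file, and say where that file is
    cmd+=(-s state:modified --defer --state "$state_dir")
  fi
  echo "${cmd[*]}"
}

dbt_run_command ci true
dbt_run_command prod false
```

Building the arguments in an array keeps the two branches from drifting apart, though the workflow itself writes both commands out in full for readability.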

          else
            echo "Running build on all resources"
            dbt run --target "$TARGET"
          fi
        working-directory: ${{ env.PROJECT_DIR }}
        shell: bash

      - name: Test models
        run: |
          if [[ $MODIFIED_RESOURCES_ONLY == 'true' ]]; then
            echo "Running tests on modified resources only"
            dbt test --target "$TARGET" -s state:modified --state target/
          else
            echo "Running tests on all resources"
            dbt test --target "$TARGET"
          fi
        working-directory: ${{ env.PROJECT_DIR }}
        shell: bash

      - if: ${{ github.event_name == 'pull_request' && github.event.action == 'closed' }}
        name: Install requirements for cleaning up dbt resources
        run: sudo apt-get update && sudo apt-get install -y jq
        shell: bash

      - if: ${{ github.event_name == 'pull_request' && github.event.action == 'closed' }}
        name: Clean up dbt resources
        run: ../.github/scripts/cleanup_dbt_resources.sh ci
        working-directory: ${{ env.PROJECT_DIR }}
        shell: bash
32 changes: 32 additions & 0 deletions .github/workflows/test_dbt_models.yaml
Contributor Author

This workflow represents the tests that we will want to trigger from our nightly import process (ccao-data/service-sqoop-iasworld#1). It is not fully tested yet as part of this project, since manual workflow dispatch is not possible until the workflow has been pulled into the main branch (source). Instead, I tested the workflow by manually editing the on: key below to run the workflow on pushes to this pull request; test output here.

@@ -0,0 +1,32 @@
name: Test dbt models

on: workflow_dispatch

jobs:
  test-dbt-models:
    runs-on: ubuntu-latest
    # These permissions are needed to interact with GitHub's OIDC Token endpoint
    # so that we can authenticate with AWS
    permissions:
      id-token: write
      contents: read
    steps:
      - name: Checkout
        uses: actions/checkout@v3

      - name: Install dbt requirements
        uses: ./.github/actions/install_dbt_requirements

      - name: Load environment variables
        uses: ./.github/actions/load_environment_variables

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: ${{ secrets.AWS_IAM_ROLE_TO_ASSUME_ARN }}
          aws-region: us-east-1

      - name: Test models
        run: dbt test --target ci
        working-directory: ${{ env.PROJECT_DIR }}
        shell: bash
8 changes: 4 additions & 4 deletions dbt/models/default/schema.yml
@@ -31,15 +31,15 @@ models:
- pin
- year
config:
-error_if: ">280655"
+error_if: ">280659"
Contributor Author

I have confirmed by running these tests locally that these changes represent additional malformed data that has been added to our source data since this PR opened. The fact that this happened twice during the ~week that this PR was open is an indication to me that I may want to pivot soon to focusing on resolving these data problems, since they're going to become annoying quite quickly once we're running these tests on every PR.

Member

suggestion (non-blocking): Some of these are likely just under-specified tests, but I agree it's going to be a problem. Grab @wrridgeway this week for a review/debugging session of the current tests. He knows more than anyone but Mirella about the horrors of our changing data.

Contributor Author

Good call, I'll set aside some time for us tomorrow 👍🏻

Contributor Author

I scheduled some time tomorrow afternoon! In the meantime, I had to bump the thresholds yet again in 672a7d4 ☹️

# Unique by case number and year
- unique_combination_of_columns:
name: vw_pin_appeal_unique_by_case_number_and_year
combination_of_columns:
- year
- case_no
config:
-error_if: ">365779"
+error_if: ">365855"
# `change` should be an enum
- dbt_utils.expression_is_true:
name: vw_pin_appeal_no_unexpected_change_values
@@ -85,7 +85,7 @@ models:
case when char_renovation = '1' then true else false end
)
config:
-error_if: ">73925"
+error_if: ">73941"
# TODO: Characteristics columns should adhere to pre-determined criteria
- name: vw_pin_address_test
description: '{{ doc("vw_pin_address_test") }}'
@@ -111,7 +111,7 @@ models:
- mail_address_zipcode_1
- mail_address_zipcode_2
config:
-error_if: ">879261"
+error_if: ">880581"
# TODO: Mailing address changes after validated sale(?)
# TODO: Site addresses are all in Cook County
- name: vw_pin_condo_char_test