
Add dbt tests to model databases #686

Merged: 12 commits into master, Dec 27, 2024
Conversation

@Damonamajor (Contributor) commented Dec 20, 2024

This PR adds dbt tests that fail past a set number of errors. Three tests currently fail and should be fixable by deleting the duplicated data:

  • model_timing: relaxed-Tristan and stupefied-maya
  • model_parameter_final: clever-kyra
  • model_feature_importance: sad-tristan

The other tests are set to warn at a fixed threshold. If we upload duplicated data on these keys, the warnings will escalate to errors and we will know to debug further. The thresholds were set a few days ago and did not error during the final commit, so duplication does not appear to be a recurring issue.

The keys are updated with two new variables that should make the data unique.
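For reference, the pattern used throughout this PR looks roughly like the sketch below. The table is one of the failing ones named above, but the key columns and threshold values are illustrative, and `unique_combination_of_columns` is assumed to be the project's generic test as seen in the diffs further down:

```yaml
models:
  - name: model_timing
    data_tests:
      - unique_combination_of_columns:
          name: model_timing_unique
          combination_of_columns:  # illustrative key; actual columns may differ
            - year
            - run_id
          config:
            warn_if: ">0"      # surface any duplicate keys as a warning
            error_if: ">100"   # fail only past the known historical count
```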

@Damonamajor linked an issue on Dec 24, 2024 that may be closed by this pull request
@Damonamajor marked this pull request as ready for review on December 24, 2024 02:10
@Damonamajor requested a review from a team as a code owner on December 24, 2024 02:10
    tags:
      - load_auto

  - name: test_card
    description: '{{ doc("table_test_card") }}'
    data_tests:
      - unique_combination_of_columns:
          name: model_test_card_unique
@Damonamajor (Contributor, Author) commented Dec 24, 2024:

This one has a super high error count. Is there a key that's missing? It seems like it also needs to be unique by sale if there are multiple sales per parcel.

Member:
That's pretty bizarre. I don't think you're missing any key columns here. It's possible that in the past we had separate test set data frames for the linear vs lgbm model results, but I don't see any column that would separate them. IMO as long as more recent model runs aren't adding to the error count it's fine.
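One quick way to size up duplicates like this is a grouped count over the candidate key. A sketch in Athena-style SQL; the qualified table name is an assumption, and the key columns are the ones documented for the test set below:

```sql
-- Each group with n > 1 is a duplicate key that the uniqueness test
-- counts toward its failure threshold.
SELECT year, run_id, meta_pin, meta_card_num, COUNT(*) AS n
FROM model.test_card  -- assumed table location
GROUP BY year, run_id, meta_pin, meta_card_num
HAVING COUNT(*) > 1
ORDER BY n DESC;
```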

@dfsnow (Member) left a comment:
@Damonamajor This looks great. Some super minor changes to make and then this is good to merge.

@@ -34,7 +34,7 @@ Overall feature importance by model run (`run_id`).
Includes metrics such as gain, cover, and frequency. This is the output
of the built-in LightGBM/XGBoost feature importance methods.

-**Primary Key**: `year`, `run_id`, `model_predictor_name_all`
+**Primary Key**: `year`, `run_id`, `model_predictor_all_name`
Member:
Ah, thank you for updating these docs.

@@ -188,4 +188,4 @@ View to compile PIN-level model inputs shared between the residential
(`model.vw_card_res_input`) and condo (`model.vw_pin_condo_input`) model views.

**Primary Key**: `year`, `run_id`, `meta_pin`
Member:
suggestion: None of the `model.vw_*` views actually have `run_id` as a component of their primary key. Would be nice to remove it for those views.

            - meta_year
            - run_id
          config:
            error_if: ">5748"
Member:
suggestion: It would be handy to have a comment here about why we're setting this threshold, i.e., there are past errors that we don't intend to fix, but we want to prevent future dupes.
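Something like the sketch below (the comment wording is illustrative):

```yaml
          config:
            # There are 5,748 known duplicate rows from past runs that we
            # don't intend to fix; error only if new duplicates appear.
            error_if: ">5748"
```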

dbt/models/model/schema.yml (outdated comment, resolved)
    data_tests:
      - unique_combination_of_columns:
          name: model_parameter_search_unique_by_year_run_id_and_iteration
          combination_of_columns:
Member:
issue: I believe the full primary key here is `year`, `run_id`, `iteration`, `configuration`, `fold_id`. Let's update this test and the docs to reflect that.
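That is, something like this sketch (columns taken from the comment above; the shortened test name is hypothetical):

```yaml
      - unique_combination_of_columns:
          name: model_parameter_search_unique  # hypothetical updated name
          combination_of_columns:
            - year
            - run_id
            - iteration
            - configuration
            - fold_id
```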

          config:
            error_if: ">5748"
          meta:
            description: assessment card should be unique by pin, card, year, and run_id
Member:
suggestion: I don't think we actually need a description for these tests, since descriptions are mainly used as input for constructing Excel QC workbooks. These tests run exclusively in GitHub Actions (i.e., they are data tests only), so the description isn't needed.

@Damonamajor requested a review from @dfsnow on December 24, 2024 19:59
@dfsnow (Member) left a comment:
@Damonamajor Let's add failure thresholds to the last remaining failing tests, then have @jeancochrane take a look at this. After that it should be set to go.

@@ -138,7 +138,7 @@ The test set is the out-of-sample data used to evaluate model performance.
Predictions in this table are trained using only data _not in this set
of sales_.

-**Primary Key**: `year`, `run_id`, `meta_pin`, `meta_card_num`
+**Primary Key**: `year`, `run_id`, `meta_pin`, `meta_card_num`, `document_number`
Member:
nitpick (non-blocking): There's not (yet) a doc number column in this dataset.

@Damonamajor (Contributor, Author):
There is an existing variable for this: `meta_sale_document_num`. Do we want to use it?
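If adopted, the documented key would presumably become **Primary Key**: `year`, `run_id`, `meta_pin`, `meta_card_num`, `meta_sale_document_num`.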

@Damonamajor (Contributor, Author) commented Dec 26, 2024:

> @Damonamajor Let's add failure thresholds to the last remaining failing tests, then have @jeancochrane take a look at this. After that it should be set to go.

@dfsnow The remaining failing tests should be easy to fix (using the notes in the original comment). Do we want to fix them, or add the thresholds?

@dfsnow merged commit 56ddbd5 into master on Dec 27, 2024
7 checks passed
@dfsnow deleted the Add-DBT-tests-to-model-databases branch on December 27, 2024 20:54
Successfully merging this pull request may close: Add dbt data integrity tests to the model database tables