-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add dbt tests to model databases #686
Conversation
dbt/models/model/schema.yml
Outdated
tags: | ||
- load_auto | ||
|
||
- name: test_card | ||
description: '{{ doc("table_test_card") }}' | ||
data_tests: | ||
- unique_combination_of_columns: | ||
name: model_test_card_unique |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one has a super high error level. Is there a key that is missing? It seems like it also needs to be unique on sale if there are multiple sales per parcel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's pretty bizarre. I don't think you're missing any key columns here. It's possible that in the past we had separate test set data frames for the linear vs lgbm model results, but I don't see any column that would separate them. IMO as long as more recent model runs aren't adding to the error count it's fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Damonamajor This looks great. Some super minor changes to make and then this is good to merge.
@@ -34,7 +34,7 @@ Overall feature importance by model run (`run_id`). | |||
Includes metrics such as gain, cover, and frequency. This is the output | |||
of the built-in LightGBM/XGBoost feature importance methods. | |||
|
|||
**Primary Key**: `year`, `run_id`, `model_predictor_name_all` | |||
**Primary Key**: `year`, `run_id`, `model_predictor_all_name` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, thank you for updating these docs.
dbt/models/model/docs.md
Outdated
@@ -188,4 +188,4 @@ View to compile PIN-level model inputs shared between the residential | |||
(`model.vw_card_res_input`) and condo (`model.vw_pin_condo_input`) model views. | |||
|
|||
**Primary Key**: `year`, `run_id`, `meta_pin` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: None of the model.vw_
views actually have run_id
as a component of their primary key. Would be nice to remove it for those views.
dbt/models/model/schema.yml
Outdated
- meta_year | ||
- run_id | ||
config: | ||
error_if: ">5748" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: It would be handy to have a comment here about why we're setting this threshold i.e. their are past errors that we don't intend to fix but we want to prevent future dupes.
data_tests: | ||
- unique_combination_of_columns: | ||
name: model_parameter_search_unique_by_year_run_id_and_iteration | ||
combination_of_columns: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: I believe the full primary key here is year
, run_id
, iteration
, configuration
, fold_id
. Let's update this test and the docs to reflect that.
dbt/models/model/schema.yml
Outdated
config: | ||
error_if: ">5748" | ||
meta: | ||
description: assessment card should be unique by pin, card, year, and run_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: I don't think we actually need a description for these tests, since that's mainly used as an input for constructing Excel QC workbooks. Since these tests are exclusive to GitHub Actions (i.e. are only data tests), the description isn't needed.
dbt/models/model/schema.yml
Outdated
tags: | ||
- load_auto | ||
|
||
- name: test_card | ||
description: '{{ doc("table_test_card") }}' | ||
data_tests: | ||
- unique_combination_of_columns: | ||
name: model_test_card_unique |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's pretty bizarre. I don't think you're missing any key columns here. It's possible that in the past we had separate test set data frames for the linear vs lgbm model results, but I don't see any column that would separate them. IMO as long as more recent model runs aren't adding to the error count it's fine.
Co-authored-by: Dan Snow <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Damonamajor Let's add failure thresholds to the last remaining failing tests, then have @jeancochrane take a look at this. After that it should be set to go.
dbt/models/model/docs.md
Outdated
@@ -138,7 +138,7 @@ The test set is the out-of-sample data used to evaluate model performance. | |||
Predictions in this table are trained using only data _not in this set | |||
of sales_. | |||
|
|||
**Primary Key**: `year`, `run_id`, `meta_pin`, `meta_card_num` | |||
**Primary Key**: `year`, `run_id`, `meta_pin`, `meta_card_num`, `document_number` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick (non-blocking): There's not (yet) a doc number column in this dataset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a variable do we want it?
meta_sale_document_num
@dfsnow |
This PR provides DBT tests which fail at a set number of errors. There are three tests which currently fail which should be able to be fixed by deleting the duplicated data.
The other tests are set to warnings with a set value. If we upload duplicated data based on the keys, warnings will transition to error messages, and we will know to further debug. These were set a few days ago, and didn't error during the final commit, so it doesn't seem to be a recurring issue.
Keys are updated with two new variables which should cause the data to be unique.