Add dbt tests to model databases #686
Changes from 6 commits
e53df77
2b38351
2ad3de7
05b530d
ec4ce57
38d4fb2
ffb2697
eda20d4
bcf8edf
bb4823d
c18cc98
812c80a
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -34,7 +34,7 @@ Overall feature importance by model run (`run_id`). | |
Includes metrics such as gain, cover, and frequency. This is the output | ||
of the built-in LightGBM/XGBoost feature importance methods. | ||
|
||
**Primary Key**: `year`, `run_id`, `model_predictor_name_all` | ||
**Primary Key**: `year`, `run_id`, `model_predictor_all_name` | ||
{% enddocs %} | ||
|
||
# final_model | ||
|
@@ -77,7 +77,7 @@ If hyperparameters are blank for a given run, then that parameter was not used. | |
Range of hyperparameters searched by a given model run (`run_id`) | ||
during cross-validation. | ||
|
||
**Primary Key**: `year`, `run_id` | ||
**Primary Key**: `year`, `run_id`, `parameter_name` | ||
{% enddocs %} | ||
|
||
# parameter_search | ||
|
@@ -113,7 +113,7 @@ The stages are: | |
Identical to `model.performance`, but additionally broken out by quantile. | ||
|
||
**Primary Key**: `year`, `run_id`, `stage`, `triad_code`, `geography_type`, | ||
`geography_id`, `by_class`, `quantile` | ||
`geography_id`, `by_class`, `num_quantile`, `quantile` | ||
{% enddocs %} | ||
|
||
# shap | ||
|
@@ -188,4 +188,4 @@ View to compile PIN-level model inputs shared between the residential | |
(`model.vw_card_res_input`) and condo (`model.vw_pin_condo_input`) model views. | ||
|
||
**Primary Key**: `year`, `run_id`, `meta_pin` | ||
suggestion: None of the |
||
{% enddocs %} | ||
{% enddocs %} |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -4,61 +4,192 @@ sources: | |
tables: | ||
- name: assessment_card | ||
description: '{{ doc("table_assessment_card") }}' | ||
data_tests: | ||
- unique_combination_of_columns: | ||
name: model_assessment_card_unique_by_pin_card_and_year | ||
combination_of_columns: | ||
- meta_pin | ||
- meta_card_num | ||
- meta_year | ||
- run_id | ||
config: | ||
error_if: ">5748" | ||
suggestion: It would be handy to have a comment here about why we're setting this threshold, i.e. there are past errors that we don't intend to fix, but we want to prevent future dupes. |
||
meta: | ||
description: assessment card should be unique by pin, card, year, and run_id | ||
suggestion: I don't think we actually need a description for these tests, since that's mainly used as an input for constructing Excel QC workbooks. Since these tests are exclusive to GitHub Actions (i.e. are only data tests), the description isn't needed. |
||
tags: | ||
- load_auto | ||
|
||
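A minimal sketch of how the two suggestions on `assessment_card` could be applied together, assuming the threshold exists to tolerate known historical duplicates. The test name, columns, and threshold are copied from the diff above; the wording of the inline comment is an assumption, and the `meta.description` key is simply left out.

```yaml
# Sketch only: mirrors the assessment_card test from this diff.
- name: assessment_card
  description: '{{ doc("table_assessment_card") }}'
  data_tests:
    - unique_combination_of_columns:
        name: model_assessment_card_unique_by_pin_card_and_year
        combination_of_columns:
          - meta_pin
          - meta_card_num
          - meta_year
          - run_id
        config:
          # Assumed rationale, per the review comment: past runs contain
          # duplicates that we don't intend to fix, so the test only errors
          # when the duplicate count rises above this known baseline.
          error_if: ">5748"
        # No meta.description here: per the review, it is mainly used to
        # build Excel QC workbooks, and this test only runs as a data test.
```

The same comment pattern would apply to the other tests in this file that set `error_if`.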
- name: assessment_pin | ||
description: '{{ doc("table_assessment_pin") }}' | ||
data_tests: | ||
- unique_combination_of_columns: | ||
name: model_assessment_pin_unique_by_pin_year_and_run_id | ||
combination_of_columns: | ||
- meta_pin | ||
- meta_year | ||
- run_id | ||
config: | ||
error_if: ">2016" | ||
meta: | ||
description: assessment pin should be unique by pin, year, and run_id | ||
tags: | ||
- load_auto | ||
|
||
- name: feature_importance | ||
data_tests: | ||
- unique_combination_of_columns: | ||
name: model_feature_importance_unique | ||
combination_of_columns: | ||
- year | ||
- run_id | ||
- model_predictor_all_name | ||
meta: | ||
description: feature importance should be unique by year, run_id, and model_predictor_all_name | ||
description: '{{ doc("table_feature_importance") }}' | ||
tags: | ||
- load_auto | ||
|
||
- name: metadata | ||
data_tests: | ||
- unique_combination_of_columns: | ||
name: model_metadata_unique_by_year_and_run_id | ||
combination_of_columns: | ||
- year | ||
- run_id | ||
meta: | ||
description: metadata should be unique by year and run_id | ||
description: '{{ doc("table_metadata") }}' | ||
tags: | ||
- load_auto | ||
|
||
- name: parameter_final | ||
description: '{{ doc("table_parameter_final") }}' | ||
data_tests: | ||
- unique_combination_of_columns: | ||
name: model_parameter_final_unique_by_year_and_run_id | ||
combination_of_columns: | ||
- year | ||
- run_id | ||
meta: | ||
description: parameter final should be unique by year and run_id | ||
tags: | ||
- load_auto | ||
|
||
- name: parameter_range | ||
description: '{{ doc("table_parameter_range") }}' | ||
data_tests: | ||
- unique_combination_of_columns: | ||
name: model_parameter_range_unique_by_year_run_id_and_parameter_name | ||
combination_of_columns: | ||
- year | ||
- run_id | ||
- parameter_name | ||
meta: | ||
description: parameter range should be unique by year, run_id, and parameter_name | |
Damonamajor marked this conversation as resolved.
|
||
tags: | ||
- load_auto | ||
|
||
- name: parameter_search | ||
description: '{{ doc("table_parameter_search") }}' | ||
data_tests: | ||
- unique_combination_of_columns: | ||
name: model_parameter_search_unique_by_year_run_id_and_iteration | ||
combination_of_columns: | ||
issue: I believe the full primary key here is |
||
- year | ||
- run_id | ||
- iteration | ||
config: | ||
error_if: ">2136" | ||
meta: | ||
description: parameter search should be unique by year, run_id, and iteration | ||
tags: | ||
- load_auto | ||
|
||
- name: performance | ||
description: '{{ doc("table_performance") }}' | ||
data_tests: | ||
- unique_combination_of_columns: | ||
name: model_performance_unique | ||
combination_of_columns: | ||
- year | ||
- run_id | ||
- stage | ||
- triad_code | ||
- geography_type | ||
- geography_id | ||
- class | ||
meta: | ||
description: performance should be unique by year, run_id, stage, triad_code, geography_type, geography_id, and class | ||
tags: | ||
- load_auto | ||
|
||
- name: performance_quantile | ||
description: '{{ doc("table_performance_quantile") }}' | ||
data_tests: | ||
- unique_combination_of_columns: | ||
name: model_performance_quantile_unique | ||
combination_of_columns: | ||
- year | ||
- run_id | ||
- triad_code | ||
- stage | ||
- geography_type | ||
- geography_id | ||
- class | ||
- num_quantile | ||
- quantile | ||
meta: | ||
description: > | ||
performance quantile should be unique by year, run_id, stage, triad_code, | ||
geography_type, class, geography_id, num_quantile, and quantile | ||
tags: | ||
- load_auto | ||
|
||
- name: shap | ||
description: '{{ doc("table_shap") }}' | ||
data_tests: | ||
- unique_combination_of_columns: | ||
name: model_shap_unique_by_year_run_id_meta_pin_meta_and_card_num | ||
combination_of_columns: | ||
- year | ||
- run_id | ||
- meta_pin | ||
- meta_card_num | ||
config: | ||
error_if: ">524" | ||
meta: | ||
description: shap should be unique by year, run_id, meta_pin, and meta_card_num | ||
tags: | ||
- load_auto | ||
|
||
- name: test_card | ||
description: '{{ doc("table_test_card") }}' | ||
data_tests: | ||
- unique_combination_of_columns: | ||
name: model_test_card_unique | ||
This one has a super high error level. Is there a key that is missing? It seems like it also needs to be unique on sale if there are multiple sales per parcel.
That's pretty bizarre. I don't think you're missing any key columns here. It's possible that in the past we had separate test set data frames for the linear vs lgbm model results, but I don't see any column that would separate them. IMO as long as more recent model runs aren't adding to the error count it's fine. |
||
combination_of_columns: | ||
- year | ||
- run_id | ||
- meta_pin | ||
- meta_card_num | ||
- meta_sale_document_num | ||
config: | ||
error_if: ">102422" | ||
meta: | ||
description: test card should be unique by year, run_id, meta_pin, meta_card_num, and meta_sale_document_num | ||
tags: | ||
- load_auto | ||
|
||
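One possible way to check that recent model runs are not adding duplicates to `test_card`, sketched here as an alternative to a large `error_if` threshold. It relies on dbt's `where` test config to restrict the rows being tested. The test name and the cutoff year are hypothetical, and the filter assumes `year` is stored as a string; none of this is part of the PR.

```yaml
# Sketch only: a stricter variant of the test_card test, limited to recent runs.
- name: test_card
  data_tests:
    - unique_combination_of_columns:
        name: model_test_card_unique_recent_runs  # hypothetical test name
        combination_of_columns:
          - year
          - run_id
          - meta_pin
          - meta_card_num
          - meta_sale_document_num
        config:
          # Hypothetical cutoff; assumes `year` is a string column.
          where: "year >= '2024'"
```

With the filter in place, no `error_if` threshold should be needed, provided runs after the cutoff are in fact free of duplicates.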
- name: timing | ||
description: '{{ doc("table_timing") }}' | ||
data_tests: | ||
- unique_combination_of_columns: | ||
name: model_timing_unique_by_year_run_id | ||
combination_of_columns: | ||
- year | ||
- run_id | ||
meta: | ||
description: timing should be unique by year and run_id | ||
tags: | ||
- load_auto | ||
|
||
|
Ah, thank you for updating these docs.