Add dbt tests to model databases #686
@@ -34,7 +34,7 @@ Overall feature importance by model run (`run_id`).
 Includes metrics such as gain, cover, and frequency. This is the output
 of the built-in LightGBM/XGBoost feature importance methods.

-**Primary Key**: `year`, `run_id`, `model_predictor_name_all`
+**Primary Key**: `year`, `run_id`, `model_predictor_all_name`
 {% enddocs %}

 # final_model
@@ -77,7 +77,7 @@ If hyperparameters are blank for a given run, then that parameter was not used.
 Range of hyperparameters searched by a given model run (`run_id`)
 during cross-validation.

-**Primary Key**: `year`, `run_id`
+**Primary Key**: `year`, `run_id`, `parameter_name`
 {% enddocs %}

 # parameter_search
@@ -86,7 +86,7 @@ during cross-validation.
 Hyperparameters used for _every_ cross-validation iteration, along with
 the corresponding performance statistics.

-**Primary Key**: `year`, `run_id`, `iteration`
+**Primary Key**: `year`, `run_id`, `iteration`, `configuration`, `fold_id`
 {% enddocs %}

 # performance
@@ -113,7 +113,7 @@ The stages are:
 Identical to `model.performance`, but additionally broken out by quantile.

 **Primary Key**: `year`, `run_id`, `stage`, `triad_code`, `geography_type`,
-`geography_id`, `by_class`, `quantile`
+`geography_id`, `by_class`, `num_quantile`, `quantile`
 {% enddocs %}

 # shap
@@ -138,7 +138,7 @@ The test set is the out-of-sample data used to evaluate model performance.
 Predictions in this table are trained using only data _not in this set
 of sales_.

-**Primary Key**: `year`, `run_id`, `meta_pin`, `meta_card_num`
+**Primary Key**: `year`, `run_id`, `meta_pin`, `meta_card_num`, `document_number`
 {% enddocs %}

 # timing

Review comment (on the new primary key line): nitpick (non-blocking): There's not (yet) a doc number column in this dataset.
Reply: This is a variable; do we want it?
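The exchange above turns on whether this table actually has a document-number column. Not part of the diff, but a minimal hedged sketch of how to check: assuming this docs block describes the `model.test_card` table (its key columns match the `test_card` tests below) and that the tables are queried through Athena/Trino, the catalog can be inspected directly.

```sql
-- Hedged sketch: list any document-number-style columns on model.test_card.
-- Assumes an Athena/Trino information_schema is available; the schema name
-- "model" is taken from the table references in these docs.
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'model'
  AND table_name = 'test_card'
  AND column_name LIKE '%document%';
```

If this returns only `meta_sale_document_num` (the column the schema tests below use) and no `document_number`, the nitpick stands and the docs line should reference the existing column instead.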
@@ -165,7 +165,7 @@ data cached by DVC when possible. See
 [model-res-avm#getting-data](https://github.com/ccao-data/model-res-avm#getting-data)
 for more information.

-**Primary Key**: `year`, `run_id`, `meta_pin`, `meta_card_num`
+**Primary Key**: `year`, `meta_pin`, `meta_card_num`
 {% enddocs %}

 # vw_pin_condo_input
@@ -178,7 +178,7 @@ Observations are at the PIN-14 (condo unit) level. Unlike the residential
 input view, this view does not perform filling. Instead condo characteristics
 are backfilled in `default.vw_pin_condo_char`.

-**Primary Key**: `year`, `run_id`, `meta_pin`
+**Primary Key**: `year`, `meta_pin`
 {% enddocs %}

 # vw_pin_shared_input
@@ -187,5 +187,5 @@ are backfilled in `default.vw_pin_condo_char`.
 View to compile PIN-level model inputs shared between the residential
 (`model.vw_card_res_input`) and condo (`model.vw_pin_condo_input`) model views.

-**Primary Key**: `year`, `run_id`, `meta_pin`
-{% enddocs %}
+**Primary Key**: `year`, `meta_pin`
+{% enddocs %}
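Each of the corrected primary keys above corresponds to a uniqueness guarantee that can be checked directly. A minimal sketch, assuming Athena-style SQL and the `model.feature_importance` table and column names documented above; the same pattern applies to every other key corrected in this file.

```sql
-- Hedged sketch: the documented primary key (year, run_id,
-- model_predictor_all_name) should identify each row uniquely,
-- so this query should return zero rows.
SELECT
    year,
    run_id,
    model_predictor_all_name,
    COUNT(*) AS num_rows
FROM model.feature_importance
GROUP BY year, run_id, model_predictor_all_name
HAVING COUNT(*) > 1;
```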
@@ -4,61 +4,171 @@ sources:
     tables:
       - name: assessment_card
         description: '{{ doc("table_assessment_card") }}'
+        data_tests:
+          - unique_combination_of_columns:
+              name: model_assessment_card_unique_by_pin_card_and_year
+              combination_of_columns:
+                - meta_pin
+                - meta_card_num
+                - meta_year
+                - run_id
+              config:
+                # We add a fixed level of errors since duplicated data existed before
+                # these tests were implemented. If duplicated data is added after 12/24/2024,
+                # warnings will transition to errors.
+                error_if: ">5748"
         tags:
           - load_auto

Review comment (on `error_if: ">5748"`): suggestion: It would be handy to have a comment here about why we're setting this threshold, i.e. there are past errors that we don't intend to fix, but we want to prevent future dupes.
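The `error_if: ">5748"` threshold pins the test to the number of pre-existing failures, per the reviewer's suggestion above. A hedged sketch of roughly how that baseline could be measured, assuming Athena-style SQL; this approximates how a dbt_utils-style `unique_combination_of_columns` test counts one failure per duplicated key combination, not the exact compiled SQL.

```sql
-- Hedged sketch: count key combinations that appear more than once in
-- model.assessment_card. A count at or below the configured threshold leaves
-- the test as a warning; new duplicates push it past ">5748" and turn the
-- warning into an error, as described in the comment above.
SELECT COUNT(*) AS num_duplicate_keys
FROM (
    SELECT meta_pin, meta_card_num, meta_year, run_id
    FROM model.assessment_card
    GROUP BY meta_pin, meta_card_num, meta_year, run_id
    HAVING COUNT(*) > 1
) AS dupes;
```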
       - name: assessment_pin
         description: '{{ doc("table_assessment_pin") }}'
+        data_tests:
+          - unique_combination_of_columns:
+              name: model_assessment_pin_unique_by_pin_year_and_run_id
+              combination_of_columns:
+                - meta_pin
+                - meta_year
+                - run_id
+              config:
+                error_if: ">2016"
         tags:
           - load_auto

       - name: feature_importance
+        data_tests:
+          - unique_combination_of_columns:
+              name: model_feature_importance_unique
+              combination_of_columns:
+                - year
+                - run_id
+                - model_predictor_all_name
         description: '{{ doc("table_feature_importance") }}'
         tags:
           - load_auto

       - name: metadata
+        data_tests:
+          - unique_combination_of_columns:
+              name: model_metadata_unique_by_year_and_run_id
+              combination_of_columns:
+                - year
+                - run_id
         description: '{{ doc("table_metadata") }}'
         tags:
           - load_auto

       - name: parameter_final
         description: '{{ doc("table_parameter_final") }}'
+        data_tests:
+          - unique_combination_of_columns:
+              name: model_parameter_final_unique_by_year_and_run_id
+              combination_of_columns:
+                - year
+                - run_id
         tags:
           - load_auto

       - name: parameter_range
         description: '{{ doc("table_parameter_range") }}'
+        data_tests:
+          - unique_combination_of_columns:
+              name: model_parameter_range_unique_by_year_run_id_and_parameter_name
+              combination_of_columns:
+                - year
+                - run_id
+                - parameter_name
         tags:
           - load_auto

       - name: parameter_search
         description: '{{ doc("table_parameter_search") }}'
+        data_tests:
+          - unique_combination_of_columns:
+              name: model_parameter_search_unique
+              combination_of_columns:
+                - year
+                - run_id
+                - iteration
+                - configuration
+                - fold_id
+              config:
+                error_if: ">400"
         tags:
           - load_auto

Review comment (on `combination_of_columns`): issue: I believe the full primary key here is
       - name: performance
         description: '{{ doc("table_performance") }}'
+        data_tests:
+          - unique_combination_of_columns:
+              name: model_performance_unique
+              combination_of_columns:
+                - year
+                - run_id
+                - stage
+                - triad_code
+                - geography_type
+                - geography_id
+                - class
         tags:
           - load_auto

       - name: performance_quantile
         description: '{{ doc("table_performance_quantile") }}'
+        data_tests:
+          - unique_combination_of_columns:
+              name: model_performance_quantile_unique
+              combination_of_columns:
+                - year
+                - run_id
+                - triad_code
+                - stage
+                - geography_type
+                - geography_id
+                - class
+                - num_quantile
+                - quantile
         tags:
           - load_auto

       - name: shap
         description: '{{ doc("table_shap") }}'
+        data_tests:
+          - unique_combination_of_columns:
+              name: model_shap_unique_by_year_run_id_meta_pin_meta_and_card_num
+              combination_of_columns:
+                - year
+                - run_id
+                - meta_pin
+                - meta_card_num
+              config:
+                error_if: ">524"
         tags:
           - load_auto

       - name: test_card
         description: '{{ doc("table_test_card") }}'
+        data_tests:
+          - unique_combination_of_columns:
+              name: model_test_card_unique
+              combination_of_columns:
+                - year
+                - run_id
+                - meta_pin
+                - meta_card_num
+                - meta_sale_document_num
+              config:
+                error_if: ">102422"
         tags:
           - load_auto

Review comment (on `model_test_card_unique`): This one has a super high error level. Is there a key that is missing? It seems like it also needs to be unique on sale if there are multiple sales per parcel.
Reply: That's pretty bizarre. I don't think you're missing any key columns here. It's possible that in the past we had separate test set data frames for the linear vs. lgbm model results, but I don't see any column that would separate them. IMO as long as more recent model runs aren't adding to the error count it's fine.
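Following the exchange above, one hedged way to check whether recent runs are adding to the `test_card` error count is to break the duplicated keys out by `run_id`. A minimal sketch, assuming Athena-style SQL, the `model.test_card` table, and the key columns used in the test above; it also assumes run IDs sort roughly chronologically.

```sql
-- Hedged sketch: count duplicated test_card key combinations per run_id.
-- If recent run_ids contribute zero rows here, the high error_if baseline
-- reflects only historical duplicates, as the reviewer suggests.
SELECT run_id, COUNT(*) AS num_duplicate_keys
FROM (
    SELECT year, run_id, meta_pin, meta_card_num, meta_sale_document_num
    FROM model.test_card
    GROUP BY year, run_id, meta_pin, meta_card_num, meta_sale_document_num
    HAVING COUNT(*) > 1
) AS dupes
GROUP BY run_id
ORDER BY run_id DESC;
```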
       - name: timing
         description: '{{ doc("table_timing") }}'
+        data_tests:
+          - unique_combination_of_columns:
+              name: model_timing_unique_by_year_run_id
+              combination_of_columns:
+                - year
+                - run_id
         tags:
           - load_auto
Review comment: Ah, thank you for updating these docs.