Add dbt tests to model databases #686

Merged 12 commits on Dec 27, 2024
8 changes: 4 additions & 4 deletions dbt/models/model/docs.md
@@ -34,7 +34,7 @@ Overall feature importance by model run (`run_id`).
Includes metrics such as gain, cover, and frequency. This is the output
of the built-in LightGBM/XGBoost feature importance methods.

-**Primary Key**: `year`, `run_id`, `model_predictor_name_all`
+**Primary Key**: `year`, `run_id`, `model_predictor_all_name`
Member

Ah, thank you for updating these docs.

{% enddocs %}

# final_model
@@ -77,7 +77,7 @@ If hyperparameters are blank for a given run, then that parameter was not used.
Range of hyperparameters searched by a given model run (`run_id`)
during cross-validation.

-**Primary Key**: `year`, `run_id`
+**Primary Key**: `year`, `run_id`, `parameter_name`
{% enddocs %}

# parameter_search
@@ -113,7 +113,7 @@ The stages are:
Identical to `model.performance`, but additionally broken out by quantile.

**Primary Key**: `year`, `run_id`, `stage`, `triad_code`, `geography_type`,
-`geography_id`, `by_class`, `quantile`
+`geography_id`, `by_class`, `num_quantile`, `quantile`
{% enddocs %}

# shap
@@ -188,4 +188,4 @@ View to compile PIN-level model inputs shared between the residential
(`model.vw_card_res_input`) and condo (`model.vw_pin_condo_input`) model views.

**Primary Key**: `year`, `run_id`, `meta_pin`
Member

suggestion: None of the `model.vw_` views actually have `run_id` as a component of their primary key. It would be nice to remove it from the docs for those views.

-{% enddocs %}
+{% enddocs %}
133 changes: 132 additions & 1 deletion dbt/models/model/schema.yml
@@ -4,61 +4,192 @@ sources:
    tables:
      - name: assessment_card
        description: '{{ doc("table_assessment_card") }}'
        data_tests:
          - unique_combination_of_columns:
              name: model_assessment_card_unique_by_pin_card_and_year
              combination_of_columns:
                - meta_pin
                - meta_card_num
                - meta_year
                - run_id
              config:
                error_if: ">5748"
Member

suggestion: It would be handy to have a comment here about why we're setting this threshold, i.e. there are past errors that we don't intend to fix, but we want to prevent future dupes.
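
A minimal sketch of how such a comment might read. The comment wording and the source name `model` are assumptions for illustration, not taken from the PR; the test itself is copied from the diff.

```yaml
sources:
  - name: model  # assumed source name, not shown in this hunk
    tables:
      - name: assessment_card
        data_tests:
          - unique_combination_of_columns:
              name: model_assessment_card_unique_by_pin_card_and_year
              combination_of_columns:
                - meta_pin
                - meta_card_num
                - meta_year
                - run_id
              config:
                # Hypothetical wording: older model runs contain duplicate
                # rows that we do not intend to fix. The threshold tolerates
                # those known failures while still erroring if new duplicates
                # push the failure count above it.
                error_if: ">5748"
              meta:
                description: assessment card should be unique by pin, card, year, and run_id
```

Keeping the rationale next to `error_if` means a future reader does not have to trace the number back to this PR.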

              meta:
                description: assessment card should be unique by pin, card, year, and run_id
Member

suggestion: I don't think we actually need a description for these tests, since that's mainly used as an input for constructing Excel QC workbooks. Since these tests are exclusive to GitHub Actions (i.e. are only data tests), the description isn't needed.
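
For illustration, the same test with the `meta.description` block dropped might look like the sketch below. The source name `model` is an assumption; everything else is taken from the diff.

```yaml
sources:
  - name: model  # assumed source name, not shown in this hunk
    tables:
      - name: assessment_card
        description: '{{ doc("table_assessment_card") }}'
        data_tests:
          - unique_combination_of_columns:
              name: model_assessment_card_unique_by_pin_card_and_year
              combination_of_columns:
                - meta_pin
                - meta_card_num
                - meta_year
                - run_id
              config:
                error_if: ">5748"
              # No `meta.description` here: per the comment above, it is only
              # needed when a test feeds the Excel QC workbooks.
        tags:
          - load_auto
```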

        tags:
          - load_auto

      - name: assessment_pin
        description: '{{ doc("table_assessment_pin") }}'
        data_tests:
          - unique_combination_of_columns:
              name: model_assessment_pin_unique_by_pin_and_year
              combination_of_columns:
                - meta_pin
                - meta_year
                - run_id
              config:
                error_if: ">2016"
              meta:
                description: assessment pin should be unique by pin, year, and run_id
        tags:
          - load_auto

      - name: feature_importance
        data_tests:
          - unique_combination_of_columns:
              name: model_feature_importance_unique
              combination_of_columns:
                - year
                - run_id
                - model_predictor_all_name
              meta:
                description: feature importance should be unique by year, run_id, and model_predictor_all_name
        description: '{{ doc("table_feature_importance") }}'
        tags:
          - load_auto

      - name: metadata
        data_tests:
          - unique_combination_of_columns:
              name: model_metadata_unique_by_year_and_run_id
              combination_of_columns:
                - year
                - run_id
              meta:
                description: metadata should be unique by year and run_id
        description: '{{ doc("table_metadata") }}'
        tags:
          - load_auto

      - name: parameter_final
        description: '{{ doc("table_parameter_final") }}'
        data_tests:
          - unique_combination_of_columns:
              name: model_parameter_final_unique_by_year_and_run_id
              combination_of_columns:
                - year
                - run_id
              meta:
                description: parameter final should be unique by year and run_id
        tags:
          - load_auto

      - name: parameter_range
        description: '{{ doc("table_parameter_range") }}'
        data_tests:
          - unique_combination_of_columns:
              name: model_parameter_range_unique_by_year_run_id_and_parameter_name
              combination_of_columns:
                - year
                - run_id
                - parameter_name
              meta:
                description: parameter range should be unique by year and run_id
        tags:
          - load_auto

      - name: parameter_search
        description: '{{ doc("table_parameter_search") }}'
        data_tests:
          - unique_combination_of_columns:
              name: model_parameter_search_unique_by_year_run_id_and_iteration
              combination_of_columns:
Member

issue: I believe the full primary key here is year, run_id, iteration, configuration, fold_id. Let's update this test and the docs to reflect that.
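
A sketch of what the updated test could look like with the full key suggested above. The test name and the source name `model` are hypothetical, and the `error_if` threshold is left out on purpose: the existing `">2136"` value was measured against the three-column key and would need to be re-checked.

```yaml
sources:
  - name: model  # assumed source name, not shown in this hunk
    tables:
      - name: parameter_search
        description: '{{ doc("table_parameter_search") }}'
        data_tests:
          - unique_combination_of_columns:
              # Hypothetical name reflecting the wider key
              name: model_parameter_search_unique_by_year_run_id_iteration_configuration_and_fold_id
              combination_of_columns:
                - year
                - run_id
                - iteration
                - configuration
                - fold_id
              # Threshold intentionally omitted; re-measure any remaining
              # historical duplicates before setting `error_if`.
        tags:
          - load_auto
```

The matching docs change would list the same five columns on the **Primary Key** line for `parameter_search`.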

                - year
                - run_id
                - iteration
              config:
                error_if: ">2136"
              meta:
                description: parameter search should be unique by year, run_id, and iteration
        tags:
          - load_auto

      - name: performance
        description: '{{ doc("table_performance") }}'
        data_tests:
          - unique_combination_of_columns:
              name: model_performance_unique
              combination_of_columns:
                - year
                - run_id
                - stage
                - triad_code
                - geography_type
                - geography_id
                - class
              meta:
                description: performance should be unique by year, run_id, stage, triad_code, geography_type, geography_id, and class
        tags:
          - load_auto

      - name: performance_quantile
        description: '{{ doc("table_performance_quantile") }}'
        data_tests:
          - unique_combination_of_columns:
              name: model_performance_quantile_unique
              combination_of_columns:
                - year
                - run_id
                - triad_code
                - stage
                - geography_type
                - geography_id
                - class
                - num_quantile
                - quantile
              meta:
                description: >
                  performance quantile should be unique by year, run_id, stage, triad_code,
                  geography_type, by_class, geography_id, num_quantile, and quantile
        tags:
          - load_auto

      - name: shap
        description: '{{ doc("table_shap") }}'
        data_tests:
          - unique_combination_of_columns:
              name: model_shap_unique_by_year_run_id_meta_pin_meta_and_card_num
              combination_of_columns:
                - year
                - run_id
                - meta_pin
                - meta_card_num
              config:
                error_if: ">524"
              meta:
                description: shap should be unique by year, run_id, meta_pin, and meta_card_num
        tags:
          - load_auto

      - name: test_card
        description: '{{ doc("table_test_card") }}'
        data_tests:
          - unique_combination_of_columns:
              name: model_test_card_unique
Contributor Author (@Damonamajor, Dec 24, 2024)

This one has a super high error level. Is there a key that is missing? It seems like it also needs to be unique on sale if there are multiple sales per parcel.

Member

That's pretty bizarre. I don't think you're missing any key columns here. It's possible that in the past we had separate test set data frames for the linear vs lgbm model results, but I don't see any column that would separate them. IMO as long as more recent model runs aren't adding to the error count it's fine.
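
One possible alternative to a large threshold, not proposed in the PR, is to scope the test to recent runs with dbt's `where` test config so that historical duplicates are simply excluded. The sketch below assumes `year` is stored as a string and that `2024` is a sensible cutoff; both are illustrative guesses, as is the source name `model`.

```yaml
sources:
  - name: model  # assumed source name, not shown in this hunk
    tables:
      - name: test_card
        description: '{{ doc("table_test_card") }}'
        data_tests:
          - unique_combination_of_columns:
              name: model_test_card_unique
              combination_of_columns:
                - year
                - run_id
                - meta_pin
                - meta_card_num
                - meta_sale_document_num
              config:
                # Assumption: `year` is a string column and runs before 2024
                # hold the known duplicates. With older rows filtered out, no
                # `error_if` threshold is set, so any duplicate fails the test.
                where: "year >= '2024'"
        tags:
          - load_auto
```

The trade-off is that the cutoff has to be chosen and maintained by hand, whereas the `error_if` threshold used in the PR covers all years with a single number.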

              combination_of_columns:
                - year
                - run_id
                - meta_pin
                - meta_card_num
                - meta_sale_document_num
              config:
                error_if: ">102422"
              meta:
                description: test card should be unique by year, run_id, meta_pin, meta_card_num, and meta_sale_document_num
        tags:
          - load_auto

      - name: timing
        description: '{{ doc("table_timing") }}'
        data_tests:
          - unique_combination_of_columns:
              name: model_timing_unique_by_year_run_id
              combination_of_columns:
                - year
                - run_id
              meta:
                description: timing should be unique by year and run_id
        tags:
          - load_auto

@@ -996,4 +1127,4 @@ models:
          name: model_vw_pin_condo_input_unique_pin_year
          combination_of_columns:
            - meta_pin
-           - meta_year
+           - meta_year