Add dbt tests to model databases #686

Merged: 12 commits, Dec 27, 2024
18 changes: 9 additions & 9 deletions dbt/models/model/docs.md
@@ -34,7 +34,7 @@ Overall feature importance by model run (`run_id`).
Includes metrics such as gain, cover, and frequency. This is the output
of the built-in LightGBM/XGBoost feature importance methods.

-**Primary Key**: `year`, `run_id`, `model_predictor_name_all`
+**Primary Key**: `year`, `run_id`, `model_predictor_all_name`
> **Member:** Ah, thank you for updating these docs.

{% enddocs %}

# final_model
@@ -77,7 +77,7 @@ If hyperparameters are blank for a given run, then that parameter was not used.
Range of hyperparameters searched by a given model run (`run_id`)
during cross-validation.

-**Primary Key**: `year`, `run_id`
+**Primary Key**: `year`, `run_id`, `parameter_name`
{% enddocs %}

# parameter_search
@@ -86,7 +86,7 @@ during cross-validation.
Hyperparameters used for _every_ cross-validation iteration, along with
the corresponding performance statistics.

-**Primary Key**: `year`, `run_id`, `iteration`
+**Primary Key**: `year`, `run_id`, `iteration`, `configuration`, `fold_id`
{% enddocs %}

# performance
@@ -113,7 +113,7 @@ The stages are:
Identical to `model.performance`, but additionally broken out by quantile.

**Primary Key**: `year`, `run_id`, `stage`, `triad_code`, `geography_type`,
-`geography_id`, `by_class`, `quantile`
+`geography_id`, `by_class`, `num_quantile`, `quantile`
{% enddocs %}

# shap
@@ -138,7 +138,7 @@ The test set is the out-of-sample data used to evaluate model performance.
Predictions in this table are trained using only data _not in this set
of sales_.

-**Primary Key**: `year`, `run_id`, `meta_pin`, `meta_card_num`
+**Primary Key**: `year`, `run_id`, `meta_pin`, `meta_card_num`, `document_number`
> **Member:** nitpick (non-blocking): There's not (yet) a doc number column in this dataset.

> **Contributor Author:** This is a variable; do we want it? `meta_sale_document_num`

{% enddocs %}

# timing
@@ -165,7 +165,7 @@ data cached by DVC when possible. See
[model-res-avm#getting-data](https://github.com/ccao-data/model-res-avm#getting-data)
for more information.

-**Primary Key**: `year`, `run_id`, `meta_pin`, `meta_card_num`
+**Primary Key**: `year`, `meta_pin`, `meta_card_num`
{% enddocs %}

# vw_pin_condo_input
@@ -178,7 +178,7 @@ Observations are at the PIN-14 (condo unit) level. Unlike the residential
input view, this view does not perform filling. Instead condo characteristics
are backfilled in `default.vw_pin_condo_char`.

-**Primary Key**: `year`, `run_id`, `meta_pin`
+**Primary Key**: `year`, `meta_pin`
{% enddocs %}

# vw_pin_shared_input
@@ -187,5 +187,5 @@ are backfilled in `default.vw_pin_condo_char`.
View to compile PIN-level model inputs shared between the residential
(`model.vw_card_res_input`) and condo (`model.vw_pin_condo_input`) model views.

-**Primary Key**: `year`, `run_id`, `meta_pin`
+**Primary Key**: `year`, `meta_pin`
{% enddocs %}
110 changes: 110 additions & 0 deletions dbt/models/model/schema.yml
@@ -4,61 +4,171 @@ sources:
    tables:
      - name: assessment_card
        description: '{{ doc("table_assessment_card") }}'
        data_tests:
          - unique_combination_of_columns:
              name: model_assessment_card_unique_by_pin_card_and_year
              combination_of_columns:
                - meta_pin
                - meta_card_num
                - meta_year
                - run_id
              config:
                # We set a fixed error threshold since duplicated data existed
                # before these tests were implemented. If duplicated data is
                # added after 12/24/2024, warnings will transition to errors.
                error_if: ">5748"
> **Member:** suggestion: It would be handy to have a comment here about why we're setting this threshold, i.e. there are past errors that we don't intend to fix, but we want to prevent future dupes.
        tags:
          - load_auto
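For context, a `unique_combination_of_columns` test like the one above compiles to a grouped query, and the number of rows it returns is what the `warn_if`/`error_if` expressions are evaluated against. A minimal sketch of that shape (the actual macro output differs in wrapping and aliasing):

```sql
-- Approximate shape of the compiled uniqueness test for assessment_card.
-- Each returned row is one duplicated key combination; dbt compares the
-- returned row count against warn_if/error_if (here, error_if: ">5748").
select
    meta_pin,
    meta_card_num,
    meta_year,
    run_id,
    count(*) as n_records
from model.assessment_card
group by meta_pin, meta_card_num, meta_year, run_id
having count(*) > 1
```

With this framing, the fixed threshold reads naturally: 5,748 duplicated combinations predate the test, and any count above that fails the run.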

      - name: assessment_pin
        description: '{{ doc("table_assessment_pin") }}'
        data_tests:
          - unique_combination_of_columns:
              name: model_assessment_pin_unique_by_pin_year_and_run_id
              combination_of_columns:
                - meta_pin
                - meta_year
                - run_id
              config:
                error_if: ">2016"
        tags:
          - load_auto

      - name: feature_importance
        data_tests:
          - unique_combination_of_columns:
              name: model_feature_importance_unique
              combination_of_columns:
                - year
                - run_id
                - model_predictor_all_name
        description: '{{ doc("table_feature_importance") }}'
        tags:
          - load_auto

      - name: metadata
        data_tests:
          - unique_combination_of_columns:
              name: model_metadata_unique_by_year_and_run_id
              combination_of_columns:
                - year
                - run_id
        description: '{{ doc("table_metadata") }}'
        tags:
          - load_auto

      - name: parameter_final
        description: '{{ doc("table_parameter_final") }}'
        data_tests:
          - unique_combination_of_columns:
              name: model_parameter_final_unique_by_year_and_run_id
              combination_of_columns:
                - year
                - run_id
        tags:
          - load_auto

      - name: parameter_range
        description: '{{ doc("table_parameter_range") }}'
        data_tests:
          - unique_combination_of_columns:
              name: model_parameter_range_unique_by_year_run_id_and_parameter_name
              combination_of_columns:
                - year
                - run_id
                - parameter_name
        tags:
          - load_auto

      - name: parameter_search
        description: '{{ doc("table_parameter_search") }}'
        data_tests:
          - unique_combination_of_columns:
              name: model_parameter_search_unique
              combination_of_columns:
> **Member:** issue: I believe the full primary key here is `year`, `run_id`, `iteration`, `configuration`, `fold_id`. Let's update this test and the docs to reflect that.
                - year
                - run_id
                - iteration
                - configuration
                - fold_id
              config:
                error_if: ">400"
        tags:
          - load_auto
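One way to validate the reviewer's proposed five-column key is to check whether `configuration` and `fold_id` fully disambiguate rows that share `year`, `run_id`, and `iteration`. A sketch, assuming the table is queryable as `model.parameter_search` and both columns cast cleanly to varchar:

```sql
-- For each (year, run_id, iteration) group with multiple rows, compare the
-- row count to the number of distinct (configuration, fold_id) pairs; they
-- match everywhere if and only if the five-column key is unique.
select
    year,
    run_id,
    iteration,
    count(*) as n_rows,
    count(distinct cast(configuration as varchar) || '|' || cast(fold_id as varchar)) as n_distinct_pairs
from model.parameter_search
group by year, run_id, iteration
having count(*) > count(distinct cast(configuration as varchar) || '|' || cast(fold_id as varchar))
```

Any rows returned are groups where even the extended key fails to be unique, which would explain a nonzero `error_if` floor.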

      - name: performance
        description: '{{ doc("table_performance") }}'
        data_tests:
          - unique_combination_of_columns:
              name: model_performance_unique
              combination_of_columns:
                - year
                - run_id
                - stage
                - triad_code
                - geography_type
                - geography_id
                - class
        tags:
          - load_auto

      - name: performance_quantile
        description: '{{ doc("table_performance_quantile") }}'
        data_tests:
          - unique_combination_of_columns:
              name: model_performance_quantile_unique
              combination_of_columns:
                - year
                - run_id
                - triad_code
                - stage
                - geography_type
                - geography_id
                - class
                - num_quantile
                - quantile
        tags:
          - load_auto

      - name: shap
        description: '{{ doc("table_shap") }}'
        data_tests:
          - unique_combination_of_columns:
              name: model_shap_unique_by_year_run_id_meta_pin_meta_and_card_num
              combination_of_columns:
                - year
                - run_id
                - meta_pin
                - meta_card_num
              config:
                error_if: ">524"
        tags:
          - load_auto

      - name: test_card
        description: '{{ doc("table_test_card") }}'
        data_tests:
          - unique_combination_of_columns:
              name: model_test_card_unique
> **Contributor Author (@Damonamajor), Dec 24, 2024:** This one has a super high error level. Is there a key that is missing? It seems like it also needs to be unique on sale if there are multiple sales per parcel.
> **Member:** That's pretty bizarre. I don't think you're missing any key columns here. It's possible that in the past we had separate test set data frames for the linear vs. lgbm model results, but I don't see any column that would separate them. IMO, as long as more recent model runs aren't adding to the error count, it's fine.
              combination_of_columns:
                - year
                - run_id
                - meta_pin
                - meta_card_num
                - meta_sale_document_num
              config:
                error_if: ">102422"
        tags:
          - load_auto
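Per the thread above, the practical question is whether recent model runs are adding to the error count. A sketch for tallying duplicate key combinations per `run_id` (the table name `model.test_card` is assumed here):

```sql
-- Tally duplicated key combinations by run_id; recent runs should
-- contribute zero rows if newly added data is clean.
select
    run_id,
    count(*) as n_duplicate_keys
from (
    select year, run_id, meta_pin, meta_card_num, meta_sale_document_num
    from model.test_card
    group by year, run_id, meta_pin, meta_card_num, meta_sale_document_num
    having count(*) > 1
) as dupes
group by run_id
order by run_id
```

If only old `run_id` values appear, the high fixed threshold is purely historical and the test still guards against new duplicates.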

      - name: timing
        description: '{{ doc("table_timing") }}'
        data_tests:
          - unique_combination_of_columns:
              name: model_timing_unique_by_year_run_id
              combination_of_columns:
                - year
                - run_id
        tags:
          - load_auto
