
Add dbt tests to model databases #686

Merged: 12 commits into master, Dec 27, 2024
Conversation

@Damonamajor (Contributor) commented Dec 20, 2024

This PR adds dbt tests that fail past a set number of errors. Three tests currently fail and should be fixable by deleting the duplicated data:

  • model_timing: relaxed-Tristan and stupefied-maya
  • model_parameter_final: clever-kyra
  • model_feature_importance: sad-tristan

The other tests are set to warn at a fixed threshold. If we upload duplicated data on these keys, the warnings will escalate to errors and we will know to debug further. The thresholds were set a few days ago and did not error during the final commit, so duplication does not appear to be a recurring issue.

The keys are updated with two new variables that should make the data unique.
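For reference, the pattern used throughout this PR looks roughly like the sketch below. The table is one of the failing ones named above, but the key columns and threshold values are illustrative, and `unique_combination_of_columns` is assumed to be the project's generic test as seen in the diffs further down:

```yaml
models:
  - name: model_timing
    data_tests:
      - unique_combination_of_columns:
          name: model_timing_unique
          combination_of_columns:  # illustrative key; actual columns may differ
            - year
            - run_id
          config:
            warn_if: ">0"      # surface any duplicate keys as a warning
            error_if: ">100"   # fail only past the known historical count
```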

@Damonamajor linked an issue on Dec 24, 2024 that may be closed by this pull request
@Damonamajor marked this pull request as ready for review on December 24, 2024 02:10
@Damonamajor requested a review from a team as a code owner on December 24, 2024 02:10
    tags:
      - load_auto

  - name: test_card
    description: '{{ doc("table_test_card") }}'
    data_tests:
      - unique_combination_of_columns:
          name: model_test_card_unique
@Damonamajor (Contributor, Author) commented Dec 24, 2024:

This one has a super high error count. Is there a key that's missing? It seems like it also needs to be unique by sale if there are multiple sales per parcel.

Member:
That's pretty bizarre. I don't think you're missing any key columns here. It's possible that in the past we had separate test set data frames for the linear vs lgbm model results, but I don't see any column that would separate them. IMO as long as more recent model runs aren't adding to the error count it's fine.
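One quick way to size up duplicates like this is a grouped count over the candidate key. A sketch in Athena-style SQL; the qualified table name is an assumption, and the key columns are the ones documented for the test set below:

```sql
-- Each group with n > 1 is a duplicate key that the uniqueness test
-- counts toward its failure threshold.
SELECT year, run_id, meta_pin, meta_card_num, COUNT(*) AS n
FROM model.test_card  -- assumed table location
GROUP BY year, run_id, meta_pin, meta_card_num
HAVING COUNT(*) > 1
ORDER BY n DESC;
```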

@dfsnow (Member) left a comment:
@Damonamajor This looks great. Some super minor changes to make and then this is good to merge.

@@ -34,7 +34,7 @@ Overall feature importance by model run (`run_id`).
Includes metrics such as gain, cover, and frequency. This is the output
of the built-in LightGBM/XGBoost feature importance methods.

-**Primary Key**: `year`, `run_id`, `model_predictor_name_all`
+**Primary Key**: `year`, `run_id`, `model_predictor_all_name`
Member:
Ah, thank you for updating these docs.

@@ -188,4 +188,4 @@ View to compile PIN-level model inputs shared between the residential
(`model.vw_card_res_input`) and condo (`model.vw_pin_condo_input`) model views.

**Primary Key**: `year`, `run_id`, `meta_pin`
Member:
suggestion: None of the `model.vw_*` views actually have `run_id` as a component of their primary key. Would be nice to remove it for those views.

            - meta_year
            - run_id
          config:
            error_if: ">5748"
Member:
suggestion: It would be handy to have a comment here about why we're setting this threshold, i.e., there are past errors that we don't intend to fix, but we want to prevent future dupes.
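Something like the sketch below (the comment wording is illustrative):

```yaml
          config:
            # There are 5,748 known duplicate rows from past runs that we
            # don't intend to fix; error only if new duplicates appear.
            error_if: ">5748"
```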

dbt/models/model/schema.yml (outdated comment, resolved)
    data_tests:
      - unique_combination_of_columns:
          name: model_parameter_search_unique_by_year_run_id_and_iteration
          combination_of_columns:
Member:
issue: I believe the full primary key here is `year`, `run_id`, `iteration`, `configuration`, `fold_id`. Let's update this test and the docs to reflect that.
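That is, something like this sketch (columns taken from the comment above; the shortened test name is hypothetical):

```yaml
      - unique_combination_of_columns:
          name: model_parameter_search_unique  # hypothetical updated name
          combination_of_columns:
            - year
            - run_id
            - iteration
            - configuration
            - fold_id
```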

          config:
            error_if: ">5748"
          meta:
            description: assessment card should be unique by pin, card, year, and run_id
Member:
suggestion: I don't think we actually need a description for these tests, since descriptions are mainly used as input for constructing Excel QC workbooks. These tests run exclusively in GitHub Actions (i.e., they are data tests only), so the description isn't needed.

@Damonamajor requested a review from @dfsnow on December 24, 2024 19:59
@dfsnow (Member) left a comment:
@Damonamajor Let's add failure thresholds to the last remaining failing tests, then have @jeancochrane take a look at this. After that it should be set to go.

@@ -138,7 +138,7 @@ The test set is the out-of-sample data used to evaluate model performance.
Predictions in this table are trained using only data _not in this set
of sales_.

-**Primary Key**: `year`, `run_id`, `meta_pin`, `meta_card_num`
+**Primary Key**: `year`, `run_id`, `meta_pin`, `meta_card_num`, `document_number`
Member:
nitpick (non-blocking): There's not (yet) a doc number column in this dataset.

@Damonamajor (Contributor, Author):
There is an existing variable for this: `meta_sale_document_num`. Do we want to use it?
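If adopted, the documented key would presumably become **Primary Key**: `year`, `run_id`, `meta_pin`, `meta_card_num`, `meta_sale_document_num`.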

@Damonamajor (Contributor, Author) commented Dec 26, 2024:

> @Damonamajor Let's add failure thresholds to the last remaining failing tests, then have @jeancochrane take a look at this. After that it should be set to go.

@dfsnow The remaining failing tests should be easy to fix (using the notes in the original comment). Do we want to fix them, or add the thresholds?

@dfsnow merged commit 56ddbd5 into master on Dec 27, 2024
7 checks passed
@dfsnow deleted the Add-DBT-tests-to-model-databases branch on December 27, 2024 20:54
Successfully merging this pull request may close: Add dbt data integrity tests to the model database tables