Add models for exporting daily Commercial QC reference files #678
Conversation
Still todo:

* Add docstring to new macro
* Format decimals and strings properly
```yaml
- name: ACRES
  index: J
  data_type: float
```
I'm a little confused why basically all of our fields get written to Excel as strings with `df.to_excel()`, given that all of these fields have correct types in iasWorld and we're using `unload=True` when querying them. Specifying `data_type` fixes this issue, but do you have insight into why it might be happening in the first place?
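A contrived sketch of the behavior (the column values here are made up):

```python
import pandas as pd

# The cursor hands us object-dtype columns holding strings
df = pd.DataFrame({"ACRES": ["1.5", "2.0"]})
print(df.dtypes)  # ACRES    object

# openpyxl writes object-dtype values as text cells, not numbers
df.to_excel("reference.xlsx", index=False)

# Casting per the schema's data_type produces numeric cells instead
df["ACRES"] = df["ACRES"].astype(float)
df.to_excel("reference.xlsx", index=False)
```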
It looks like this is a PyAthena problem. I checked, and all the types returned when `unload=True` are just `object` (i.e. presumably Parquet/Arrow types). `openpyxl` seems to convert all `object` dtypes to string by default.

However, it seems like PyAthena allows for a custom type converter (scroll down a bit) that would probably fix this problem. I recommend we try that route.
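A sketch of what that might look like, based on the PyAthena docs (untested against our infra; the staging bucket is a placeholder and the `decimal`-to-`float` mapping is just an example):

```python
from pyathena import connect
from pyathena.converter import DefaultTypeConverter
from pyathena.pandas.cursor import PandasCursor


class CustomTypeConverter(DefaultTypeConverter):
    """Example converter that changes how Athena DECIMAL values come back."""

    def __init__(self):
        super().__init__()
        # Map Athena DECIMAL to float instead of decimal.Decimal
        self.mappings.update(
            {"decimal": lambda v: float(v) if v is not None else None}
        )


conn = connect(
    s3_staging_dir="s3://example-athena-staging/",  # placeholder bucket
    region_name="us-east-1",
    cursor_class=PandasCursor,
    converter=CustomTypeConverter(),
)
df = conn.cursor().execute("SELECT * FROM iasworld.comdat LIMIT 10").as_pandas()
```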
PyAthena doesn't seem to work the way I would expect based on the docs. It seems like `unload` should perform this type conversion for us, without the need for a custom converter:

> If the unload option is enabled, the Parquet file itself has a schema, so the conversion is done to the dtypes according to that schema, and the mappings and types settings of the Converter class are not used.

In practice, however, `PandasCursor` with `unload=True` seems to always return the `object` type, while `ArrowCursor` performs the conversion I would expect (e.g. `comdat.ovrrcnld` gets converted to `pyarrow.decimal128(10, 0)`). I suspect this may be exposing a bug in the PyAthena implementation, but I don't want to spend too much time isolating it right now.
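Roughly, the comparison looks like this (a sketch, with placeholder connection params):

```python
from pyathena import connect
from pyathena.arrow.cursor import ArrowCursor
from pyathena.pandas.cursor import PandasCursor

# Placeholder connection params -- swap in our actual staging bucket/region
kwargs = {
    "s3_staging_dir": "s3://example-athena-staging/",
    "region_name": "us-east-1",
}
query = "SELECT ovrrcnld FROM iasworld.comdat LIMIT 10"

# PandasCursor with unload=True: everything comes back as object dtype
cursor = connect(cursor_class=PandasCursor, **kwargs).cursor(unload=True)
print(cursor.execute(query).as_pandas().dtypes)  # ovrrcnld    object

# ArrowCursor with unload=True: the Parquet schema is respected
cursor = connect(cursor_class=ArrowCursor, **kwargs).cursor(unload=True)
print(cursor.execute(query).as_arrow().schema)  # ovrrcnld: decimal128(10, 0)
```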
Most likely due to the automatic conversion that the docs describe, the custom converter also doesn't seem to do anything when set on a cursor with `unload=True`. We could switch to a different cursor type, but I do think the pandas cursor with `unload=True` is the easiest one to work with in this case. I also don't think a custom converter would substantially reduce our code even if we could get it working, because our Athena types are often ambiguous (e.g. `DECIMAL` sometimes indicates a float and sometimes indicates an integer, and `VARCHAR` sometimes indicates a decimal stored as a string). We might save a little bit of space on the schema specification, but I don't think it's worth it for this particular application. Definitely open to other opinions though!
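To make the ambiguity concrete, a hypothetical sketch (only `ACRES` is a real column from this PR; the others are made up). A converter only sees the Athena type, whereas the per-column schema lets us encode the intended Python type directly:

```python
import decimal

# Two columns can share an Athena type but need different Python types,
# so a type-level converter can't distinguish them
column_schema = {
    "ACRES": "float",    # DECIMAL in Athena, genuinely fractional
    "UNITS": "int",      # DECIMAL in Athena, always a whole number
    "RATE": "decimal",   # VARCHAR in Athena, a numeric stored as a string
}

TYPE_FUNCS = {"float": float, "int": int, "decimal": decimal.Decimal}

def coerce(value, data_type):
    """Coerce a raw string value using the schema's declared data_type."""
    return TYPE_FUNCS[data_type](value) if value is not None else None

print(coerce("1.5", column_schema["ACRES"]))   # 1.5
print(coerce("4", column_schema["UNITS"]))     # 4
print(coerce("0.25", column_schema["RATE"]))   # Decimal('0.25')
```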
Alright, let's just go with the existing setup then, as it works and is perfectly clear.
@jeancochrane Data type issues aside, this is looking great! Nice work. Looking forward to getting these out.
```sql
FROM {{ source('iasworld', 'aprval') }} AS aprval
LEFT JOIN {{ source('iasworld', 'pardat') }} AS pardat
    ON aprval.parid = pardat.parid
    AND CAST(aprval.taxyr AS INT) + 1 = CAST(pardat.taxyr AS INT)
    AND pardat.cur = 'Y'
    AND pardat.deactivat IS NULL
```
nitpick: I don't think it changes the results here, but to me it makes more sense to use `pardat` as the main source table (after `FROM`), since that's the source of your `taxyr` and `parid` columns.
Unfortunately, the output record counts in the view are quite different depending on whether you use `pardat` or `aprval` as the left side of the join (7,345,285 for `pardat` vs. 6,980,263 for `aprval`). Perhaps this is a question to raise with this view's stakeholders, but for now I chose `aprval` as the left side of the join to exactly match the ias query that currently produces these data.
@dfsnow All your edits are in, let me know if this is ready to go!
@jeancochrane Sorry for the long delay! This looks great to me. All set.
```python
elif data_type == "decimal":
    type_func = decimal.Decimal
```
question (non-blocking): Just out of curiosity, what's the deal with the decimal type here? Does it maintain the fixed precision of the Parquet/Spark types?
Pretty much! We have to process a couple of fields that are stored as strings in iasWorld, but actually represent arbitrary numerics. The Python `decimal.Decimal` type should let us handle those columns gracefully, without the pitfalls of both `int` (which will truncate any decimals) and `float` (which will add decimals to ints). Annoying that we have to do this, but at least it works.
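A quick sketch of the pitfalls, using made-up values:

```python
from decimal import Decimal

# Hypothetical raw values from an iasWorld column that stores numerics as strings
raw = ["150000", "150000.25"]

# float() parses both, but rewrites whole numbers with a trailing .0
print([float(v) for v in raw])        # [150000.0, 150000.25]

# int() truncates the fractional value (and int("150000.25") raises directly,
# so you'd have to round-trip through float first)
print([int(float(v)) for v in raw])   # [150000, 150000]

# Decimal preserves each value exactly as written
print([Decimal(v) for v in raw])      # [Decimal('150000'), Decimal('150000.25')]
```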
I snuck in one quick change before merging: 37b49ad passes the …
This PR completes the first task in #661 by creating dbt models corresponding to the reference files that the Commercial team exports daily during desk review.
A subsequent PR will fix a data type bug in https://github.com/ccao-data/service-spark-iasworld that is causing slight discrepancies in values in four columns in the land details sheet. Once both PRs have landed, we will be ready to schedule a daily process on the VM.