
Add models for exporting daily Commercial QC reference files #678

Merged

Conversation

@jeancochrane (Contributor) commented Dec 10, 2024

This PR completes the first task in #661 by creating dbt models corresponding to the reference files that the Commercial team exports daily during desk review.

A subsequent PR will fix a data type bug in https://github.com/ccao-data/service-spark-iasworld that is causing slight discrepancies in values in four columns in the land details sheet. Once both PRs have landed, we will be ready to schedule a daily process on the VM.

@jeancochrane jeancochrane linked an issue Dec 10, 2024 that may be closed by this pull request
@jeancochrane jeancochrane changed the title Jeancochrane/661 add qc views for daily commercial queries Add models for exporting daily Commercial QC reference files Dec 10, 2024
dbt/macros/.sqlfluff (resolved)
dbt/macros/insert_hyphens.sql (resolved)
Comment on lines +236 to +238
- name: ACRES
  index: J
  data_type: float
@jeancochrane (Contributor, Author):
I'm a little confused why basically all of our fields get written to Excel as strings with df.to_excel(), given that all of these fields have correct types in iasWorld and we're using unload=True when querying them. Specifying data_type fixes this issue, but do you have insight into why it might be happening in the first place?
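A minimal sketch of the symptom and the fix (hypothetical column values, assuming pandas; the real export config drives this via the `data_type` key):

```python
import pandas as pd

# Simulate what the cursor appears to return: every column arrives
# as object dtype, so the Excel writer treats the values as strings.
df = pd.DataFrame(
    {"ACRES": ["1.25", "0.5"], "PARID": ["123", "456"]}, dtype=object
)

# An explicit per-column mapping (what the data_type key provides)
# coerces the columns to real numeric dtypes before df.to_excel().
data_types = {"ACRES": float}
for col, type_ in data_types.items():
    df[col] = df[col].astype(type_)
```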

@dfsnow (Member):

It looks like this is a PyAthena problem. I checked, and all the types returned when unload=True are just Object, i.e. presumably Parquet/Arrow types. openpyxl seems to convert all Object dtypes to string by default.

However, it seems like PyAthena allows for a custom type converter (scroll down a bit) that would probably fix this problem. I recommend we try that route.

@jeancochrane (Contributor, Author):

PyAthena doesn't seem to work the way I would expect based on the docs. It seems like unload should perform this type conversion for us, without the need for a custom converter:

If the unload option is enabled, the Parquet file itself has a schema, so the conversion is done to the dtypes according to that schema, and the mappings and types settings of the Converter class are not used.

In practice, though, PandasCursor with unload=True seems to always return the Object type, whereas ArrowCursor performs the conversion I would expect (e.g. comdat.ovrrcnld gets converted to pyarrow.decimal128(10, 0)). I suspect this may be exposing a bug in the PyAthena implementation, but I don't want to spend too much time isolating it right now.

Most likely due to the automatic conversion that the docs describe, the custom converter also doesn't seem to do anything when set on a cursor with unload=True. We could switch to a different cursor type, but I do think the pandas cursor with unload=True is the easiest one to work with in this case. I also don't think a custom converter would substantially reduce our code even if we could get it working, because our Athena types are often ambiguous (e.g. DECIMAL sometimes indicates a float and sometimes indicates an integer, and VARCHAR sometimes indicates a decimal stored as a string). We might save a little bit of space on the schema specification, but I don't think it's worth it for this particular application. Definitely open to other opinions though!
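A toy illustration of the ambiguity argument above (fabricated column names and values, pure Python): since one Athena type name can require different Python types depending on the column, a converter keyed only on the Athena type can't disambiguate, while a per-column schema can.

```python
import decimal

# The same Athena type name maps to different Python types per
# column, so a converter keyed on the type name alone is ambiguous:
athena_types = {
    "units": "DECIMAL",     # really an integer count
    "acres": "DECIMAL",     # really a float
    "ovrrcnld": "VARCHAR",  # really a decimal stored as a string
}

# A per-column schema (as in the export config) resolves this.
column_types = {"units": int, "acres": float, "ovrrcnld": decimal.Decimal}

row = {"units": "12", "acres": "1.25", "ovrrcnld": "20500"}
converted = {col: column_types[col](val) for col, val in row.items()}
```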

@dfsnow (Member):

Alright, let's just go with the existing setup then, as it works and is perfectly clear.

dbt/models/qc/schema.yml (resolved)
dbt/seeds/ccao/ccao.commercial_subclass_1.csv (outdated, resolved)
dbt/seeds/ccao/docs.md (resolved)
pyproject.toml (resolved)
@dfsnow (Member) left a comment:
@jeancochrane Data type issues aside, this is looking great! Nice work. Looking forward to getting these out.

dbt/macros/insert_hyphens.sql (resolved)
dbt/macros/tests/test_insert_hyphens.sql (resolved)
dbt/macros/tests/test_insert_hyphens.sql (outdated, resolved)
Comment on lines +22 to +27
FROM {{ source('iasworld', 'aprval') }} AS aprval
LEFT JOIN {{ source('iasworld', 'pardat') }} AS pardat
    ON aprval.parid = pardat.parid
    AND CAST(aprval.taxyr AS INT) + 1 = CAST(pardat.taxyr AS INT)
    AND pardat.cur = 'Y'
    AND pardat.deactivat IS NULL
@dfsnow (Member):
nitpick: I don't think it changes the results here, but to me it makes more sense to use pardat as the main source table (after FROM) since that's the source of your taxyr and parid columns.

@jeancochrane (Contributor, Author):

Unfortunately, the output record counts in the view are quite different depending on whether you use pardat or aprval as the left side of the join (7,345,285 for pardat vs. 6,980,263 for aprval). Perhaps this is a question to raise with this view's stakeholders, but for now I chose aprval as the left side of the join to exactly match the ias query that currently produces these data.
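A toy illustration of why the left side of the join changes the row count (fabricated parids, assuming pandas; the taxyr condition is omitted for brevity): a LEFT JOIN keeps every row of the left table, so when each table contains parcels the other lacks, swapping the sides changes the output size.

```python
import pandas as pd

# Toy parcel tables: each side has parcels the other lacks.
aprval = pd.DataFrame({"parid": ["A", "B", "D"], "taxyr": [2023, 2023, 2023]})
pardat = pd.DataFrame({"parid": ["A", "C"], "taxyr": [2024, 2024]})

# LEFT JOIN keeps all rows from the left table, so the counts differ.
aprval_left = aprval.merge(
    pardat, on="parid", how="left", suffixes=("_aprval", "_pardat")
)
pardat_left = pardat.merge(
    aprval, on="parid", how="left", suffixes=("_pardat", "_aprval")
)
```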

dbt/scripts/export_models.py (resolved)
@jeancochrane jeancochrane requested a review from dfsnow December 19, 2024 21:02
@jeancochrane (Contributor, Author):

@dfsnow All your edits are in, let me know if this is ready to go!

@dfsnow (Member) left a comment:
@jeancochrane Sorry for the long delay! This looks great to me. All set.

Comment on lines +369 to +370
elif data_type == "decimal":
    type_func = decimal.Decimal
@dfsnow (Member):

question (non-blocking): Just out of curiosity, what's the deal with the decimal type here? Does it maintain the fixed precision of the Parquet/Spark types?

@jeancochrane (Contributor, Author):

Pretty much! We have to process a couple of fields that are stored as strings in iasWorld, but actually represent arbitrary numerics. The Python decimal.Decimal type should let us handle those columns gracefully, without the pitfalls of both int (which will truncate any decimals) and float (which will add decimals to ints). Annoying that we have to do this, but at least it works.
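A quick demonstration of the pitfalls mentioned (fabricated values, pure Python):

```python
import decimal

# Numerics stored as strings, as in the iasWorld columns described.
raw = ["12", "1.25"]

# float() would turn "12" into 12.0, and int() would reject "1.25"
# with a ValueError; Decimal preserves each value exactly as written.
as_decimal = [decimal.Decimal(v) for v in raw]
```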

@jeancochrane (Contributor, Author):

I snuck in one quick change before merging: 37b49ad passes the target argument through from export_models to the helper function query_models_for_export. This is how export_models was always supposed to work, to the extent that the query_models_for_export function expected a target kwarg even before this change; I think I just forgot to pass the arg through in #626.
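A simplified sketch of the bug pattern described (hypothetical signatures; the real functions live in dbt/scripts/export_models.py): the helper already accepted a `target` kwarg, but the caller never forwarded it, so the helper always fell back to its default.

```python
def query_models_for_export(models, target="dev"):
    # Helper that always expected a target kwarg.
    return f"querying {len(models)} models against {target}"

def export_models_before_fix(models, target="dev"):
    # Bug: `target` is accepted but silently dropped here.
    return query_models_for_export(models)

def export_models_after_fix(models, target="dev"):
    # Fix: pass the argument through to the helper.
    return query_models_for_export(models, target=target)
```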

@jeancochrane jeancochrane merged commit 8142ab7 into master Dec 26, 2024
8 checks passed
@jeancochrane jeancochrane deleted the jeancochrane/661-add-qc-views-for-daily-commercial-queries branch December 26, 2024 21:41
Successfully merging this pull request may close these issues.

Add QC views for daily commercial queries
2 participants