Add models for exporting daily Commercial QC reference files #678
Conversation
Still todo:

* Add docstring to new macro
* Format decimals and strings properly
```yaml
- name: ACRES
  index: J
  data_type: float
```
I'm a little confused why basically all of our fields get written to Excel as strings with `df.to_excel()`, given that all of these fields have correct types in iasWorld and we're using `unload=True` when querying them. Specifying `data_type` fixes this issue, but do you have insight into why it might be happening in the first place?
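A contrived sketch of the behavior (the column values here are made up):

```python
import pandas as pd

# The cursor hands us object-dtype columns holding strings
df = pd.DataFrame({"ACRES": ["1.5", "2.0"]})
print(df.dtypes)  # ACRES    object

# openpyxl writes object-dtype values as text cells, not numbers
df.to_excel("reference.xlsx", index=False)

# Casting per the schema's data_type produces numeric cells instead
df["ACRES"] = df["ACRES"].astype(float)
df.to_excel("reference.xlsx", index=False)
```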
It looks like this is a PyAthena problem. I checked, and all the types returned when `unload=True` are just `object` (i.e. presumably Parquet/Arrow types). `openpyxl` seems to convert all `object` dtypes to string by default.

However, it seems like PyAthena allows for a custom type converter (scroll down a bit) that would probably fix this problem. I recommend we try that route.
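A sketch of what that might look like, based on the PyAthena docs (untested against our infra; the staging bucket is a placeholder and the `decimal`-to-`float` mapping is just an example):

```python
from pyathena import connect
from pyathena.converter import DefaultTypeConverter
from pyathena.pandas.cursor import PandasCursor


class CustomTypeConverter(DefaultTypeConverter):
    """Example converter that changes how Athena DECIMAL values come back."""

    def __init__(self):
        super().__init__()
        # Map Athena DECIMAL to float instead of decimal.Decimal
        self.mappings.update(
            {"decimal": lambda v: float(v) if v is not None else None}
        )


conn = connect(
    s3_staging_dir="s3://example-athena-staging/",  # placeholder bucket
    region_name="us-east-1",
    cursor_class=PandasCursor,
    converter=CustomTypeConverter(),
)
df = conn.cursor().execute("SELECT * FROM iasworld.comdat LIMIT 10").as_pandas()
```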
PyAthena doesn't seem to work the way I would expect based on the docs. It seems like `unload` should perform this type conversion for us, without the need for a custom converter:

> If the unload option is enabled, the Parquet file itself has a schema, so the conversion is done to the dtypes according to that schema, and the mappings and types settings of the Converter class are not used.

In practice, however, `PandasCursor` with `unload=True` seems to always return the `object` type, while `ArrowCursor` performs the conversion I would expect (e.g. `comdat.ovrrcnld` gets converted to `pyarrow.decimal128(10, 0)`). I suspect this may be exposing a bug in the PyAthena implementation, but I don't want to spend too much time isolating it right now.
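Roughly, the comparison looks like this (a sketch, with placeholder connection params):

```python
from pyathena import connect
from pyathena.arrow.cursor import ArrowCursor
from pyathena.pandas.cursor import PandasCursor

# Placeholder connection params -- swap in our actual staging bucket/region
kwargs = {
    "s3_staging_dir": "s3://example-athena-staging/",
    "region_name": "us-east-1",
}
query = "SELECT ovrrcnld FROM iasworld.comdat LIMIT 10"

# PandasCursor with unload=True: everything comes back as object dtype
cursor = connect(cursor_class=PandasCursor, **kwargs).cursor(unload=True)
print(cursor.execute(query).as_pandas().dtypes)  # ovrrcnld    object

# ArrowCursor with unload=True: the Parquet schema is respected
cursor = connect(cursor_class=ArrowCursor, **kwargs).cursor(unload=True)
print(cursor.execute(query).as_arrow().schema)  # ovrrcnld: decimal128(10, 0)
```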
Most likely due to the automatic conversion that the docs describe, the custom converter also doesn't seem to do anything when set on a cursor with `unload=True`. We could switch to a different cursor type, but I do think the pandas cursor with `unload=True` is the easiest one to work with in this case. I also don't think a custom converter would substantially reduce our code even if we could get it working, because our Athena types are often ambiguous (e.g. `DECIMAL` sometimes indicates a float and sometimes indicates an integer, and `VARCHAR` sometimes indicates a decimal stored as a string). We might save a little bit of space on the schema specification, but I don't think it's worth it for this particular application. Definitely open to other opinions though!
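To make the ambiguity concrete, a hypothetical sketch (only `ACRES` is a real column from this PR; the others are made up). A converter only sees the Athena type, whereas the per-column schema lets us encode the intended Python type directly:

```python
import decimal

# Two columns can share an Athena type but need different Python types,
# so a type-level converter can't distinguish them
column_schema = {
    "ACRES": "float",    # DECIMAL in Athena, genuinely fractional
    "UNITS": "int",      # DECIMAL in Athena, always a whole number
    "RATE": "decimal",   # VARCHAR in Athena, a numeric stored as a string
}

TYPE_FUNCS = {"float": float, "int": int, "decimal": decimal.Decimal}

def coerce(value, data_type):
    """Coerce a raw string value using the schema's declared data_type."""
    return TYPE_FUNCS[data_type](value) if value is not None else None

print(coerce("1.5", column_schema["ACRES"]))   # 1.5
print(coerce("4", column_schema["UNITS"]))     # 4
print(coerce("0.25", column_schema["RATE"]))   # Decimal('0.25')
```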
Alright, let's just go with the existing setup then, as it works and is perfectly clear.
@jeancochrane Data type issues aside, this is looking great! Nice work. Looking forward to getting these out.
```sql
FROM {{ source('iasworld', 'aprval') }} AS aprval
LEFT JOIN {{ source('iasworld', 'pardat') }} AS pardat
    ON aprval.parid = pardat.parid
    AND CAST(aprval.taxyr AS INT) + 1 = CAST(pardat.taxyr AS INT)
    AND pardat.cur = 'Y'
    AND pardat.deactivat IS NULL
```
nitpick: I don't think it changes the results here, but to me it makes more sense to use `pardat` as the main source table (after `FROM`), since that's the source of your `taxyr` and `parid` columns.
Unfortunately, the output record counts in the view are quite different depending on whether you use `pardat` or `aprval` as the left side of the join (7,345,285 for `pardat` vs. 6,980,263 for `aprval`). Perhaps this is a question to raise with this view's stakeholders, but for now I chose `aprval` as the left side of the join to exactly match the ias query that currently produces these data.
@dfsnow All your edits are in, let me know if this is ready to go!
@jeancochrane Sorry for the long delay! This looks great to me. All set.
```python
elif data_type == "decimal":
    type_func = decimal.Decimal
```
question (non-blocking): Just out of curiosity, what's the deal with the decimal type here? Does it maintain the fixed precision of the Parquet/Spark types?
Pretty much! We have to process a couple of fields that are stored as strings in iasWorld, but actually represent arbitrary numerics. The Python `decimal.Decimal` type should let us handle those columns gracefully, without the pitfalls of both `int` (which will truncate any decimals) and `float` (which will add decimals to ints). Annoying that we have to do this, but at least it works.
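A quick sketch of the pitfalls, using made-up values:

```python
from decimal import Decimal

# Hypothetical raw values from an iasWorld column that stores numerics as strings
raw = ["150000", "150000.25"]

# float() parses both, but rewrites whole numbers with a trailing .0
print([float(v) for v in raw])        # [150000.0, 150000.25]

# int() truncates the fractional value (and int("150000.25") raises directly,
# so you'd have to round-trip through float first)
print([int(float(v)) for v in raw])   # [150000, 150000]

# Decimal preserves each value exactly as written
print([Decimal(v) for v in raw])      # [Decimal('150000'), Decimal('150000.25')]
```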
I snuck in one quick change before merging: 37b49ad passes the …
This PR completes the first task in #661 by creating dbt models corresponding to the reference files that the Commercial team exports daily during desk review.
A subsequent PR will fix a data type bug in https://github.com/ccao-data/service-spark-iasworld that is causing slight discrepancies in values in four columns in the land details sheet. Once both PRs have landed, we will be ready to schedule a daily process on the VM.