Add scripts/export.py script for exporting flags to iasWorld #114
Conversation
# Run some data integrity checks
not_null_fields = [PIN_FIELD, SALE_KEY_FIELD, RUN_ID_FIELD]
for field in not_null_fields:
    assert flag_df[flag_df[field].isnull()].empty, f"{field} contains nulls"

assert flag_df[
    ~flag_df[OUTLIER_TYPE_FIELD].isin(OUTLIER_TYPE_CODES.values())
].empty, f"{OUTLIER_TYPE_FIELD} contains invalid codes"

assert flag_df[
    (flag_df[IS_OUTLIER_FIELD] == "Y")
    & (flag_df[OUTLIER_TYPE_FIELD] == OUTLIER_TYPE_CODES["Not outlier"])
].empty, (
    f"{OUTLIER_TYPE_FIELD} cannot be {OUTLIER_TYPE_CODES['Not outlier']} "
    f"when {IS_OUTLIER_FIELD} is Y"
)

assert flag_df[
    (flag_df[IS_OUTLIER_FIELD] == "N")
    & (flag_df[OUTLIER_TYPE_FIELD] != OUTLIER_TYPE_CODES["Not outlier"])
].empty, (
    f"{OUTLIER_TYPE_FIELD} must be {OUTLIER_TYPE_CODES['Not outlier']} "
    f"when {IS_OUTLIER_FIELD} is N"
)

assert (
    num_flags == expected_num_flags
), f"Expected {expected_num_flags} total sales, got {num_flags}"
At some point it would be nice to copy these tests over to dbt to test the iasWorld flag table, but whether or not we decide to do that, I still think it'll be useful to have some guardrails here that will warn us if we accidentally introduce a bad join that clobbers the export data in some unforeseen way.
praise: Agree we should move it over, but this is great. Would like to see more defensive coding like this elsewhere in our repos.
@jeancochrane This is great! Thanks for the quick turnaround. To answer your questions:
Should this script live here in model-sales-val, or instead in https://github.com/ccao-data/data-architecture? On the one hand, it largely references data produced by this repo, but on the other hand it represents a type of export that is currently uncommon but that we expect to become more common across our data warehouse in the future. From a code perspective, there's no reason why it needs to live in one repo or the other, but once we start doing exports like this more regularly I wonder if we'll want them all to live in one place.
IMO it should probably live in data-architecture long-term. I imagine most of our "moving data around" scripts will live there.
What sorts of flags should the script accept for filtering the flag output? And should we implement those now, or wait until we know more about our export cadence? I think it makes sense to wait, but I'm curious what you think in general. Some ideas:
Filter by one or more run IDs, so that e.g. we can generate a new run and then only export the flags it created
Filter by run date, so we can e.g. export all flags created after a certain date
Any others?
The main one I expect will be something like "export flags for sales without a flag in iasWorld" or "export flags that overwrite this existing run". But we can wait on the implementation until we know a bit more after next week.
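For illustration, a minimal sketch of what run-ID and run-date filters could look like once the cadence is settled; the argument names, the run_id/run_date column literals, and the filter_flags helper are hypothetical and not part of this PR:

import argparse

import pandas as pd


def parse_args():
    # Hypothetical CLI flags controlling which sales val flags get exported
    parser = argparse.ArgumentParser(description="Export sales val flags to iasWorld")
    parser.add_argument(
        "--run-id",
        action="append",
        dest="run_ids",
        help="Only export flags produced by this run ID (repeatable)",
    )
    parser.add_argument(
        "--min-run-date",
        help="Only export flags created on or after this date (YYYY-MM-DD)",
    )
    return parser.parse_args()


def filter_flags(flag_df, run_ids=None, min_run_date=None):
    # Column names are placeholders; in practice the script's RUN_ID_FIELD
    # constant (and an equivalent run date field) would be used instead
    if run_ids:
        flag_df = flag_df[flag_df["run_id"].isin(run_ids)]
    if min_run_date:
        flag_df = flag_df[pd.to_datetime(flag_df["run_date"]) >= pd.Timestamp(min_run_date)]
    return flag_df

A hypothetical invocation would then look like python scripts/export.py --run-id <some-run-id> --min-run-date 2024-01-01.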
Looks good to me!
That makes sense to me @dfsnow -- I'll merge the script here so we can start using it immediately and then open up an issue to move it to data-architecture.
This PR adds a new script, scripts/export.py, that allows a caller to export flags to iasWorld for upload.
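For context, a rough sketch of the overall shape such an export script might take; the Athena table name, the use of awswrangler, the placeholder field values, and the CSV output are illustrative assumptions rather than the script's actual interface:

import awswrangler as wr

# Placeholder values; the real constants are defined in scripts/export.py
PIN_FIELD = "pin"
SALE_KEY_FIELD = "sale_key"
RUN_ID_FIELD = "run_id"


def export_flags(output_path):
    # Pull the flag data from the warehouse (illustrative query and table name)
    flag_df = wr.athena.read_sql_query("SELECT * FROM sale.flag", database="sale")

    # Run data integrity checks (like the ones reviewed above) before writing
    # anything, so a bad join can't silently clobber the export
    for field in [PIN_FIELD, SALE_KEY_FIELD, RUN_ID_FIELD]:
        assert flag_df[flag_df[field].isnull()].empty, f"{field} contains nulls"

    # Write the validated flags to a file formatted for iasWorld upload
    flag_df.to_csv(output_path, index=False)


if __name__ == "__main__":
    export_flags("iasworld_flags.csv")

Because the query runs against AWS, aws-mfa may be needed first (see Testing below).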
Testing
To test, check out the code and run scripts/export.py. You may have to run aws-mfa if you haven't already today.
Questions
A couple of big-picture design questions about this code:
Should this script live here in model-sales-val, or instead in https://github.com/ccao-data/data-architecture? On the one hand, it largely references data produced by this repo, but on the other hand it represents a type of export that is currently uncommon but that we expect to become more common across our data warehouse in the future. From a code perspective, there's no reason why it needs to live in one repo or the other, but once we start doing exports like this more regularly I wonder if we'll want them all to live in one place.
What sorts of flags should the script accept for filtering the flag output? And should we implement those now, or wait until we know more about our export cadence? I think it makes sense to wait, but I'm curious what you think in general. Some ideas:
Filter by one or more run IDs, so that e.g. we can generate a new run and then only export the flags it created
Filter by run date, so we can e.g. export all flags created after a certain date
Any others?
Closes #113.