Add `scripts/export.py` script for exporting flags to iasWorld #114

jeancochrane · 2024-03-27T17:26:56Z

This PR adds a new script to the repo allowing a caller to export flags to iasWorld for upload.

Testing

To test, check out the code and run the script:

python3 -m venv venv
source venv/bin/activate
pip install -r manual_flagging/requirements.txt
python3 scripts/export.py > sales_val_flags.csv

You may have to run aws-mfa if you haven't already today.

Questions

A couple of big-picture design questions about this code:

1. Should this script live here in model-sales-val, or instead in https://github.com/ccao-data/data-architecture? On the one hand, it largely references data produced by this repo, but on the other hand it represents a type of export that is currently uncommon but that we expect to become more common across our data warehouse in the future. From a code perspective, there's no reason why it needs to live in one repo or the other, but once we start doing exports like this more regularly I wonder if we'll want them all to live in one place.
2. What sorts of flags should the script accept for filtering the flag output? And should we implement those now, or wait until we know more about our export cadence? I think it makes sense to wait, but I'm curious what you think in general. Some ideas:
   - Filter by one or more run IDs, so that e.g. we can generate a new run and then only export the flags it created
   - Filter by run date, so we can e.g. export all flags created after a certain date
   - Any others?
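
For discussion, a rough sketch of how the two filter ideas above could look with `argparse`. Nothing here exists in the script today: the flag names `--run-id` and `--min-run-date`, the helper names, and the `run_id` / `run_date` record keys are all hypothetical.

```python
import argparse
import datetime


def parse_args(argv=None):
    """Parse hypothetical filter flags for the export script."""
    parser = argparse.ArgumentParser(
        description="Export sales val flags to CSV for iasWorld upload"
    )
    # Hypothetical flag: restrict the export to one or more run IDs
    parser.add_argument(
        "--run-id",
        action="append",
        default=None,
        dest="run_ids",
        help="Only export flags created by this run (repeatable)",
    )
    # Hypothetical flag: restrict the export to runs on or after a date
    parser.add_argument(
        "--min-run-date",
        type=datetime.date.fromisoformat,
        default=None,
        help="Only export flags created on or after this date (YYYY-MM-DD)",
    )
    return parser.parse_args(argv)


def flag_matches(flag, args):
    """Return True if a flag record passes every filter that was supplied."""
    if args.run_ids and flag["run_id"] not in args.run_ids:
        return False
    if args.min_run_date and flag["run_date"] < args.min_run_date:
        return False
    return True
```

With no filters supplied, `flag_matches` lets every record through, so the default behavior would stay "export everything".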

Closes #113.

jeancochrane · 2024-03-27T20:29:29Z

scripts/export.py

+    # Run some data integrity checks
+    not_null_fields = [PIN_FIELD, SALE_KEY_FIELD, RUN_ID_FIELD]
+    for field in not_null_fields:
+        assert flag_df[flag_df[field].isnull()].empty, f"{field} contains nulls"
+
+    assert flag_df[
+        ~flag_df[OUTLIER_TYPE_FIELD].isin(OUTLIER_TYPE_CODES.values())
+    ].empty, f"{OUTLIER_TYPE_FIELD} contains invalid codes"
+
+    assert flag_df[
+        (flag_df[IS_OUTLIER_FIELD] == "Y")
+        & (flag_df[OUTLIER_TYPE_FIELD] == OUTLIER_TYPE_CODES["Not outlier"])
+    ].empty, (
+        f"{OUTLIER_TYPE_FIELD} cannot be {OUTLIER_TYPE_CODES['Not outlier']} "
+        f"when {IS_OUTLIER_FIELD} is Y"
+    )
+
+    assert flag_df[
+        (flag_df[IS_OUTLIER_FIELD] == "N")
+        & (flag_df[OUTLIER_TYPE_FIELD] != OUTLIER_TYPE_CODES["Not outlier"])
+    ].empty, (
+        f"{OUTLIER_TYPE_FIELD} must be {OUTLIER_TYPE_CODES['Not outlier']} "
+        f"when {IS_OUTLIER_FIELD} is N"
+    )
+
+    assert (
+        num_flags == expected_num_flags
+    ), f"Expected {expected_num_flags} total sales, got {num_flags}"


At some point it would be nice to copy these tests over to dbt to test the iasWorld flag table, but whether or not we decide to do that, I still think it'll be useful to have some guardrails here that will warn us if we accidentally introduce a bad join that clobbers the export data in some unforeseen way.
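
To illustrate the kind of failure the row count check is meant to catch, here is a toy example in plain Python (not the pandas code in this PR). A duplicate key on the right side of a join fans out the number of rows. The helper names and the `sale_key` / `run_id` keys are made up for the illustration.

```python
def left_join(sales, flags, key):
    """Naive left join of two lists of dicts on `key` (toy stand-in for a DataFrame merge)."""
    joined = []
    for sale in sales:
        matches = [flag for flag in flags if flag[key] == sale[key]]
        if not matches:
            # Keep unmatched sales, as a left join would
            joined.append(dict(sale))
        for flag in matches:
            # One output row per matching flag, so duplicate keys multiply rows
            joined.append({**sale, **flag})
    return joined


def check_row_count(rows, expected):
    """Fail loudly if the join changed the number of rows."""
    assert len(rows) == expected, f"Expected {expected} total sales, got {len(rows)}"
    return rows
```

Joining two sales against flags where one sale key appears twice yields three rows, and `check_row_count(rows, 2)` raises an `AssertionError` instead of silently exporting the extra row.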

dfsnow

@jeancochrane This is great! Thanks for the quick turnaround. To answer your questions:

> Should this script live here in model-sales-val, or instead in https://github.com/ccao-data/data-architecture? On the one hand, it largely references data produced by this repo, but on the other hand it represents a type of export that is currently uncommon but that we expect to become more common across our data warehouse in the future. From a code perspective, there's no reason why it needs to live in one repo or the other, but once we start doing exports like this more regularly I wonder if we'll want them all to live in one place.

IMO it should probably live in data-architecture long-term. I imagine most of our "moving data around" scripts will live there.

> What sorts of flags should the script accept for filtering the flag output? And should we implement those now, or wait until we know more about our export cadence? I think it makes sense to wait, but I'm curious what you think in general. Some ideas:
> - Filter by one or more run IDs, so that e.g. we can generate a new run and then only export the flags it created
> - Filter by run date, so we can e.g. export all flags created after a certain date
> - Any others?

The main one I expect will be something like "export flags for sales without a flag in iasWorld" or "export flags that overwrite this existing run". But we can wait on the implementation until we know a bit more after next week.
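
For the first of those, a minimal sketch of the filter as an anti-join, assuming we can get the set of sale keys that already have a flag in iasWorld. The function name and the `sale_key` record key are hypothetical, not part of the script.

```python
def flags_missing_from_iasworld(flags, existing_sale_keys):
    """Keep only flags whose sale has no flag recorded in iasWorld yet."""
    # Build a set once so each membership check is constant time
    existing = set(existing_sale_keys)
    return [flag for flag in flags if flag["sale_key"] not in existing]
```

The "overwrite this existing run" variant would presumably be the inverse: keep only flags whose sale key matches a flag from the run being replaced.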

dfsnow · 2024-03-28T15:00:31Z

scripts/export.py

praise: Agree we should move it over, but this is great. Would like to see more defensive coding like this elsewhere in our repos.

wagnerlmichael

Looks good to me!

jeancochrane · 2024-03-28T15:42:10Z

> IMO it should probably live in data-architecture long-term. I imagine most of our "moving data around" scripts will live there.

That makes sense to me @dfsnow -- I'll merge the script here so we can start using it immediately and then open up an issue to move it to data-architecture.

jeancochrane added 2 commits March 27, 2024 12:18

Add scripts/export.py script for exporting flags to iasWorld

d3eeabd

Update README for export instructions

61ad659

jeancochrane linked an issue Mar 27, 2024 that may be closed by this pull request

Export flags to CSV for iasWorld upload #113

Closed

jeancochrane added 4 commits March 27, 2024 13:01

Appease pre-commit

bf517d9

Update README instructions for scrpits/export.py

f1bb6ee

Add logging and check output length in scripts/export.py

b033652

Slightly faster logging in scripts/export.py

49d537e

Updates to script/export.py following sync meeting

d324502

jeancochrane marked this pull request as ready for review March 27, 2024 20:31

jeancochrane requested review from dfsnow and wagnerlmichael March 27, 2024 20:31

dfsnow approved these changes Mar 28, 2024

wagnerlmichael approved these changes Mar 28, 2024

jeancochrane merged commit bd67567 into main Mar 28, 2024
2 checks passed

jeancochrane deleted the jeancochrane/113-export-flags-to-csv-for-iasworld-upload branch March 28, 2024 15:42

This was referenced Mar 28, 2024

Move sales val export script from model-sales-val to this repo ccao-data/data-architecture#364

Open

Flip outlier indicator and use longer value fields in scripts/export.csv #115

Closed
