-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update chicago extraction script and workflow to enable date filtering and deduplication #14
Conversation
The script also expects three positional arguments: | ||
* start_date (str, YYYY-MM-DD): The lower bound date to use for filtering permits | ||
* end_date (str, YYYY-MM-DD): The upper bound date to use for filtering | ||
* deduplicate (bool): Whether to filter out permits that already exist in iasworld |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be more ergonomic to accept these as named args instead of positional args, but it would be more effort and more complexity, so as long as this script is only intended to be run on GitHub Actions it makes sense to me to go with the simpler option.
new_permits["amount"] = new_permits["amount"].apply( | ||
lambda x: decimal.Decimal("{:.2f}".format(x)) | ||
) | ||
new_permits["permdt"] = new_permits["permdt"].apply( | ||
lambda x: datetime.strptime(x, "%m/%d/%Y").strftime( | ||
"%Y-%m-%d %H:%M:%S.%f" | ||
)[:-5] | ||
) | ||
new_permits["note2"] = new_permits["note2"] + ",,CHICAGO, IL" | ||
new_permits["user43"] = new_permits["user43"].str.replace( | ||
"(", "" | ||
).replace(")", "") | ||
new_permits["user43"] = new_permits["user43"].str.slice(0, 261) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These transformations don't seem to be replicated in Caroline's code anywhere, but they seem necessary according to my QA. My guess is that this represents processing that is done between smartfile and ias; if that sounds right, perhaps I can check in with Will to make sure that's correct? Otherwise, we likely want to be doing this processing in the main body of the script instead of doing it as part of the deduplication step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing that's correct, but I would absolutely check with Will. Ask him more generally about the de-duping too, since it's still unclear to me what level of de-duping we want.
@@ -301,28 +367,63 @@ def save_xlsx_files(df, max_rows, file_base_name): | |||
df_review_empty_invalid.to_excel(file_name_review_empty_invalid, index=False, engine="xlsxwriter") | |||
|
|||
|
|||
if __name__ == "__main__": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of the diff that follows represents indentation changes due to the addition of the if __name__ == "__main__"
conditional block here. I'll call out semantic changes via comment.
print( | ||
f"Downloaded {len(permits)} " | ||
f"permit{'' if len(permits) == 1 else 's'} " | ||
f"between {start_date} and {end_date}" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logging is new.
start_date, end_date, deduplicate = sys.argv[1], sys.argv[2], sys.argv[3] | ||
deduplicate = deduplicate.lower() == "true" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This arg parsing is new.
if deduplicate: | ||
print( | ||
"Number of permits prior to deduplication: " | ||
f"{len(permits_shortened)}" | ||
) | ||
permits_deduped = deduplicate_permits( | ||
cursor, | ||
permits_shortened, | ||
start_date, | ||
end_date | ||
) | ||
print( | ||
"Number of permits after deduplication: " | ||
f"{len(permits_deduped)}" | ||
) | ||
else: | ||
permits_deduped = permits_shortened |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This dedupe step is new.
@@ -89,9 +93,17 @@ def expand_multi_pin_permits(df): | |||
# update pin to match formatting of iasWorld | |||
def format_pin(df): | |||
# iasWorld format doesn't include dashes | |||
df["pin_final"] = df["solo_pin"].astype(str).str.replace("-", "") | |||
df["pin_final"] = df["solo_pin"].astype("string").str.replace("-", "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The alias for the pandas string type is actually "string"
(docs); some PIN comparisons were previously failing due to differing data types, so I just went through and made sure everything is an actual string.
def pad_pin(pin): | ||
if not pd.isna(pin): | ||
if len(pin) == 10: | ||
return pin + "0000" | ||
else: | ||
return pin | ||
else: | ||
return "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The nested ternary syntax always confuses me, so I factored it out into a full function to aid in debugging. I think this is clearer to read anyway, so I left it in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't have told me this. I'm writing my next PR with triple-nested list comprehensions.
chicago_pin_universe = pd.read_csv( | ||
"chicago_pin_universe.csv", | ||
dtype={"pin": "string", "pin10": "string"} | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously we weren't specifying dtypes for the CSV, so pandas was inferring a float type for all of the PINs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jeancochrane This looks good to me! Excited to see this up and running.
def pad_pin(pin): | ||
if not pd.isna(pin): | ||
if len(pin) == 10: | ||
return pin + "0000" | ||
else: | ||
return pin | ||
else: | ||
return "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't have told me this. I'm writing my next PR with triple-nested list comprehensions.
new_permits["amount"] = new_permits["amount"].apply( | ||
lambda x: decimal.Decimal("{:.2f}".format(x)) | ||
) | ||
new_permits["permdt"] = new_permits["permdt"].apply( | ||
lambda x: datetime.strptime(x, "%m/%d/%Y").strftime( | ||
"%Y-%m-%d %H:%M:%S.%f" | ||
)[:-5] | ||
) | ||
new_permits["note2"] = new_permits["note2"] + ",,CHICAGO, IL" | ||
new_permits["user43"] = new_permits["user43"].str.replace( | ||
"(", "" | ||
).replace(")", "") | ||
new_permits["user43"] = new_permits["user43"].str.slice(0, 261) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing that's correct, but I would absolutely check with Will. Ask him more generally about the de-duping too, since it's still unclear to me what level of de-duping we want.
new_permits, | ||
existing_permits, | ||
how="left", | ||
on=list(workbook_to_iasworld_col_map.values()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checking, we're joining on basically all the fields in the workbook, and all the preprocessing is necessary to make the join work, yeah? Are there any times when the ingest into SmartFile/ias removes data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, that's right! Ingest into SmartFile/ias does sometimes remove data, but the preceding transformation steps along with the ones introduced in this PR should cover those transformations. I'll double-check with Will today to make sure there's nothing we missed.
This PR updates the
extract-chicago-permits
workflow and associated Python script to read the values passed to thestart_date
,end_date
, anddeduplicate
inputs, and use them to filter incoming permit data.Logs for a successful workflow run: https://github.com/ccao-data/extract-permits/actions/runs/7467702318/job/20321702072#step:7:24
Closes #8 and #9.