Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update chicago extraction script and workflow to enable date filtering and deduplication #14

Conversation

jeancochrane
Copy link
Collaborator

@jeancochrane jeancochrane commented Jan 9, 2024

This PR updates the extract-chicago-permits workflow and associated Python script to read the values passed to the start_date, end_date, and deduplicate inputs, and use them to filter incoming permit data.

Logs for a successful workflow run: https://github.com/ccao-data/extract-permits/actions/runs/7467702318/job/20321702072#step:7:24

Closes #8 and #9.

Comment on lines +15 to +18
The script also expects three positional arguments:
* start_date (str, YYYY-MM-DD): The lower bound date to use for filtering permits
* end_date (str, YYYY-MM-DD): The upper bound date to use for filtering
* deduplicate (bool): Whether to filter out permits that already exist in iasworld
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be more ergonomic to accept these as named args instead of positional args, but it would be more effort and more complexity, so as long as this script is only intended to be run on GitHub Actions it makes sense to me to go with the simpler option.

Comment on lines +278 to +290
new_permits["amount"] = new_permits["amount"].apply(
lambda x: decimal.Decimal("{:.2f}".format(x))
)
new_permits["permdt"] = new_permits["permdt"].apply(
lambda x: datetime.strptime(x, "%m/%d/%Y").strftime(
"%Y-%m-%d %H:%M:%S.%f"
)[:-5]
)
new_permits["note2"] = new_permits["note2"] + ",,CHICAGO, IL"
new_permits["user43"] = new_permits["user43"].str.replace(
"(", ""
).replace(")", "")
new_permits["user43"] = new_permits["user43"].str.slice(0, 261)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These transformations don't seem to be replicated in Caroline's code anywhere, but they seem necessary according to my QA. My guess is that this represents processing that is done between smartfile and ias; if that sounds right, perhaps I can check in with Will to make sure that's correct? Otherwise, we likely want to be doing this processing in the main body of the script instead of doing it as part of the deduplication step.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing that's correct, but I would absolutely check with Will. Ask him more generally about the de-duping too, since it's still unclear to me what level of de-duping we want.

@@ -301,28 +367,63 @@ def save_xlsx_files(df, max_rows, file_base_name):
df_review_empty_invalid.to_excel(file_name_review_empty_invalid, index=False, engine="xlsxwriter")


if __name__ == "__main__":
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the diff that follows represents indentation changes due to the addition of the if __name__ == "__main__" conditional block here. I'll call out semantic changes via comment.

Comment on lines +393 to +397
print(
f"Downloaded {len(permits)} "
f"permit{'' if len(permits) == 1 else 's'} "
f"between {start_date} and {end_date}"
)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logging is new.

Comment on lines +389 to +390
start_date, end_date, deduplicate = sys.argv[1], sys.argv[2], sys.argv[3]
deduplicate = deduplicate.lower() == "true"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This arg parsing is new.

Comment on lines +409 to +425
if deduplicate:
print(
"Number of permits prior to deduplication: "
f"{len(permits_shortened)}"
)
permits_deduped = deduplicate_permits(
cursor,
permits_shortened,
start_date,
end_date
)
print(
"Number of permits after deduplication: "
f"{len(permits_deduped)}"
)
else:
permits_deduped = permits_shortened
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This dedupe step is new.

@jeancochrane jeancochrane changed the title Update chicago extraction script and workflow to enable input arguments Update chicago extraction script and workflow to enable date filtering and deduplication Jan 9, 2024
@@ -89,9 +93,17 @@ def expand_multi_pin_permits(df):
# update pin to match formatting of iasWorld
def format_pin(df):
# iasWorld format doesn't include dashes
df["pin_final"] = df["solo_pin"].astype(str).str.replace("-", "")
df["pin_final"] = df["solo_pin"].astype("string").str.replace("-", "")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alias for the pandas string type is actually "string" (docs); some PIN comparisons were previously failing due to differing data types, so I just went through and made sure everything is an actual string.

Comment on lines +98 to +105
def pad_pin(pin):
if not pd.isna(pin):
if len(pin) == 10:
return pin + "0000"
else:
return pin
else:
return ""
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nested ternary syntax always confuses me, so I factored it out into a full function to aid in debugging. I think this is clearer to read anyway, so I left it in.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't have told me this. I'm writing my next PR with triple-nested list comprehensions.

Comment on lines +392 to +395
chicago_pin_universe = pd.read_csv(
"chicago_pin_universe.csv",
dtype={"pin": "string", "pin10": "string"}
)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously we weren't specifying dtypes for the CSV, so pandas was inferring a float type for all of the PINs.

@jeancochrane jeancochrane marked this pull request as ready for review January 9, 2024 23:21
@jeancochrane jeancochrane requested a review from dfsnow January 9, 2024 23:21
Copy link
Member

@dfsnow dfsnow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeancochrane This looks good to me! Excited to see this up and running.

Comment on lines +98 to +105
def pad_pin(pin):
if not pd.isna(pin):
if len(pin) == 10:
return pin + "0000"
else:
return pin
else:
return ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't have told me this. I'm writing my next PR with triple-nested list comprehensions.

Comment on lines +278 to +290
new_permits["amount"] = new_permits["amount"].apply(
lambda x: decimal.Decimal("{:.2f}".format(x))
)
new_permits["permdt"] = new_permits["permdt"].apply(
lambda x: datetime.strptime(x, "%m/%d/%Y").strftime(
"%Y-%m-%d %H:%M:%S.%f"
)[:-5]
)
new_permits["note2"] = new_permits["note2"] + ",,CHICAGO, IL"
new_permits["user43"] = new_permits["user43"].str.replace(
"(", ""
).replace(")", "")
new_permits["user43"] = new_permits["user43"].str.slice(0, 261)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing that's correct, but I would absolutely check with Will. Ask him more generally about the de-duping too, since it's still unclear to me what level of de-duping we want.

new_permits,
existing_permits,
how="left",
on=list(workbook_to_iasworld_col_map.values()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking, we're joining on basically all the fields in the workbook, and all the preprocessing is necessary to make the join work, yeah? Are there any times when the ingest into SmartFile/ias removes data?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, that's right! Ingest into SmartFile/ias does sometimes remove data, but the preceding transformation steps along with the ones introduced in this PR should cover those transformations. I'll double-check with Will today to make sure there's nothing we missed.

@jeancochrane jeancochrane merged commit 1551920 into main Jan 16, 2024
1 check passed
@jeancochrane jeancochrane deleted the jeancochrane/9-update-chicago-extraction-script-and-workflow-to-enable-input-arguments branch January 16, 2024 15:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants