Update chicago extraction script and workflow to enable date filtering and deduplication #14

jeancochrane · 2024-01-09T20:39:57Z

This PR updates the extract-chicago-permits workflow and associated Python script to read the values passed to the start_date, end_date, and deduplicate inputs, and use them to filter incoming permit data.

Logs for a successful workflow run: https://github.com/ccao-data/extract-permits/actions/runs/7467702318/job/20321702072#step:7:24

Closes #8 and #9.
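
As a rough illustration of the date filtering described in the summary above, the sketch below applies an inclusive date range to an in-memory DataFrame. The function name `filter_by_date`, the `issue_date` column, and the sample rows are all assumptions made for this example; the actual script may filter at download time instead.

```python
import pandas as pd


def filter_by_date(permits, start_date, end_date, date_col="issue_date"):
    """Keep rows whose date falls within [start_date, end_date], inclusive.

    Hypothetical helper: the name and the date_col default are not from the PR.
    """
    dates = pd.to_datetime(permits[date_col])
    in_range = (dates >= pd.Timestamp(start_date)) & (dates <= pd.Timestamp(end_date))
    return permits[in_range]


# Invented sample rows for illustration only
permits = pd.DataFrame(
    {
        "permit_id": ["A", "B", "C"],
        "issue_date": ["2023-12-31", "2024-01-05", "2024-02-01"],
    }
)

filtered = filter_by_date(permits, "2024-01-01", "2024-01-31")
print(filtered["permit_id"].tolist())
```

Only permit `B` falls inside the January range, so it is the only row kept.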

jeancochrane · 2024-01-09T21:10:11Z

chicago/permit_cleaning.py

+The script also expects three positional arguments:
+    * start_date (str, YYYY-MM-DD): The lower bound date to use for filtering permits
+    * end_date (str, YYYY-MM-DD): The upper bound date to use for filtering
+    * deduplicate (bool): Whether to filter out permits that already exist in iasworld


Named args would be more ergonomic than positional args, but they would take more effort and add complexity. As long as this script is only intended to run on GitHub Actions, it makes sense to me to go with the simpler option.
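
For comparison, here is a minimal sketch of what the named-argument alternative could look like with `argparse`. The flag names (`--start-date`, `--end-date`, `--deduplicate`) and the `parse_bool` helper are assumptions for illustration; the PR itself keeps the three positional arguments.

```python
import argparse


def parse_bool(value):
    # Mirrors the script's string comparison: only "true" (any case) is truthy
    return value.lower() == "true"


def build_parser():
    # Hypothetical parser; flag names are not taken from the PR
    parser = argparse.ArgumentParser(
        description="Clean Chicago permit data (illustrative sketch)"
    )
    parser.add_argument(
        "--start-date", required=True, help="Lower bound date, YYYY-MM-DD"
    )
    parser.add_argument(
        "--end-date", required=True, help="Upper bound date, YYYY-MM-DD"
    )
    parser.add_argument(
        "--deduplicate",
        type=parse_bool,
        default=False,
        help="Whether to filter out permits that already exist in iasworld",
    )
    return parser


# Parse an explicit argument list so the sketch runs without a real command line
args = build_parser().parse_args(
    ["--start-date", "2024-01-01", "--end-date", "2024-01-31", "--deduplicate", "true"]
)
print(args.start_date, args.end_date, args.deduplicate)
```

The trade-off matches the comment above: this is easier to call by hand, but it is more code to maintain than reading three values from `sys.argv`.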

jeancochrane · 2024-01-09T21:12:54Z

chicago/permit_cleaning.py

+    new_permits["amount"] = new_permits["amount"].apply(
+        lambda x: decimal.Decimal("{:.2f}".format(x))
+    )
+    new_permits["permdt"] = new_permits["permdt"].apply(
+        lambda x: datetime.strptime(x, "%m/%d/%Y").strftime(
+            "%Y-%m-%d %H:%M:%S.%f"
+        )[:-5]
+    )
+    new_permits["note2"] = new_permits["note2"] + ",,CHICAGO, IL"
+    new_permits["user43"] = new_permits["user43"].str.replace(
+        "(", ""
+    ).replace(")", "")
+    new_permits["user43"] = new_permits["user43"].str.slice(0, 261)


These transformations don't appear anywhere in Caroline's code, but my QA suggests they are necessary. My guess is that they represent processing done between SmartFile and ias; if that sounds right, should I check in with Will to confirm? Otherwise, we likely want to do this processing in the main body of the script rather than as part of the deduplication step.
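
To make the effect of these transformations concrete, here is a small sketch that applies them to one invented row. The column names come from the diff; the sample values are made up. One deliberate difference, which is an assumption about intent: the sketch uses `.str.replace(..., regex=False)` for both parentheses, whereas the second call in the diff is `Series.replace`, which matches whole cell values rather than substrings.

```python
import decimal
from datetime import datetime

import pandas as pd

# Invented sample row; only the column names come from the diff
new_permits = pd.DataFrame(
    {
        "amount": [1500.0],
        "permdt": ["01/09/2024"],
        "note2": ["123 N MAIN ST"],
        "user43": ["REPLACE ROOF (GARAGE)"],
    }
)

# Format the amount as a two-decimal Decimal
new_permits["amount"] = new_permits["amount"].apply(
    lambda x: decimal.Decimal("{:.2f}".format(x))
)

# Reformat MM/DD/YYYY as a timestamp string with one fractional digit
new_permits["permdt"] = new_permits["permdt"].apply(
    lambda x: datetime.strptime(x, "%m/%d/%Y").strftime("%Y-%m-%d %H:%M:%S.%f")[:-5]
)

# Append the fixed city/state suffix
new_permits["note2"] = new_permits["note2"] + ",,CHICAGO, IL"

# Strip literal parentheses (assumed intent), then truncate to 261 characters
new_permits["user43"] = (
    new_permits["user43"]
    .str.replace("(", "", regex=False)
    .str.replace(")", "", regex=False)
    .str.slice(0, 261)
)

print(new_permits.iloc[0].to_dict())
```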

jeancochrane · 2024-01-09T21:13:29Z

chicago/permit_cleaning.py

@@ -301,28 +367,63 @@ def save_xlsx_files(df, max_rows, file_base_name):
    df_review_empty_invalid.to_excel(file_name_review_empty_invalid, index=False, engine="xlsxwriter")


+if __name__ == "__main__":


Most of the diff that follows represents indentation changes due to the addition of the if __name__ == "__main__" conditional block here. I'll call out semantic changes via comment.

jeancochrane · 2024-01-09T21:14:08Z

chicago/permit_cleaning.py

+    print(
+        f"Downloaded {len(permits)} "
+        f"permit{'' if len(permits) == 1 else 's'} "
+        f"between {start_date} and {end_date}"
+    )


This logging is new.

jeancochrane · 2024-01-09T21:14:18Z

chicago/permit_cleaning.py

+    start_date, end_date, deduplicate = sys.argv[1], sys.argv[2], sys.argv[3]
+    deduplicate = deduplicate.lower() == "true"


This arg parsing is new.

jeancochrane · 2024-01-09T21:14:42Z

chicago/permit_cleaning.py

+    if deduplicate:
+        print(
+            "Number of permits prior to deduplication: "
+            f"{len(permits_shortened)}"
+        )
+        permits_deduped = deduplicate_permits(
+            cursor,
+            permits_shortened,
+            start_date,
+            end_date
+        )
+        print(
+            "Number of permits after deduplication: "
+            f"{len(permits_deduped)}"
+        )
+    else:
+        permits_deduped = permits_shortened


This dedupe step is new.

jeancochrane · 2024-01-09T22:31:22Z

chicago/permit_cleaning.py

@@ -89,9 +93,17 @@ def expand_multi_pin_permits(df):
 # update pin to match formatting of iasWorld
 def format_pin(df):
    # iasWorld format doesn't include dashes
-    df["pin_final"] = df["solo_pin"].astype(str).str.replace("-", "")
+    df["pin_final"] = df["solo_pin"].astype("string").str.replace("-", "")


The alias for the pandas string type is actually "string" (per the pandas docs). Some PIN comparisons were previously failing due to differing data types, so I went through and made sure everything is an actual string.
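
A minimal sketch of the behavior in question, using invented PIN values: converting with `astype("string")` yields the pandas string dtype and keeps missing values missing, and a numeric PIN only compares equal to a string PIN once it is converted.

```python
import pandas as pd

# Invented PINs; the second value is missing
solo_pin = pd.Series(["12-34-567-890-0000", None])

# Same conversion as the updated line in the diff
pin_final = solo_pin.astype("string").str.replace("-", "")
print(pin_final.dtype)
print(pin_final.tolist())

# A numeric PIN never equals a string PIN until the types agree
numeric_pin = pd.Series([12345678900000])
mismatched = (numeric_pin == "12345678900000").tolist()
matched = (numeric_pin.astype("string") == "12345678900000").tolist()
print(mismatched, matched)
```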

jeancochrane · 2024-01-09T22:32:20Z

chicago/permit_cleaning.py

+    def pad_pin(pin):
+        if not pd.isna(pin):
+            if len(pin) == 10:
+                return pin + "0000"
+            else:
+                return pin
+        else:
+            return ""


The nested ternary syntax always confuses me, so I factored it out into a full function to aid in debugging. I think this is clearer to read anyway, so I left it in.
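
For reference, a short sketch of how `pad_pin` behaves on a few invented values. The `pad_pin_ternary` variant is a hypothetical reconstruction of the nested conditional expression mentioned above, included only to show the two forms agree; the original one-liner is not shown in this thread.

```python
import pandas as pd


def pad_pin(pin):
    # Same logic as the diff: pad 10-digit PINs to 14 digits, map missing to ""
    if not pd.isna(pin):
        if len(pin) == 10:
            return pin + "0000"
        else:
            return pin
    else:
        return ""


def pad_pin_ternary(pin):
    # Hypothetical nested-ternary equivalent, for comparison only
    return "" if pd.isna(pin) else (pin + "0000" if len(pin) == 10 else pin)


# Invented values: a 10-digit PIN, a 14-digit PIN, and a missing value
pins = pd.Series(["1234567890", "12345678901234", None])
padded = pins.apply(pad_pin)
print(padded.tolist())
```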

jeancochrane · 2024-01-09T22:33:01Z

chicago/permit_cleaning.py

+        chicago_pin_universe = pd.read_csv(
+            "chicago_pin_universe.csv",
+            dtype={"pin": "string", "pin10": "string"}
+        )


Previously we weren't specifying dtypes for the CSV, so pandas was inferring a float type for all of the PINs.
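
The inference problem can be reproduced with a tiny in-memory CSV. The sample values below are invented; only the column names and the `dtype` mapping come from the diff.

```python
import io

import pandas as pd

# Invented CSV with leading zeros and one missing pin
csv_text = "pin,pin10\n01234567890000,0123456789\n,9876543210\n"

# Without dtypes, pandas infers numeric columns and drops the leading zeros
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred.dtypes.to_dict())

# With explicit string dtypes, PINs are read exactly as written
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"pin": "string", "pin10": "string"},
)
print(typed["pin"].tolist(), typed["pin10"].tolist())
```

The missing value is what pushes the inferred `pin` column to a float type, since an integer column cannot hold a missing value by default.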

dfsnow

@jeancochrane This looks good to me! Excited to see this up and running.

dfsnow · 2024-01-10T20:43:04Z

chicago/permit_cleaning.py

+    def pad_pin(pin):
+        if not pd.isna(pin):
+            if len(pin) == 10:
+                return pin + "0000"
+            else:
+                return pin
+        else:
+            return ""


Shouldn't have told me this. I'm writing my next PR with triple-nested list comprehensions.

dfsnow · 2024-01-10T20:55:20Z

chicago/permit_cleaning.py

+    new_permits["amount"] = new_permits["amount"].apply(
+        lambda x: decimal.Decimal("{:.2f}".format(x))
+    )
+    new_permits["permdt"] = new_permits["permdt"].apply(
+        lambda x: datetime.strptime(x, "%m/%d/%Y").strftime(
+            "%Y-%m-%d %H:%M:%S.%f"
+        )[:-5]
+    )
+    new_permits["note2"] = new_permits["note2"] + ",,CHICAGO, IL"
+    new_permits["user43"] = new_permits["user43"].str.replace(
+        "(", ""
+    ).replace(")", "")
+    new_permits["user43"] = new_permits["user43"].str.slice(0, 261)


I'm guessing that's correct, but I would absolutely check with Will. Ask him more generally about the de-duping too, since it's still unclear to me what level of de-duping we want.

dfsnow · 2024-01-10T20:57:38Z

chicago/permit_cleaning.py

+        new_permits,
+        existing_permits,
+        how="left",
+        on=list(workbook_to_iasworld_col_map.values()),


Just checking, we're joining on basically all the fields in the workbook, and all the preprocessing is necessary to make the join work, yeah? Are there any times when the ingest into SmartFile/ias removes data?

Yup, that's right! Ingest into SmartFile/ias does sometimes remove data, but the preceding transformation steps along with the ones introduced in this PR should cover those transformations. I'll double-check with Will today to make sure there's nothing we missed.
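
One common way to turn a left join like this into a deduplication step is an anti-join using the merge indicator. The sketch below is an assumption about the general technique, not the body of `deduplicate_permits`: `join_cols` is a hypothetical stand-in for `list(workbook_to_iasworld_col_map.values())`, and the rows are invented.

```python
import pandas as pd

# Hypothetical stand-in for list(workbook_to_iasworld_col_map.values())
join_cols = ["pin", "permdt", "amount"]

# Invented rows; every join column is stored as a string so dtypes match
new_permits = pd.DataFrame(
    {
        "pin": ["12345678900000", "98765432100000"],
        "permdt": ["2024-01-05 00:00:00.0", "2024-01-06 00:00:00.0"],
        "amount": ["1500.00", "250.00"],
    }
)
existing_permits = pd.DataFrame(
    {
        "pin": ["12345678900000"],
        "permdt": ["2024-01-05 00:00:00.0"],
        "amount": ["1500.00"],
    }
)

# Left join with an indicator column, then keep rows found only on the left.
# Dropping duplicates on the right side avoids multiplying matched rows.
merged = new_permits.merge(
    existing_permits.drop_duplicates(subset=join_cols),
    how="left",
    on=join_cols,
    indicator=True,
)
permits_deduped = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
print(permits_deduped["pin"].tolist())
```

Because the join requires an exact match on every column, any formatting difference between the two sides (dtype, date format, trailing text) makes a permit look new, which is why the preprocessing discussed above matters.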

jeancochrane added 3 commits January 8, 2024 15:00

Update extract-chicago-permits to use start_date and end_date inputs

42ad62e

Sketch out incorporation of deduplicate flag

c5f382f

Flesh out deduplicate flag

8617027

jeancochrane linked an issue Jan 9, 2024 that may be closed by this pull request

Update Chicago extraction script and workflow to enable input arguments #9

Closed

Clean up docstring in chicago/permit_cleaning.py

968e7ea

jeancochrane changed the title ~~Update chicago extraction script and workflow to enable input arguments~~ Update chicago extraction script and workflow to enable date filtering and deduplication Jan 9, 2024

Fix string type comparisons

5b3168a

jeancochrane marked this pull request as ready for review January 9, 2024 23:21

jeancochrane requested a review from dfsnow January 9, 2024 23:21

dfsnow approved these changes Jan 10, 2024

jeancochrane merged commit 1551920 into main Jan 16, 2024
1 check passed

jeancochrane deleted the jeancochrane/9-update-chicago-extraction-script-and-workflow-to-enable-input-arguments branch January 16, 2024 15:38
