diff --git a/.github/ISSUE_TEMPLATE/release-database.md b/.github/ISSUE_TEMPLATE/release-database.md index 6fbeaee..055c32c 100644 --- a/.github/ISSUE_TEMPLATE/release-database.md +++ b/.github/ISSUE_TEMPLATE/release-database.md @@ -20,6 +20,7 @@ labels: release - [ ] Using the command line, grab the final compressed database file from the temporary directory (found at `db_path` after running `data-raw/create_db.R`) and move it to the project directory. Rename the file `ptaxsim-...db.bz2` - [ ] Decompress the database file for local testing using `pbzip2`. The typical command will be something like `pbzip2 -d -k ptaxsim-2021.0.2.db.bz2` - [ ] Rename the decompressed local database file to `ptaxsim.db` for local testing. This is the file name that the unit tests and vignettes expect +- [ ] Use [sqldiff](https://www.sqlite.org/sqldiff.html) or a similar tool to compare the new database file to the previous version. Ensure that the changes are expected - [ ] Restart R. Then run the unit tests (`devtools::test()` in the console) and vignettes (`pkgdown::build_site()` in the console) locally - [ ] Knit the `README.Rmd` file to update the database link at the top of the README. 
The link is pulled from the `ptaxsim.db` file's `metadata` table - [ ] If necessary, update the database diagrams in the README with any new fields or tables diff --git a/DESCRIPTION b/DESCRIPTION index 8a9cae4..3b5f901 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -22,7 +22,7 @@ Imports: glue, RSQLite, utils -RoxygenNote: 7.2.3 +RoxygenNote: 7.3.0 Suggests: arrow, covr, @@ -36,8 +36,10 @@ Suggests: httr, knitr, lintr, + noctua, odbc, openxlsx, + pdftools, pkgdown, prettymapr, purrr, @@ -60,4 +62,4 @@ Remotes: paleolimbot/geoarrow, ropensci/tabulizer Config/Requires_DB_Version: 2021.0.4 -Config/Wants_DB_Version: 2021.0.4 +Config/Wants_DB_Version: 2022.0.0 diff --git a/R/tax_bill.R b/R/tax_bill.R index 6d6f2b2..5ebca11 100644 --- a/R/tax_bill.R +++ b/R/tax_bill.R @@ -193,7 +193,7 @@ tax_bill <- function(year_vec, # Calculate the exemption effect by subtracting the exempt amount from # the total taxable EAV - dt[, agency_tax_rate := agency_total_ext / agency_total_eav] + dt[, agency_tax_rate := agency_total_ext / as.numeric(agency_total_eav)] dt[, tax_amt_exe := exe_total * agency_tax_rate] dt[, tax_amt_pre_exe := round(eav * agency_tax_rate, 2)] dt[, tax_amt_post_exe := round(tax_amt_pre_exe - tax_amt_exe, 2)] diff --git a/README.Rmd b/README.Rmd index 695d32f..b9509ad 100644 --- a/README.Rmd +++ b/README.Rmd @@ -555,9 +555,9 @@ erDiagram ## Notes and caveats -- Currently, the per-district tax calculations for properties in the Red-Purple Modernization (RPM) TIF are slightly flawed. However, the total tax bill per PIN is still accurate. See issue [#11](#11) for more information. -- Special Service Area (SSA) rates must be calculated manually when creating counterfactual bills. See issue [#31](#31) for more information. -- In rare instances, a TIF can have multiple `agency_num` identifiers (usually there's only one per TIF). The `tif_crosswalk` table determines what the "main" `agency_num` is for each TIF and pulls the name and TIF information using that identifier. 
See issue [#39](#39) for more information. +- Currently, the per-district tax calculations for properties in the Red-Purple Modernization (RPM) TIF are slightly flawed. However, the total tax bill per PIN is still accurate. See issue [#4](https://github.com/ccao-data/ptaxsim/issues/4) for more information. +- Special Service Area (SSA) rates must be calculated manually when creating counterfactual bills. See issue [#3](https://github.com/ccao-data/ptaxsim/issues/3) for more information. +- In rare instances, a TIF can have multiple `agency_num` identifiers (usually there's only one per TIF). The `tif_crosswalk` table determines what the "main" `agency_num` is for each TIF and pulls the name and TIF information using that identifier. See issue [GitLab #39](https://gitlab.com/ccao-data-science---modeling/packages/ptaxsim/-/issues/39) for more information. - PTAXSIM is relatively memory-efficient and can calculate every district line-item for every tax bill for the last 15 years (roughly 350 million rows). However, the memory required for this calculation is substantial (around 100 GB). - PTAXSIM's accuracy is measured automatically with an [integration test](tests/testthat/test-accuracy.R). The test takes a random sample of 1 million PINs, calculates the total bill for each PIN, and compares it to the real total bill. - This repository contains an edited version of PTAXSIM's commit history. Historical Git LFS and other data files (.csv, .xlsx, etc.) were removed in the transition to GitHub. The most current version of these files is available starting in commit [1f06639](https://github.com/ccao-data/ptaxsim/commit/1f06639d98a720999222579b7ff61bcce061f1ec). If you need the historical LFS files for any reason, please visit the [GitLab archive](https://gitlab.com/ccao-data-science---modeling/packages/ptaxsim) of this repository. 
diff --git a/README.md b/README.md index 954da5a..22fd412 100644 --- a/README.md +++ b/README.md @@ -37,13 +37,13 @@ Table of Contents > installation](#database-installation) for details. > > [**Link to PTAXSIM -> database**](https://ccao-data-public-us-east-1.s3.amazonaws.com/ptaxsim/ptaxsim-2021.0.4.db.bz2) -> (DB version: 2021.0.4; Last updated: 2023-04-28 23:40:05) +> database**](https://ccao-data-public-us-east-1.s3.amazonaws.com/ptaxsim/ptaxsim-2022.0.0.db.bz2) +> (DB version: 2022.0.0; Last updated: 2024-01-19 04:40:35) PTAXSIM is an R package/database to approximate Cook County property tax bills. It uses real assessment, exemption, TIF, and levy data to generate historic, line-item tax bills (broken out by taxing district) -for any property from 2006 to 2021. Given some careful assumptions and +for any property from 2006 to 2022. Given some careful assumptions and data manipulation, it can also provide hypothetical, but factually grounded, answers to questions such as: @@ -173,9 +173,9 @@ database: 1. Download the compressed database file from the CCAO’s public S3 bucket. [Link - here](https://ccao-data-public-us-east-1.s3.amazonaws.com/ptaxsim/ptaxsim-2021.0.4.db.bz2). + here](https://ccao-data-public-us-east-1.s3.amazonaws.com/ptaxsim/ptaxsim-2022.0.0.db.bz2). 2. (Optional) Rename the downloaded database file by removing the - version number, i.e. ptaxsim-2021.0.4.db.bz2 becomes + version number, i.e. ptaxsim-2022.0.0.db.bz2 becomes `ptaxsim.db.bz2`. 3. Decompress the downloaded database file. The file is compressed using [bzip2](https://sourceware.org/bzip2/). @@ -863,15 +863,18 @@ erDiagram - Currently, the per-district tax calculations for properties in the Red-Purple Modernization (RPM) TIF are slightly flawed. However, the - total tax bill per PIN is still accurate. See issue [\#4](https://github.com/ccao-data/ptaxsim/issues/4) for - more information. + total tax bill per PIN is still accurate. 
See issue + [\#4](https://github.com/ccao-data/ptaxsim/issues/4) for more + information. - Special Service Area (SSA) rates must be calculated manually when - creating counterfactual bills. See issue [\#3](https://github.com/ccao-data/ptaxsim/issues/3) for more + creating counterfactual bills. See issue + [\#3](https://github.com/ccao-data/ptaxsim/issues/3) for more information. - In rare instances, a TIF can have multiple `agency_num` identifiers (usually there’s only one per TIF). The `tif_crosswalk` table determines what the “main” `agency_num` is for each TIF and pulls the - name and TIF information using that identifier. See archived issue [\#39](https://gitlab.com/ccao-data-science---modeling/packages/ptaxsim/-/issues/39) + name and TIF information using that identifier. See issue [GitLab + \#39](https://gitlab.com/ccao-data-science---modeling/packages/ptaxsim/-/issues/39) for more information. - PTAXSIM is relatively memory-efficient and can calculate every district line-item for every tax bill for the last 15 years (roughly diff --git a/data-raw/agency/Agency Rate Report 2022.xlsx b/data-raw/agency/Agency Rate Report 2022.xlsx new file mode 100644 index 0000000..1b7a766 --- /dev/null +++ b/data-raw/agency/Agency Rate Report 2022.xlsx @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:b5eacae6ce69cf64f7903ffbe248a198422c1905614d0e14fa20467a867fe6a3 +size 882953 diff --git a/data-raw/agency/agency.R b/data-raw/agency/agency.R index 0a00dc6..45d5ff7 100644 --- a/data-raw/agency/agency.R +++ b/data-raw/agency/agency.R @@ -45,7 +45,6 @@ file_names <- list.files( - # agency_fund ------------------------------------------------------------------ # Load the detail sheet from each agency file. 
This includes the levy and rate @@ -64,7 +63,7 @@ agency_fund <- map_dfr(file_names, function(file) { "loss", "loss_percent", "fund_loss" ))) %>% rename_with(~"levy_plus_loss", any_of(c( - "levy_and_loss", "fund_levy_plus_loss" + "levy_and_loss", "fund_levy_plus_loss", "levy_loss" ))) %>% rename_with(~"rate_ceiling", any_of(c( "ceiling", "rate_ceiling", "fund_rate_ceiling" @@ -189,7 +188,7 @@ arrow::write_parquet( # EAV, final extension, and much more agency <- map_dfr(file_names, function(file) { message("Reading: ", file) - readxl::read_xlsx(file) %>% + readxl::read_xlsx(file, sheet = 1) %>% set_names(snakecase::to_snake_case(names(.))) %>% mutate( across( @@ -235,9 +234,12 @@ agency <- map_dfr(file_names, function(file) { "reduction_percent", "reduction_factor", "clerk_reduction_factor" ))) %>% rename_with(~"total_non_cap_ext", any_of(c( - "total_non_cap_ext", "final_non_cap_ext" + "total_non_cap_ext", "final_non_cap_ext", "total_non_cap_extension" + ))) %>% + rename_with(~"total_ext", any_of(c( + "total_ext", "final_ext", + "grand_total_ext" ))) %>% - rename_with(~"total_ext", any_of(c("total_ext", "final_ext"))) %>% # Select, order, and rename columns select( year, @@ -281,7 +283,7 @@ agency <- map_dfr(file_names, function(file) { 0, cty_cook_eav ), - across(starts_with("cty_"), replace_na, 0), + across(starts_with("cty_"), ~ replace_na(.x, 0)), # Make all percentages decimals across( c(pct_burden, reduction_pct), @@ -296,20 +298,20 @@ agency <- map_dfr(file_names, function(file) { arrange(year, agency_num) %>% # Coerce columns to expected types mutate( - across(c(year), as.character), + across(c(year), ~ as.character(.x)), across( c( lim_numerator, lim_denominator, prior_eav:cty_total_eav, total_levy, total_max_levy, total_reduced_levy, total_final_levy ), - as.integer64 + ~ as.integer64(.x) ), across( c( lim_rate, pct_burden, total_prelim_rate, total_final_rate, reduction_pct, total_non_cap_ext, total_ext ), - as.double + ~ as.double(.x) ) ) diff --git 
a/data-raw/agency/tif_agency_names.csv b/data-raw/agency/tif_agency_names.csv index 8a8edae..74f6c15 100644 --- a/data-raw/agency/tif_agency_names.csv +++ b/data-raw/agency/tif_agency_names.csv @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:fcbcf43f2a66232e2f309f75339ed5057da12c33d27692d270b37df776d9f46d -size 26934 +oid sha256:6ca59da53a3bdc008f0b514829aa6717f9fa1d99d421509a6754078d8b329f61 +size 27439 diff --git a/data-raw/cpi/cpi.R b/data-raw/cpi/cpi.R index 29246c0..98289f7 100644 --- a/data-raw/cpi/cpi.R +++ b/data-raw/cpi/cpi.R @@ -1,7 +1,7 @@ library(arrow) library(dplyr) library(miniUI) -library(tabulizer) +library(pdftools) library(tidyr) library(stringr) @@ -14,27 +14,34 @@ row_to_names <- function(df) { # The goal of this script is to create a data frame of Consumer Price Indices # CPI-U used by PTELL to calculate/cap property tax extensions # We can load the historical CPIs from a PDF provided by the State of Illinois +# https://tax.illinois.gov/content/dam/soi/en/web/tax/localgovernments/property/documents/cpihistory.pdf # nolint # Paths for local raw data storage and remote storage on S3 remote_bucket <- Sys.getenv("S3_REMOTE_BUCKET") remote_path <- file.path(remote_bucket, "cpi", "part-0.parquet") -# Extract the table only (no headers), then manually assign header -cpi_ext <- extract_areas(file = "data-raw/cpi/cpihistory.pdf")[[1]] -cpi <- as_tibble(cpi_ext[, c(1, 2, 4, 5, 6)]) -cpi <- setNames(cpi, c("year", "cpi", "ptell_cook", "comments", "levy_year")) +cpi <- pdftools::pdf_text(pdf = "data-raw/cpi/cpihistory.pdf") %>% + str_extract(., regex("1991.*", dotall = TRUE)) %>% + str_remove_all(., "\\(5 % for Cook\\)") %>% + str_split(., "\n") %>% + unlist() %>% + tibble(vals = `.`) %>% + mutate(vals = str_squish(vals)) %>% + separate_wider_delim( + col = vals, + names = c("year", "cpi", "pct", "ptell_cook", "levy_year", "year_paid"), + delim = " ", too_few = "align_start", too_many = "drop" + ) -# Merge Cook rate into main column 
cpi <- cpi %>% mutate( across(c(year, levy_year), as.character), across(c(cpi), as.numeric), - across(c(ptell_cook, comments), readr::parse_number), - ptell_cook = ifelse(!is.na(comments), comments, ptell_cook), + across(c(ptell_cook), readr::parse_number), ptell_cook = ptell_cook / 100 ) %>% - select(-comments) %>% - filter(year != "1991") %>% + filter(year != "1991", year != "", year != "CPI") %>% + select(-pct, -year_paid) %>% arrange(year) # Write to S3 diff --git a/data-raw/cpi/cpihistory.pdf b/data-raw/cpi/cpihistory.pdf index 3eed185..e6fedfd 100644 Binary files a/data-raw/cpi/cpihistory.pdf and b/data-raw/cpi/cpihistory.pdf differ diff --git a/data-raw/create_db.R b/data-raw/create_db.R index af73e37..dc0c1eb 100644 --- a/data-raw/create_db.R +++ b/data-raw/create_db.R @@ -37,7 +37,7 @@ db_send_queries <- function(conn, sql) { # changes. This is checked against Config/Requires_DB_Version in the DESCRIPTION # file via check_db_version(). Schema is: # "MAX_YEAR_OF_DATA.MAJOR_VERSION.MINOR_VERSION" -db_version <- "2021.0.4" +db_version <- "2022.0.0" # Set the package version required to use this database. This is checked against # Version in the DESCRIPTION file. 
Basically, we have a two-way check so that diff --git a/data-raw/eq_factor/eq_factor.csv b/data-raw/eq_factor/eq_factor.csv index d810123..21e3600 100644 --- a/data-raw/eq_factor/eq_factor.csv +++ b/data-raw/eq_factor/eq_factor.csv @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:b3013b8ac0e6ddae2eeb73a446c5d323904507e4f2b0562aa6acf4e034aeb82b -size 832 +oid sha256:2a27cbc30f2e5281f69006fda8c56bac0d6d7cebd577100e9452c77e379816b1 +size 850 diff --git a/data-raw/pin/pin.R b/data-raw/pin/pin.R index f67cbf4..a8303bb 100644 --- a/data-raw/pin/pin.R +++ b/data-raw/pin/pin.R @@ -2,6 +2,7 @@ library(arrow) library(DBI) library(dplyr) library(geoarrow) +library(noctua) library(odbc) library(sf) library(tidyr) @@ -27,6 +28,10 @@ ccaodata <- dbConnect( .connection_string = Sys.getenv("DB_CONFIG_CCAODATA") ) +# Establish a connection to the Data Department's Athena data warehouse. We'll use +# values from here to fill in any missing values from the legacy system +ccaoathena <- dbConnect(noctua::athena()) + # Pull AV and class from the Clerk and HEAD tables, giving preference to values # from the Clerk table in case of mismatch (except for property class). # These tables are pulled from the AS/400 and will be pulled from iasWorld @@ -82,6 +87,25 @@ pin <- dbGetQuery( tax_bill_total = tidyr::replace_na(tax_bill_total, 0) ) +# Pull AVs from Athena to fill in any missingness from the legacy system +pin_athena <- dbGetQuery( + ccaoathena, + " + SELECT DISTINCT + pin, + year, + mailed_tot, + certified_tot, + board_tot + FROM default.vw_pin_value + WHERE year >= '2006' + " +) %>% + mutate( + across(c(year, pin), as.character), + across(c(ends_with("_tot")), as.integer) + ) + pin_fill <- pin %>% # There are a few (less than 100) rows with Clerk AVs split for the same PIN. 
# Sum to get 1 record per PIN, then keep the record with the highest AV @@ -89,6 +113,12 @@ pin_fill <- pin %>% mutate(av_clerk = sum(av_clerk)) %>% ungroup() %>% distinct(year, pin, .keep_all = TRUE) %>% + left_join(pin_athena, by = c("year", "pin")) %>% + mutate( + av_board = ifelse(is.na(av_board), board_tot, av_board), + av_certified = ifelse(is.na(av_certified), certified_tot, av_certified), + av_mailed = ifelse(is.na(av_mailed), mailed_tot, av_mailed) + ) %>% # A few (less than 500) values are missing from the mailed assessment stage # AV column. We can replace any missing mailed value with certified value # from the same year. Only 2 board/certified values are missing, and both are @@ -97,7 +127,8 @@ pin_fill <- pin %>% av_board = ifelse(is.na(av_board), 0L, av_board), av_certified = ifelse(is.na(av_certified), 0L, av_certified), av_mailed = ifelse(is.na(av_mailed), av_certified, av_mailed) - ) + ) %>% + select(-ends_with("_tot")) # Write to S3 arrow::write_dataset( diff --git a/data-raw/sample_tax_bills/2022_200_04261010740000.pdf b/data-raw/sample_tax_bills/2022_200_04261010740000.pdf new file mode 100644 index 0000000..336ca65 Binary files /dev/null and b/data-raw/sample_tax_bills/2022_200_04261010740000.pdf differ diff --git a/data-raw/sample_tax_bills/2022_202_28244220220000.pdf b/data-raw/sample_tax_bills/2022_202_28244220220000.pdf new file mode 100644 index 0000000..5b6eff5 Binary files /dev/null and b/data-raw/sample_tax_bills/2022_202_28244220220000.pdf differ diff --git a/data-raw/sample_tax_bills/2022_203_19063120380000.pdf b/data-raw/sample_tax_bills/2022_203_19063120380000.pdf new file mode 100644 index 0000000..b8617fa Binary files /dev/null and b/data-raw/sample_tax_bills/2022_203_19063120380000.pdf differ diff --git a/data-raw/sample_tax_bills/2022_204_02171060120000.pdf b/data-raw/sample_tax_bills/2022_204_02171060120000.pdf new file mode 100644 index 0000000..8eccf17 Binary files /dev/null and 
b/data-raw/sample_tax_bills/2022_204_02171060120000.pdf differ diff --git a/data-raw/sample_tax_bills/2022_205_10252080490000.pdf b/data-raw/sample_tax_bills/2022_205_10252080490000.pdf new file mode 100644 index 0000000..f16c5ff Binary files /dev/null and b/data-raw/sample_tax_bills/2022_205_10252080490000.pdf differ diff --git a/data-raw/sample_tax_bills/2022_211_14333001380000.pdf b/data-raw/sample_tax_bills/2022_211_14333001380000.pdf new file mode 100644 index 0000000..6aedebc Binary files /dev/null and b/data-raw/sample_tax_bills/2022_211_14333001380000.pdf differ diff --git a/data-raw/sample_tax_bills/2022_299_14052110241207.pdf b/data-raw/sample_tax_bills/2022_299_14052110241207.pdf new file mode 100644 index 0000000..4c6b49e Binary files /dev/null and b/data-raw/sample_tax_bills/2022_299_14052110241207.pdf differ diff --git a/data-raw/sample_tax_bills/2022_299_23222000451009.pdf b/data-raw/sample_tax_bills/2022_299_23222000451009.pdf new file mode 100644 index 0000000..bfbd605 Binary files /dev/null and b/data-raw/sample_tax_bills/2022_299_23222000451009.pdf differ diff --git a/data-raw/sample_tax_bills/2022_593_08261020260000.pdf b/data-raw/sample_tax_bills/2022_593_08261020260000.pdf new file mode 100644 index 0000000..633e916 Binary files /dev/null and b/data-raw/sample_tax_bills/2022_593_08261020260000.pdf differ diff --git a/data-raw/sample_tax_bills/agency_name_match.csv b/data-raw/sample_tax_bills/agency_name_match.csv index 59327c1..35c6c5b 100644 --- a/data-raw/sample_tax_bills/agency_name_match.csv +++ b/data-raw/sample_tax_bills/agency_name_match.csv @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:d76e4471b3e6b34eb199abe6e2db6887609b451361bb45d68f918172f69dc749 -size 7254 +oid sha256:fc23b4e2eb9c7d1ebca963e2fae1842ad469e51183146e73f5971edfbcc54737 +size 8945 diff --git a/data-raw/sample_tax_bills/sample_tax_bills_detail.R b/data-raw/sample_tax_bills/sample_tax_bills_detail.R index e4d9d53..2566ba2 100644 --- 
a/data-raw/sample_tax_bills/sample_tax_bills_detail.R +++ b/data-raw/sample_tax_bills/sample_tax_bills_detail.R @@ -1,6 +1,6 @@ library(dplyr) library(tidyr) -library(tabulizer) +library(pdftools) library(miniUI) library(stringr) library(purrr) @@ -25,19 +25,46 @@ row_to_names <- function(df) { # Different tax bills can have different table sizes depending on the number of -# taxing district. As such, the table bottom boundary will be different for each -# bill. Here we manually specify the area of table using an interactive widget +# taxing districts. extract_tax_bill <- function(file) { base_file <- basename(file) - - # Scan table into memory - tbl <- tabulizer::extract_areas(file = file, pages = 1)[[1]] %>% - as_tibble() %>% - row_to_names() %>% - set_names( - c("agency_name", "final_tax", "rate", "percent", "pension", "prev_tax") + tbl <- pdf_text(file)[[1]] %>% + str_extract(., regex("MISCELLANEOUS TAXES.*", dotall = TRUE)) %>% + str_split(., "\n") %>% + unlist() %>% + tibble(vals = `.`) %>% + mutate(vals = str_replace_all(vals, "[:space:]{2,}", "\t")) %>% + separate_wider_delim( + col = vals, + names = c( + "agency_name", "final_tax", "rate", "percent", + "pension", "prev_tax" + ), + delim = "\t", too_few = "align_start", too_many = "drop" + ) %>% + mutate( + agency_name = str_squish(agency_name), + flag = is.na(prev_tax), + prev_tax = if_else(flag, + pension, + prev_tax + ), + pension = if_else(flag, + NA, + pension + ) + ) %>% + select(-flag) %>% + filter( + agency_name != "", + !str_detect( + agency_name, + paste0( + "TAXES|Assess|Property|EAV|Local Tax|", + "Total Tax|Do not|Equalizer|cookcountyclerk.com" + ) + ) ) - # Create a list with metadata for output out <- list( year = str_sub(base_file, 1, 4), @@ -85,14 +112,15 @@ bills_df <- bills_df %>% group_by(pin, year, agency_num) %>% mutate(across(final_tax:prev_tax, sum)) %>% select(-cook) %>% - filter(!is.na(agency_num), row_number() == 1) %>% + filter(!is.na(agency_num), name_priority == 1) %>% + 
select(-name_priority) %>% ungroup() # Round numeric values to nearest hundredth bills_df <- bills_df %>% mutate( - across(c(final_tax, percent, pension, prev_tax), round, 2), - across(c(rate), round, 3), + across(c(final_tax, percent, pension, prev_tax), ~ round(.x, 2)), + across(c(rate), ~ round(.x, 3)), ) # Write detail results to file for safekeeping diff --git a/data-raw/sample_tax_bills/sample_tax_bills_detail.csv b/data-raw/sample_tax_bills/sample_tax_bills_detail.csv index e89b6dc..ca01b4a 100644 --- a/data-raw/sample_tax_bills/sample_tax_bills_detail.csv +++ b/data-raw/sample_tax_bills/sample_tax_bills_detail.csv @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:5f5a9f5fea3b6a41cfac65661a95d4bb0699c17e0da67d25792739db5d93022c -size 47317 +oid sha256:1968dafd92ca407c4247ae4bfcd63d6380749e90b566f1cb92170dcc4e36bec5 +size 57271 diff --git a/data-raw/sample_tax_bills/sample_tax_bills_summary.csv b/data-raw/sample_tax_bills/sample_tax_bills_summary.csv index b6c95a0..b5560f0 100644 --- a/data-raw/sample_tax_bills/sample_tax_bills_summary.csv +++ b/data-raw/sample_tax_bills/sample_tax_bills_summary.csv @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:eae93e3cfc8d5396e424f4bccf11468f685e16ece56487c03a73c374e52bfda6 -size 4948 +oid sha256:58df45909edc638165005622d8e9eec435866f154efbe1519067906e59ce9c6e +size 5976 diff --git a/data-raw/tax_code/2022 Tax Code Agency Rate.xlsx b/data-raw/tax_code/2022 Tax Code Agency Rate.xlsx new file mode 100644 index 0000000..9b55b61 --- /dev/null +++ b/data-raw/tax_code/2022 Tax Code Agency Rate.xlsx @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:eee13315554b2c096e6513c2ff13e0cb178b9c46473f2ddd414bd705f52102bb +size 1892319 diff --git a/data-raw/tax_code/tax_code.R b/data-raw/tax_code/tax_code.R index e9dd268..562ea02 100644 --- a/data-raw/tax_code/tax_code.R +++ b/data-raw/tax_code/tax_code.R @@ -23,6 +23,7 @@ file_names <- list.files( # Load each file and 
cleanup columns, then combine into single df tax_code <- map_dfr(file_names, function(file) { # Extract year from file name + print(file) year_ext <- str_extract(file, "\\d{4}") # Load file based on extension @@ -39,6 +40,14 @@ tax_code <- map_dfr(file_names, function(file) { ~ str_replace(.x, "taxcode", "tax_code"), starts_with("taxcode") ) %>% + rename_with( + ~ str_replace(.x, "ag_rate", "agency_rate"), + starts_with("ag_rate") + ) %>% + rename_with( + ~ str_replace(.x, "code_rate", "tax_code_rate"), + starts_with("code_rate") + ) %>% mutate( year = as.character(year_ext), agency_rate = as.numeric(agency_rate), diff --git a/data-raw/tif/distribution/2022 TIF Agency Distribution Report.xlsx b/data-raw/tif/distribution/2022 TIF Agency Distribution Report.xlsx new file mode 100644 index 0000000..a6e75f2 --- /dev/null +++ b/data-raw/tif/distribution/2022 TIF Agency Distribution Report.xlsx @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:5a60312f8c7a924bc88cb6728930027e6e512c89b1f811fe8964d5ea2f3d6cc3 +size 170449 diff --git a/data-raw/tif/main/2022 Cook County TIF Summary.xlsx b/data-raw/tif/main/2022 Cook County TIF Summary.xlsx new file mode 100644 index 0000000..186f0a2 --- /dev/null +++ b/data-raw/tif/main/2022 Cook County TIF Summary.xlsx @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:5bfe861c1e1dd5a15c0398c40aed22023379fa114876186d530d46426e157641 +size 45095 diff --git a/data-raw/tif/tif.R b/data-raw/tif/tif.R index a4b3cac..47a28bd 100644 --- a/data-raw/tif/tif.R +++ b/data-raw/tif/tif.R @@ -123,8 +123,8 @@ tif_main_pdf <- map_dfr(summ_file_names_pdf, function(file) { "curr_year_revenue", "prev_year_revenue", "pct_diff" )) %>% filter(agency_num != "AGENCY") %>% - na_if("-") %>% - na_if("") %>% + mutate(across(where(is.character), ~ na_if(.x, "-"))) %>% + mutate(across(where(is.character), ~ na_if(.x, ""))) %>% mutate( year = year_ext, agency_num = str_pad( diff --git a/data/sample_tax_bills_detail.rda 
b/data/sample_tax_bills_detail.rda index 413c396..e47e3f7 100644 Binary files a/data/sample_tax_bills_detail.rda and b/data/sample_tax_bills_detail.rda differ diff --git a/data/sample_tax_bills_summary.rda b/data/sample_tax_bills_summary.rda index c9c2851..b08335a 100644 Binary files a/data/sample_tax_bills_summary.rda and b/data/sample_tax_bills_summary.rda differ diff --git a/tests/testthat/test-lookup.R b/tests/testthat/test-lookup.R index 8054fcb..ed7d0fe 100644 --- a/tests/testthat/test-lookup.R +++ b/tests/testthat/test-lookup.R @@ -232,7 +232,7 @@ test_that("lookup values/data are correct", { ) expect_known_hash( lookup_agency(sum_df$year, sum_df$tax_code), - "b0b8d1fbca" + "c4d062201d" ) }) diff --git a/tests/testthat/test-tax_bill.R b/tests/testthat/test-tax_bill.R index a489c6d..16250c1 100644 --- a/tests/testthat/test-tax_bill.R +++ b/tests/testthat/test-tax_bill.R @@ -140,12 +140,12 @@ test_that("returned amount/output correct for all sample bills", { expect_equal( tax_bill(sum_dt$year, sum_dt$pin, simplify = FALSE) %>% nrow(), - 508 + 621 ) expect_equal( tax_bill(sum_dt$year, sum_dt$pin, simplify = TRUE) %>% nrow(), - 525 + 639 ) # District level tax amounts @@ -164,7 +164,14 @@ test_that("returned amount/output correct for all sample bills", { # Exclude certain PINs in the RPM TIF or with extremely high bills # Will run separate tests for these sum_dt_no_rpm <- sum_dt %>% - filter(!pin %in% c("14174100180000", "01363010130000")) + filter(!pin %in% c( + "14174100180000", + "01363010130000", + "14333001380000", + "10252080490000" + # TODO: This last PIN has an exemption on its 2022 bill but not in the + # 2022 clerk data. Seems like a new parcel, need to investigate further + )) test_that("all differences are less than $25", { expect_true(