---
title: "Table of Contents"
output:
github_document:
toc: true
toc_depth: 3
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "docs/figures/",
out.width = "100%"
)
```
> :warning: **NOTE** :warning:
>
> The [condominium model](https://github.com/ccao-data/model-condo-avm) (this repo) is nearly identical to the [residential (single/multi-family) model](https://github.com/ccao-data/model-res-avm), with a few [key differences](#differences-compared-to-the-residential-model). Please read the documentation for the [residential model](https://github.com/ccao-data/model-res-avm) first.
# Prior Models
This repository contains code, data, and documentation for the Cook County Assessor's condominium reassessment model. Information about prior year models can be found at the following links:
| Year(s) | Triad(s) | Method | Language / Framework | Link |
|---------|----------|---------------------------------------------|----------------------------|--------------------------------------------------------------------------------------------------------------------------------------------|
| 2015 | City | N/A | SPSS | [Link](https://gitlab.com/ccao-data-science---modeling/ccao_sf_cama_dev/-/tree/master/code.legacy/2015%20City%20Tri/2015%20Condo%20Models) |
| 2018 | City | N/A | N/A | Not available. Values provided by vendor |
| 2019 | North | Linear regression or GBM model per township | R (Base) | [Link](https://gitlab.com/ccao-data-science---modeling/ccao_sf_cama_dev) |
| 2020 | South | Linear regression or GBM model per township | R (Base) | [Link](https://gitlab.com/ccao-data-science---modeling/ccao_sf_cama_dev) |
| 2021 | City | County-wide LightGBM model | R (Tidyverse / Tidymodels) | [Link](https://github.com/ccao-data/model-condo-avm/tree/2021-assessment-year) |
| 2022 | North | County-wide LightGBM model | R (Tidyverse / Tidymodels) | [Link](https://github.com/ccao-data/model-condo-avm/tree/2022-assessment-year) |
| 2023 | South | County-wide LightGBM model | R (Tidyverse / Tidymodels) | [Link](https://github.com/ccao-data/model-condo-avm/tree/2023-assessment-year) |
| 2024 | City | County-wide LightGBM model | R (Tidyverse / Tidymodels) | [Link](https://github.com/ccao-data/model-condo-avm/tree/2024-assessment-year) |
# Model Overview
The duty of the Cook County Assessor's Office is to value property in a fair, accurate, and transparent way. The Assessor is committed to transparency throughout the assessment process. As such, this document contains:
* [A description of the differences between the residential model and this (condominium) model](#differences-compared-to-the-residential-model)
* [An outline of ongoing issues specific to condominium assessments](#ongoing-issues)
The repository itself contains the [code](./pipeline) for the Automated Valuation Model (AVM) used to generate initial assessed values for all condominium properties in Cook County. This system is effectively an advanced machine learning model (hereafter referred to as "the model"). It uses previous sales to generate estimated sale values (assessments) for all properties.
## Differences Compared to the Residential Model
The Cook County Assessor's Office has started to track a limited number of characteristics (building-level square footage, unit-level square footage, bedrooms, and bathrooms) for condominiums, but the data we have ***varies in both the characteristics available and their completeness*** between triads. Staffing limitations have forced the office to prioritize smaller condo buildings less likely to have recent unit sales in certain parts of the county.
Like most assessors nationwide, our office staff cannot enter buildings to observe property characteristics. For condos, this means we cannot observe amenities, quality, or any other interior characteristics, which must instead be gathered from listings and a number of additional third-party sources.
The only _complete_ information our office currently has about individual condominium units is their age, location, sale date/price, and percentage of ownership. This makes modeling condos particularly challenging, as the number of usable features is quite small. Fortunately, condos have two qualities which make modeling a bit easier:
1. Condos are more homogeneous than single/multi-family properties, i.e. the range of potential condo sale prices is much narrower.
2. Condos are pre-grouped into clusters of like units (buildings), and units within the same building usually have similar sale prices.
We leverage these qualities to produce what we call ***strata***, a feature unique to the condo model. See [Condo Strata](#condo-strata) for more information about how strata is used and calculated.
### Features Used
Because our individual condo unit characteristics are sparse and incomplete, we must rely primarily on aggregate geospatial features, economic features, [strata](#condo-strata), and time of sale to determine condo assessed values. The features in the table below are the ones used in the most recent assessment model.
```{r features_used, message=FALSE, echo=FALSE}
library(dplyr)
library(glue)
library(jsonlite)
library(purrr)
library(readr)
library(tidyr)
library(yaml)
condo_params <- read_yaml("params.yaml")
condo_preds <- as_tibble(condo_params$model$predictor$all)
# Some values are derived in the model itself, so they are not documented
# in the dbt DAG and need to be documented here
# nolint start
hardcoded_descriptions <- tribble(
~"column", ~"description",
"sale_year", "Sale year calculated as the number of years since 0 B.C.E",
"sale_day",
"Sale day calculated as the number of days since January 1st, 1997",
"sale_quarter_of_year", "Character encoding of quarter of year (Q1 - Q4)",
"sale_month_of_year", "Character encoding of month of year (Jan - Dec)",
"sale_day_of_year", "Numeric encoding of day of year (1 - 365)",
"sale_day_of_month", "Numeric encoding of day of month (1 - 31)",
"sale_day_of_week", "Numeric encoding of day of week (1 - 7)",
"sale_post_covid", "Indicator for whether sale occurred after COVID-19 was widely publicized (around March 15, 2020)",
"strata_1",
glue("Condominium Building Strata - {condo_params$input$strata$k_1} Levels"),
"strata_2",
glue("Condominium Building Strata - {condo_params$input$strata$k_2} Levels")
)
# nolint end
# Load the dbt DAG from our prod docs site
dbt_manifest <- fromJSON(
"https://ccao-data.github.io/data-architecture/manifest.json"
)
# nolint start: cyclocomp_linter.
get_column_description <- function(colname, dag_nodes, hardcoded_descriptions) {
# Retrieve the description for a column `colname` either from a set of
# dbt DAG nodes (`dag_nodes`) or a set of hardcoded descriptions
# (`hardcoded_descriptions`). Column descriptions that come from dbt DAG nodes
# will be truncated starting from the first period to reflect the fact that
# we use periods in our dbt documentation to separate high-level column
# summaries from their detailed notes
#
# Prefer the hardcoded descriptions, if they exist
if (colname %in% hardcoded_descriptions$column) {
return(
hardcoded_descriptions[
match(colname, hardcoded_descriptions$column),
]$description
)
}
# If no hardcoded description exists, fall back to checking the dbt DAG
for (node_name in ls(dag_nodes)) {
node <- dag_nodes[[node_name]]
for (column_name in ls(node$columns)) {
if (column_name == colname) {
description <- node$columns[[column_name]]$description
if (!is.null(description) && trimws(description) != "") {
# Strip everything after the first period, since we use the first
# period as a delimiter separating a column's high-level summary from
# its detailed notes in our dbt docs
summary_description <- strsplit(description, ".", fixed = TRUE)[[1]][1]
return(gsub("\n", " ", summary_description))
}
}
}
}
# No match in either the hardcoded descriptions or the dbt DAG, so fall
# back to an empty string
return("")
}
# nolint end
# Make a vector of column descriptions that we can add to the param tibble
# as a new column
param_notes <- condo_preds$value %>%
ccao::vars_rename(names_from = "model", names_to = "athena") %>%
map(~ get_column_description(
.x, dbt_manifest$nodes, hardcoded_descriptions
)) %>%
unlist()
res_params <- read_yaml(
"https://raw.githubusercontent.com/ccao-data/model-res-avm/master/params.yaml"
)
res_preds <- res_params$model$predictor$all
condo_unique_preds <- setdiff(condo_preds$value, res_preds)
condo_preds_fmt <- condo_preds %>%
mutate(description = param_notes) %>%
left_join(
ccao::vars_dict,
by = c("value" = "var_name_model")
) %>%
distinct(
feature_name = var_name_pretty,
variable_name = value,
description,
category = var_type,
type = var_data_type
) %>%
mutate(
category = recode(
category,
char = "Characteristic", acs5 = "ACS5", loc = "Location",
prox = "Proximity", ind = "Indicator", time = "Time",
meta = "Meta", other = "Other", ccao = "Other", shp = "Parcel Shape"
),
feature_name = recode(
feature_name,
"Tieback Proration Rate" = "Condominium % Ownership",
"Year Built" = "Condominium Building Year Built"
),
unique_to_condo_model = ifelse(
variable_name %in% condo_unique_preds |
feature_name %in%
c("Condominium Building Year Built", "Condominium % Ownership"),
TRUE, FALSE
)
) %>%
arrange(desc(unique_to_condo_model), category)
condo_preds_fmt %>%
write_csv("docs/data-dict.csv")
condo_preds_fmt %>%
mutate(unique_to_condo_model = ifelse(unique_to_condo_model, "X", "")) %>%
rename(
"Feature Name" = "feature_name",
"Variable Name" = "variable_name",
"Description" = "description",
"Category" = "category",
"Type" = "type",
"Unique to Condo Model" = "unique_to_condo_model"
) %>%
knitr::kable(format = "markdown")
```
We maintain a few useful resources for working with these features:
- Once you've [pulled the input data](#getting-data), you can inner join the data to the CSV version of the data dictionary ([`docs/data-dict.csv`](./docs/data-dict.csv)) to filter for only the features that we use in the model.
- You can browse our [data catalog](https://ccao-data.github.io/data-architecture/#!/overview) to see more details about these features, in particular the [condo model input view](https://ccao-data.github.io/data-architecture/#!/model/model.ccao_data_athena.model.vw_pin_condo_input) which is the source of our training data.
- You can use the [`ccao` R package](https://ccao-data.github.io/ccao/) or its [Python equivalent](https://ccao-data.github.io/ccao/python/) to programmatically convert variable names to their human-readable versions ([`ccao::vars_rename()`](https://ccao-data.github.io/ccao/reference/vars_rename.html)) or convert numerically-encoded variables to human-readable values ([`ccao::vars_recode()`](https://ccao-data.github.io/ccao/reference/vars_recode.html)). The [`ccao::vars_dict` object](https://ccao-data.github.io/ccao/reference/vars_dict.html) is also useful for inspecting the raw crosswalk that powers the rename and recode functions.
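As a rough sketch of the first point above, the snippet below keeps only the columns of an input data frame whose names appear in the data dictionary's `variable_name` column. Both data frames are toy stand-ins built in memory, and the column names are illustrative assumptions rather than a guaranteed schema. In practice you would read `docs/data-dict.csv` and one of the input parquet files instead.

```r
# Toy stand-in for docs/data-dict.csv (only the column used here)
data_dict <- data.frame(
  variable_name = c("char_yrblt", "meta_township_code", "time_sale_year")
)

# Toy stand-in for an input data set (hypothetical columns and values)
training_data <- data.frame(
  meta_pin = c("A", "B"),
  char_yrblt = c(1965, 2004),
  meta_township_code = c("70", "71"),
  some_unused_column = c(1, 2)
)

# Keep only the columns that are documented as model features
feature_cols <- intersect(names(training_data), data_dict$variable_name)
model_features <- training_data[, feature_cols, drop = FALSE]

print(names(model_features))
```

Selecting by name with `intersect()` has the same effect as an inner join on the variable name, and it silently skips dictionary entries that are absent from the data.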
### Valuation
For the most part, condos are valued the same way as single- and multi-family residential property. We [train a model](https://github.com/ccao-data/model-res-avm#how-it-works) using individual condo unit sales, predict the value of all units, and then apply any [post-modeling adjustment](https://github.com/ccao-data/model-res-avm#post-modeling).
However, because the CCAO has so [little information about individual units](#differences-compared-to-the-residential-model), we must rely on the [condominium percentage of ownership](#features-used) to differentiate between units in a building. This feature is effectively the proportion of the building's overall value held by a unit. It is created when a condominium declaration is filed with the County (usually by the developer of the building). The critical assumption underlying the condo valuation process is that percentage of ownership correlates with the relative market value differences between units.
Percentage of ownership is used in two ways:
1. It is used directly as a predictor/feature in the regression model to estimate differing unit values within the same building.
2. It is used to reapportion unit values directly, i.e. the value of a unit is ultimately equal to `% of ownership * total building value`.
Visually, this looks like:
![](docs/figures/valuation_perc_owner.png)
For what the office terms "nonlivable" spaces (parking spaces, storage space, and common area), the breakout of value works differently. See [this Excel sheet](docs/spreadsheets/condo_nonlivable_demo.xlsx) for an interactive example of how nonlivable spaces are valued based on the total value of a building's livable space.
Percentage of ownership is the single most important feature in the condo model. It determines almost all intra-building differences in unit values.
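The reapportionment step can be sketched as follows. This is a minimal illustration that assumes the total building value is simply the sum of the initial unit-level predictions; the production pipeline may aggregate differently. The function name `reapportion_units()` and all numbers are hypothetical.

```r
# Redistribute a building's total value according to percentage of ownership.
# ASSUMPTION: total building value = sum of initial unit predictions.
reapportion_units <- function(initial_values, pct_ownership) {
  stopifnot(
    length(initial_values) == length(pct_ownership),
    isTRUE(all.equal(sum(pct_ownership), 1))
  )
  total_building_value <- sum(initial_values)
  pct_ownership * total_building_value
}

# Hypothetical four-unit building
initial_values <- c(210000, 190000, 305000, 295000)
pct_ownership <- c(0.20, 0.20, 0.30, 0.30)

final_values <- reapportion_units(initial_values, pct_ownership)
print(final_values)
```

In this toy building, units with the same percentage of ownership end up with the same final value, even though their initial predictions differed slightly.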
### Multi-PIN Sales
The condo model is trained on a select number of "multi-PIN sales" (or "multi-sales") in addition to single-parcel sales. Multi-sales are sales that include more than one parcel. In the case of condominiums, many units are sold bundled with deeded parking spaces that are separate parcels. These two-parcel sales are highly reflective of the unit's actual market price. We split the total value of these two-parcel sales according to their relative percent of ownership before using them for training. For example, for a \$100,000 sale of a unit (4% ownership) and a parking space (1% ownership), the sale would be adjusted to \$80,000:
$$\frac{0.04}{0.04 + 0.01} \times \$100,000 = \$80,000$$
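The same split can be expressed as a short function. This is a sketch of the arithmetic above, not the pipeline's actual code; the name `split_multi_sale()` is hypothetical.

```r
# Split a multi-PIN sale price across its parcels in proportion to each
# parcel's percentage of ownership
split_multi_sale <- function(sale_price, pct_ownership) {
  sale_price * pct_ownership / sum(pct_ownership)
}

# The example from the text: a unit (4%) sold with a parking space (1%)
parcels <- c(unit = 0.04, parking = 0.01)
allocated <- split_multi_sale(100000, parcels)
print(allocated)
```

The unit's share is the adjusted sale price used for training; the remainder is attributed to the parking space.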
## Condo Strata
The condo model uses an engineered feature called *strata* to deliver much of its predictive power. Strata is the binned, time-weighted, 5-year average sale price of the building. There are two strata features used in the model, one with `r condo_params$input$strata$k_1` bins and one with `r condo_params$input$strata$k_2` bins. Buildings are binned across each triad using either quantiles or 1-dimensional k-means. A visual representation of quantile-based strata binning looks like:
![](docs/figures/strata.png)
To put strata in more concrete terms, the table below shows a sample 5-level strata. Each condominium unit would be assigned a strata from this table (Strata 1, Strata 2, etc.) based on the 5-year weighted average sale price of its building. All units in a building will have the same strata.
```{r strata, echo=FALSE}
library(tibble)
tribble(
~"Strata", ~"Range of 5-year Average Sale Price",
"Strata 1", "$0 - $121K",
"Strata 2", "$121K - $149K",
"Strata 3", "$149K - $199K",
"Strata 4", "$199K - $276K",
"Strata 5", "$276K+"
) %>%
knitr::kable(format = "markdown")
```
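Quantile-based binning of buildings into strata can be sketched as below. The sketch assumes each building's time-weighted 5-year average sale price has already been computed, and it covers only the quantile method, not the k-means alternative. The function name `assign_strata()` and the prices are made up for illustration.

```r
# Bin buildings into k strata using quantiles of their average sale price.
# Note: cut() raises an error if the quantile breaks are not unique, which
# can happen when many buildings share the same average price.
assign_strata <- function(building_avg_price, k) {
  breaks <- quantile(building_avg_price, probs = seq(0, 1, length.out = k + 1))
  cut(building_avg_price, breaks = breaks, include.lowest = TRUE, labels = FALSE)
}

# Hypothetical average sale prices for ten buildings
building_avg_price <- c(
  95000, 118000, 127000, 140000, 155000,
  180000, 205000, 250000, 290000, 410000
)

strata_5 <- assign_strata(building_avg_price, k = 5)
print(data.frame(building_avg_price, strata_5))
```

With quantile breaks, each stratum holds roughly the same number of buildings, so the price range covered by each stratum varies.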
Some additional notes on strata:
- Strata is calculated in the [ingest stage](./pipeline/00-ingest.R) of this repository.
- Calculating the 5-year average sale price of a building requires at least 1 sale. Buildings with no sales have their strata imputed via KNN (using year built, number of units, and location as features).
- The number of bins (`r condo_params$input$strata$k_1` and `r condo_params$input$strata$k_2`) was chosen based on model performance. These numbers yielded the lowest root mean-squared error (RMSE).
# Ongoing Issues
The CCAO faces a number of ongoing issues specific to condominium modeling. We are currently working on processes to fix these issues. We list the issues here for the sake of transparency and to provide a sense of the challenges we face.
### Unit Heterogeneity
The current modeling methodology for condominiums makes two assumptions:
1. Condo units within the same building are similar and will sell for similar amounts.
2. If units are not similar, the percentage of ownership will accurately reflect and be proportional to any difference in value between units.
The model process works even in heterogeneous buildings as long as assumption 2 is met. For example, imagine a building with 8 identical units and 1 penthouse unit. This building violates assumption 1 because the penthouse unit is likely larger and worth more than the other 8. However, if the percentage of ownership of each unit is roughly proportional to its value, then each unit will still receive a fair assessment.
However, the model can produce poor results when both of these assumptions are violated. For example, if a building has an extreme mix of different units, each with the same percentage of ownership, then smaller, less expensive units will be overvalued and larger, more expensive units will be undervalued.
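A small numerical sketch makes the two cases concrete. It assumes, for illustration only, that the model estimates the building's total value correctly and that unit values are then set to `% of ownership * total building value`. All figures are hypothetical.

```r
# Hypothetical building: three small units and one much larger unit
market_value <- c(100000, 100000, 100000, 400000)
total_building_value <- sum(market_value)

# Case 1: ownership is proportional to market value (assumption 2 holds)
pct_proportional <- market_value / total_building_value
ratio_proportional <- (pct_proportional * total_building_value) / market_value

# Case 2: every unit has the same ownership share (both assumptions violated)
pct_equal <- rep(1 / length(market_value), length(market_value))
ratio_equal <- (pct_equal * total_building_value) / market_value

# Ratios above 1 indicate overvaluation; ratios below 1 indicate undervaluation
print(round(ratio_proportional, 2))
print(round(ratio_equal, 2))
```

With equal ownership shares, every unit is assigned \$175,000, so the small units are valued at 1.75 times their market value while the large unit is valued at less than half of its market value.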
This problem is rare, but does occur in certain buildings with many heterogeneous units. Such buildings typically go through a process of secondary review to ensure the accuracy of the individual unit values.
### Buildings With Few Sales
The condo model relies on sales within the same building to calculate [strata](#condo-strata). This method works well for large buildings with many sales, but can break down when there are only 1 or 2 sales in a building. The primary danger here is _unrepresentative_ sales, i.e. sales that deviate significantly from the real average value of a building's units. When this happens, buildings can have their average unit sale value pegged too high or low.
Fortunately, buildings without any recent sales are relatively rare, as condos have a higher turnover rate than single- and multi-family property. Smaller buildings with low turnover are the most likely to not have recent sales.
### Buildings Without Sales
When no sales have occurred in a building in the 5 years prior to assessment, the building's strata features are imputed. The model will look at nearby buildings that have similar unit counts/age and then try to assign an appropriate strata to the target building.
Most of the time, this technique produces reasonable results. However, buildings without sales still go through an additional round of review to ensure the accuracy of individual unit values.
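The imputation idea can be sketched with a small nearest-neighbor routine in base R. This is an illustration under stated assumptions: Euclidean distance on standardized features, a majority vote among the `k` nearest buildings, and hypothetical column names and data. The actual imputation in the ingest stage may use a different distance, value of `k`, or feature encoding.

```r
# Impute strata for one building with no sales from its k nearest neighbors
impute_strata_knn <- function(known, target, k = 3) {
  feats <- c("year_built", "num_units", "lon", "lat")

  # Standardize features so that no single feature dominates the distance
  scaled <- scale(rbind(known[, feats], target[, feats]))
  known_s <- scaled[seq_len(nrow(known)), , drop = FALSE]
  target_s <- scaled[nrow(known) + 1, ]

  # Euclidean distance from the target to every building with known strata
  dists <- sqrt(rowSums(sweep(known_s, 2, target_s)^2))
  nearest <- order(dists)[seq_len(k)]

  # Majority vote among the nearest buildings
  votes <- table(known$strata[nearest])
  as.integer(names(votes)[which.max(votes)])
}

# Hypothetical buildings with sales (and therefore known strata)
known <- data.frame(
  year_built = c(1920, 1925, 1930, 2005, 2010, 2015),
  num_units = c(6, 8, 6, 120, 150, 100),
  lon = c(-87.70, -87.71, -87.70, -87.62, -87.63, -87.62),
  lat = c(41.95, 41.95, 41.96, 41.88, 41.89, 41.88),
  strata = c(2, 2, 3, 5, 5, 4)
)

# Hypothetical building with no sales in the last 5 years
target <- data.frame(year_built = 1928, num_units = 7, lon = -87.705, lat = 41.955)

imputed <- impute_strata_knn(known, target, k = 3)
print(imputed)
```

Here the target building resembles the three small, older buildings, two of which are in strata 2, so it is assigned strata 2.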
# FAQs
**Note:** The FAQs listed here are for condo-specific questions. See the residential model documentation for [more general FAQs](https://github.com/ccao-data/model-res-avm#faqs).
**Q: What are the most important features in the condo model?**
As with the [residential model](https://github.com/ccao-data/model-res-avm), the importance of individual features varies by location and time. However, generally speaking, the most important features are:
* Location, location, location. Location is the largest driver of county-wide variation in condo value. We account for location using [geospatial features like neighborhood](#features-used).
* Condo percentage of ownership, which determines the intra-building variation in unit price.
* [Condo building strata](#condo-strata). Strata provides us with a good estimate of the average sale price of a building's units.
**Q: How do I see my condo building's strata?**
Individual building [strata](#condo-strata) are not included with assessment notices or shown on the CCAO's website. However, strata *are* stored in the sample data included in this repository. You can load the data ([`input/condo_strata_data.parquet`](./input/condo_strata_data.parquet)) using R and the `read_parquet()` function from the `arrow` library.
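A minimal sketch of loading the file, assuming the `arrow` package is installed and the file exists at the path given above. The wrapper `load_strata()` is a hypothetical helper that returns `NULL` instead of failing when the file is missing.

```r
# Read the strata data if it is available locally; otherwise return NULL
load_strata <- function(path = "input/condo_strata_data.parquet") {
  if (!file.exists(path)) {
    return(NULL)
  }
  stopifnot(requireNamespace("arrow", quietly = TRUE))
  arrow::read_parquet(path)
}

strata <- load_strata()
if (!is.null(strata)) {
  print(head(strata))
}
```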
**Q: How do I see the assessed value of other units in my building?**
You can use the [CCAO's Address Search](https://www.cookcountyassessor.com/address-search#address) to see all the PINs and values associated with a specific condominium building. Simply leave the `Unit Number` field blank when submitting a search.
**Q: How do I view my unit's percentage of ownership?**
The percentage of ownership for individual units is printed on assessment notices. You may also be able to find it via your building's board or condo declaration.
# Usage
Installation and usage of this model is identical to the [installation and usage of the residential model](https://github.com/ccao-data/model-res-avm#usage). Please follow the instructions listed there.
## Getting Data
The data required to run these scripts is produced by the [ingest stage](pipeline/00-ingest.R), which uses SQL pulls from the CCAO's Athena database as a primary data source. CCAO employees can run the ingest stage or pull the latest version of the input data from our internal DVC store using:
```bash
dvc pull
```
Public users can download data for each assessment year using the links below. Each file should be placed in the `input/` directory prior to running the model pipeline.
#### 2021
- [assmntdata.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2021/assmntdata.parquet)
- [modeldata.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2021/modeldata.parquet)
#### 2022
- [assessment_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2022/assessment_data.parquet)
- [condo_strata_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2022/condo_strata_data.parquet)
- [land_nbhd_rate_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2022/land_nbhd_rate_data.parquet)
- [training_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2022/training_data.parquet)
#### 2023
- [assessment_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2023/assessment_data.parquet)
- [condo_strata_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2023/condo_strata_data.parquet)
- [land_nbhd_rate_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2023/land_nbhd_rate_data.parquet)
- [training_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2023/training_data.parquet)
#### 2024
Due to a [data issue](https://github.com/ccao-data/data-architecture/pull/334) with the initial 2024 model run, there are actually _two_ final 2024 models. The run `2024-02-16-silly-billy` was used for Rogers Park only, while the run `2024-03-11-pensive-manasi` was used for all subsequent City of Chicago townships.
The data issue caused some sales to be omitted from the `2024-02-16-silly-billy` training set; however, the actual impact on predicted values was _extremely_ minimal. We chose to update the data and create a second final model out of an abundance of caution and, given low transaction volume in 2023, to include as many arm's-length transactions in the training set as possible.
##### 2024-02-16-silly-billy
- [assessment_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2024/run_id=2024-02-16-silly-billy/assessment_data.parquet)
- [char_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2024/run_id=2024-02-16-silly-billy/char_data.parquet)
- [condo_strata_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2024/run_id=2024-02-16-silly-billy/condo_strata_data.parquet)
- [land_nbhd_rate_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2024/run_id=2024-02-16-silly-billy/land_nbhd_rate_data.parquet)
- [training_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2024/run_id=2024-02-16-silly-billy/training_data.parquet)
##### 2024-03-11-pensive-manasi (final)
- [assessment_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2024/run_id=2024-03-11-pensive-manasi/assessment_data.parquet)
- [char_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2024/run_id=2024-03-11-pensive-manasi/char_data.parquet)
- [condo_strata_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2024/run_id=2024-03-11-pensive-manasi/condo_strata_data.parquet)
- [land_nbhd_rate_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2024/run_id=2024-03-11-pensive-manasi/land_nbhd_rate_data.parquet)
- [training_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2024/run_id=2024-03-11-pensive-manasi/training_data.parquet)
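The links above follow a regular pattern, so the downloads can be scripted. The helpers below are a hypothetical convenience sketch, not part of this repository. They reproduce the URL pattern used by the 2022 and later links, with an optional `run_id` segment for 2024.

```r
# Build the public URL for one input file (pattern taken from the links above)
condo_input_url <- function(year, file, run_id = NULL) {
  base <- "https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo"
  parts <- c(base, year, if (!is.null(run_id)) paste0("run_id=", run_id), file)
  paste(parts, collapse = "/")
}

# Download one input file into the input/ directory (requires network access)
download_condo_input <- function(year, file, run_id = NULL, dest_dir = "input") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  dest <- file.path(dest_dir, file)
  download.file(condo_input_url(year, file, run_id), destfile = dest, mode = "wb")
  invisible(dest)
}

print(condo_input_url(2023, "training_data.parquet"))
print(condo_input_url(2024, "training_data.parquet", run_id = "2024-03-11-pensive-manasi"))
```

For example, calling `download_condo_input(2023, "training_data.parquet")` would save that file to `input/training_data.parquet`.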
For other data from the CCAO, please visit the [Cook County Data Portal](https://datacatalog.cookcountyil.gov/).
# License
Distributed under the AGPL-3 License. See [LICENSE](./LICENSE) for more information.
# Contributing
We welcome pull requests, comments, and other feedback via GitHub. For more involved collaboration or projects, please see the [Developer Engagement Program](https://github.com/ccao-data/people#external) documentation on our group wiki.