-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathstatistical_analysis.Rmd
283 lines (215 loc) · 13.4 KB
/
statistical_analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
---
title: "Statistical Analyses"
output:
html_document:
toc: true
toc_float:
collapsed: false
---
<style>
.list-group-item.active, .list-group-item.active:focus, .list-group-item.active:hover {
background-color: #337ab7;
}
.navbar-default .navbar-collapse, .navbar-default .navbar-form {
background-color: #337ab7;
}
.navbar-default {
background-color: #337ab7;
}
.navbar-default .navbar-nav>li>a {
color: white;
font-weight: bold;
}
.navbar-default .navbar-brand {
color: white;
font-weight: bold;
}
</style>
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(warning = FALSE)
knitr::opts_chunk$set(message = FALSE)
```
```{r, loading_packages}
library(nnet)
library(tidyverse)
library(car)
library(ggpubr)
library(FSA)
library(dunn.test)
library(tools)
library(htmltools)
library(htmlwidgets)
library(rvest)
library(plotly)
```
<br>
## Differences in Length of Crime
<br>
### Introduction
In one of the exploratory analyses, we found that the average length of time of the reported crime (i.e., time between when it started and when it ended) differed for each felony type (sex-related, drug-related, and weapons-related). Results from that exploratory analysis are included below. We see that sex-related felonies last an average of 19.236 days, drug-related felonies last an average of 0.586 days, and weapons-related felonies last an average of 0.282 days.
```{r exp_tidy_crime_data, echo = FALSE}
felony_crime = readRDS(file = "./datasets/sex_drug_weapons.rds")
time_data = felony_crime %>%
mutate(crime_group = forcats::fct_relevel(crime_group, "Drug-Related"),
boro_nm = forcats::fct_relevel(boro_nm, "manhattan")) %>%
janitor::clean_names() %>%
mutate(time_diff2 = (as.numeric(cmplnt_to_dt - cmplnt_fr_dt, units = "days", na.rm = TRUE))) %>%
select(time_diff2, boro_nm, crime_group) %>%
mutate(time_diff2 = if_else(is.na(time_diff2), 0, time_diff2))
time_data %>% rename(`Crime group` = crime_group) %>%
group_by(`Crime group`) %>%
summarise(Count = n(),
`Average time in days` = mean(time_diff2),
`Standard deviation` = sd(time_diff2)) %>% knitr::kable()
```
<br>
We will assess whether this difference is statistically significant. Testing this difference can yield important insights regarding whether victims are disproportionately suffering the burden of certain crimes, in terms of absolute number of days. Finding a significant result may also suggest that certain felonies are more extensive and complicated, potentially more dangerous or violent, or may require more resources to deal with. Lastly, it may indicate that the victims are at a higher risk compared to those involved in other types of felonies.
<br>
### Data and Methods
We first import the data below and clean up the data. We tidy the data so that Manhattan becomes the reference borough and drug-related felonies become the reference felony. Observations with missing information on when the crime ended where assummed to have the ended the same day it started as NYPD recorded information if only there was an end date.
```{r, data_import}
sex_drug_weapons = readRDS(file = "./datasets/sex_drug_weapons.rds")
```
```{r}
test_mean_data = sex_drug_weapons %>%
mutate(crime_group = forcats::fct_relevel(crime_group, "Drug-Related"),
boro_nm = forcats::fct_relevel(boro_nm, "manhattan")) %>%
janitor::clean_names() %>%
mutate(time_diff2 = (as.numeric(cmplnt_to_dt - cmplnt_fr_dt, units = "days", na.rm = TRUE))) %>%
select(time_diff2, boro_nm, crime_group) %>%
mutate(time_diff2 = if_else(is.na(time_diff2), 0, time_diff2))
```
Since we are testing the difference in means between three groups, we should be able to conduct an ANOVA test. To ensure that the assumptions of the ANOVA test are not violated, we test for homogeneity of variances. We used Levene's test to test homogeneity of variances. We see from the results below that the p-value is less than 0.05. Therefore, the assumption of homogenous variances is not met.
```{r levene_test}
leveneTest(time_diff2 ~ crime_group, data = test_mean_data) %>%
broom::tidy() %>%
knitr::kable()
```
We resort to using a non-parametric approach in detecting a difference of means between these three groups by utilizing the Kruskal-Wallis test. Using this test means that we do not make any underlying assumptions about the distribution of the three groups; since these groups potentially come from very different populations, a non-parametric approach seems best for this test.
```{r kruskal_wallis_test}
kruskal.test(time_diff2 ~ crime_group, data = test_mean_data) %>%
broom::tidy() %>%
knitr::kable()
```
<br>
From the test above, it is evident that the p-value is less that 0.05. We conclude that at 5% level of significance, at least one of the average lengths of reported crimes (i.e., time between when the crime started and when it ended) is different among the three felony categories.
Since the Kruskal–Wallis test is significant, we now conduct a post-hoc analysis to determine which crime groups differ from each other level. We use the Dunn test for the post-hoc analysis.
```{r dunn}
dunnTest(time_diff2 ~ crime_group,
data = test_mean_data,
#Method of adjustment - Benjamini-Hochberg
method = "bh")
```
<br>
### Results and Discussion
The results show that there is a statistically significant difference between the average length of reported felonies for sex-related and drug-related felonies (p < 0.005), as well as for sex-related and weapons-related felonies (p <0.005). However, there isn't a significant difference between drug-related and weapons-related felonies (p = 0.4097). All together, we have evidence suggesting that victims of sex-related felonies endure longer incidents, compared to victims of drug-related or weapons-related felonies.
The results of these tests suggest that victims of sex-related felonies endure significantly longer incidents compare to victims of drug-related or weapons-related felonies. This makes sense, considering that sex-related felonies include human trafficking, sexual abuse, and other cases in which the length of the crime is often dragged out compared to those within the other two felonies. Given this result, law enforcement, government, and public health officials should dedicate extra resources to prevent sex-related felonies and ensure that their agencies are adequately prepared and capable to resolve and manage sex-related felonies lasting longer periods of time.
<br>
<br>
## Association Between Borough and Crime Occurrence
<br>
### Introduction
In one of our exploratory analyses, we found that felony rates differed by borough (see below).
```{r exp_tidy_felony_data, include = FALSE}
sex_drug_weapons = readRDS(file = "./datasets/sex_drug_weapons.rds")
url = "https://www.census.gov/quickfacts/fact/table/newyorkcitynewyork,bronxcountybronxboroughnewyork,kingscountybrooklynboroughnewyork,newyorkcountymanhattanboroughnewyork,queenscountyqueensboroughnewyork,richmondcountystatenislandboroughnewyork/PST045217"
nyc_population = read_html(url) %>%
html_nodes(css = "table") %>% .[[1]] %>%
html_table(header = TRUE) %>%
as.tibble() %>%
janitor::clean_names()
names(nyc_population)[1:7] = c("estimate_date", "new_york_city", "bronx", "brooklyn", "manhattan", "queens", "staten_island")
nyc_population = nyc_population %>%
gather(key = boro_nm, value = population, estimate_date:staten_island) %>%
mutate(population = if_else(population == "Population estimates, July 1, 2017, (V2017)", "2017", population),
population = as.numeric(gsub("," , "", population)))
```
```{r exp_overall_felony_rates, echo = FALSE}
grouped_df = sex_drug_weapons %>%
group_by(boro_nm, year) %>%
summarise(number = n())
full = left_join(grouped_df, nyc_population, by = "boro_nm") %>%
mutate(crime_rate = (number/population)*100)
x = full %>%
filter(!is.na(boro_nm)) %>%
ggplot(aes(x = year, y = crime_rate, color = boro_nm)) +
geom_point() + geom_line(size = 1) +
labs(x = "Year",
y = "Felony Rate",
legend) + viridis::scale_color_viridis(
name = "Borough",
discrete = TRUE
) + theme_classic()
ggplotly(x)
```
<br>
Given this result, we decided to formally analyze the association between sex-related, drug-related, and weapons-related felonies and the borough in which the crime occurred; specifically, we looked at whether the occurrence of the felony was higher (or lower) in a particular borough compared to the other boroughs. The importance of examining this association cannot be overstated - determining whether felonies occur more often in a specific borough provides the NYPD and other officials the potential capability of directing their resources toward areas that need them the most.
<br>
### Data and Methods
<br>
##### Model Statement for the Multinomial Logistic Regression Model
We conducted a multinomial logistic regression to study the association between borough and occurrence of drug-related, sex-related, and weapons-related felonies. The model statement for our research question is included below.
<br>
$$ln (\frac{P( crimegroup = sex-related)}{P(crime group = drug-related)}) = \beta_{10} + \beta_{11}(Bronx) + \beta_{12}(Brooklyn) + \beta_{13}(Queens) + \beta_{14}(Staten Island)$$
<br>
$$ln(\frac{P(crimegroup = weapons-related)}{P(crime group = drug-related)}) = \beta_{20} + \beta_{21}(Bronx) + \beta_{22}(Brooklyn) + \beta_{23}(Queens) + \beta_{24}(Staten Island)$$
<br>
Drug-related felonies was chosen as the reference group among the 3 types of felonies because it tends to be the least serious of the 3 crime groups. Among the boroughs, Manhattan was chosen as the reference group because it is widely thought to be the borough with least crime.
```{r reference_groups}
### Making Drug-related Felonies and Manhattan the Reference Groups
sex_drug_weapons = sex_drug_weapons %>%
mutate(crime_group = forcats::fct_relevel(crime_group, "Drug-Related"),
boro_nm = forcats::fct_relevel(boro_nm, "manhattan"))
```
<br>
#### Cross tabulation
We created a data table to organize the rates of drug-related, sex-related, and weapons-related felonies; we stratified the results by borough. From the table, we can see that, in terms of absolute numbers, Brooklyn has the highest incidence of weapons-related and sex-related felonies, while the Bronx has the highest incidence of drug-related felonies.
```{r, felonies_boroughs}
with(sex_drug_weapons, table(boro_nm, crime_group)) %>%
knitr::kable()
```
<br>
Next, we fit a logistic regression model to the data. We also plot the data into a graph.
```{r table_felony_and_boroughs, results = "hide"}
fit = multinom(crime_group ~ boro_nm, data = sex_drug_weapons) %>%
broom::tidy() %>%
mutate(`Odds Ratio` = exp(estimate))
```
```{r tidy_felony_and_boroughs}
clean_fit = fit %>%
mutate(term = str_replace(term, "\\(", ""),
term = str_replace(term, "\\)", ""),
term = str_replace(term, "boro_nm", ""),
term = toTitleCase(term)) %>% mutate(term = str_replace(term, "_", " ")) %>%
rename(`Crime group` = y.level, Term = term, Estimate = estimate, `Standard error` = std.error, `P value` = p.value)
clean_fit %>%
knitr::kable()
```
```{r graph_felony_and_boroughs}
ggplot(clean_fit, aes(x = Term, y = `Odds Ratio`)) +
geom_segment( aes(x = Term, xend = Term, y = 0, yend = `Odds Ratio`), color = "grey") +
geom_point(color = "orange", size = 3) + facet_grid(~ `Crime group`) +
theme_light() +
theme(axis.text.x = element_text(angle = 80, hjust = 1, size = 8),
panel.grid.major.x = element_blank(),
panel.border = element_blank(),
axis.ticks.x = element_blank()) +
labs(x = "Borough",
y = "Odds Ratio")
```
<br>
### Results and Discussion
From the data table and the graph above, we see that:
* The odds of a sex-related felony occurring in Manhattan is 1.79 times the odds of a drug-related felony occurring in Manhattan.
* The odds of a sex-related felony occurring in the Bronx is 1.71 times the odds of a drug-related felony occurring in Manhattan.
* The odds of a sex-related felony occurring in Brooklyn is 2.37 times the odds of a drug-related felony occurring in Manhattan.
* The odds of a sex-related felony occurring in Queens is 3.06 times the odds of a drug-related felony occurring in Manhattan.
* The odds of a sex-related felony occurring in Staten Island is 2.11 times the odds of a drug-related felony occurring in Manhattan.
<br>
* The odds of a weapons-related felony occurring in Manhattan is 2.71 times the odds of a drug-related felony occuring in Manhattan.
* The odds of a weapons-related felony occurring in the Bronx is 2.06 times the odds of a drug-related felony occurring in Manhattan.
* The odds of a weapons-related felony occurring in Brooklyn is 4.14 times the odds of a drug-related felony occurring in Manhattan.
* The odds of a weapons-related felony occurring in Queens is 4.22 times the odds of a drug-related felony occurring in Manhattan.
* The odds of a weapons-related felony occurring in Staten Island is 2.62 times the odds of a drug-related felony occurring in Manhattan.
These results suggest a significantly higher odds of a sex-related or a weapons-related felony occurring in all 4 boroughs compared to a drug-related felony occurring in Manhattan. Of note is that a weapons-related felony (OR = 4.22) and a sex-related felony (OR = 3.06) is more likely to occur in Queens compared to a drug-related felony in Manhattan - these odds represent the highest among the boroughs.