statistical_analysis.Rmd

---
title: "Statistical Analyses"
output:
  html_document:
    toc: true
    toc_float:
      collapsed: false
---

<style>
.list-group-item.active, .list-group-item.active:focus, .list-group-item.active:hover {
    background-color: #337ab7;
}

.navbar-default .navbar-collapse, .navbar-default .navbar-form {
background-color: #337ab7;
}

.navbar-default {
background-color: #337ab7;
}

.navbar-default .navbar-nav>li>a {
color: white;
font-weight: bold;
}

.navbar-default .navbar-brand {
color: white;
font-weight: bold;
}

</style>

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(warning = FALSE)
knitr::opts_chunk$set(message = FALSE)
```

```{r, loading_packages}
library(nnet)
library(tidyverse)
library(car)
library(ggpubr)
library(FSA)
library(dunn.test)
library(tools)
library(htmltools)
library(htmlwidgets)
library(rvest)
library(plotly)
```

<br>

## Differences in Length of Crime

<br>

### Introduction
In one of the exploratory analyses, we found that the average length of time of the reported crime (i.e., time between when it started and when it ended) differed for each felony type (sex-related, drug-related, and weapons-related). Results from that exploratory analysis are included below. We see that sex-related felonies last an average of 19.236 days, drug-related felonies last an average of 0.586 days, and weapons-related felonies last an average of 0.282 days.

```{r exp_tidy_crime_data, echo = FALSE}
felony_crime = readRDS(file = "./datasets/sex_drug_weapons.rds")

time_data = felony_crime %>%
  mutate(crime_group = forcats::fct_relevel(crime_group, "Drug-Related"),
         boro_nm = forcats::fct_relevel(boro_nm, "manhattan")) %>% 
  janitor::clean_names() %>% 
  mutate(time_diff2 = (as.numeric(cmplnt_to_dt - cmplnt_fr_dt, units = "days", na.rm = TRUE))) %>% 
  select(time_diff2, boro_nm, crime_group) %>% 
  mutate(time_diff2 = if_else(is.na(time_diff2), 0, time_diff2)) 

time_data %>% rename(`Crime group` = crime_group) %>% 
group_by(`Crime group`) %>%
  summarise(Count = n(), 
            `Average time in days` = mean(time_diff2),
             `Standard deviation` = sd(time_diff2)) %>% knitr::kable()
```

<br>

We will assess whether this difference is statistically significant. Testing this difference can yield important insights regarding whether victims are disproportionately suffering the burden of certain crimes, in terms of absolute number of days. Finding a significant result may also suggest that certain felonies are more extensive and complicated, potentially more dangerous or violent, or may require more resources to deal with. Lastly, it may indicate that the victims are at a higher risk compared to those involved in other types of felonies.

<br> 

### Data and Methods
We first import the data below and clean up the data. We tidy the data so that Manhattan becomes the reference borough and drug-related felonies become the reference felony. Observations with missing information on when the crime ended where assummed to have the ended the same day it started as NYPD recorded information if only there was an end date. 

```{r, data_import}
sex_drug_weapons = readRDS(file = "./datasets/sex_drug_weapons.rds")
```

```{r}
test_mean_data = sex_drug_weapons %>%
  mutate(crime_group = forcats::fct_relevel(crime_group, "Drug-Related"),
         boro_nm = forcats::fct_relevel(boro_nm, "manhattan")) %>% 
  janitor::clean_names() %>% 
  mutate(time_diff2 = (as.numeric(cmplnt_to_dt - cmplnt_fr_dt, units = "days", na.rm = TRUE))) %>% 
  select(time_diff2, boro_nm, crime_group) %>% 
  mutate(time_diff2 = if_else(is.na(time_diff2), 0, time_diff2))
```

Since we are testing the difference in means between three groups, we should be able to conduct an ANOVA test. To ensure that the assumptions of the ANOVA test are not violated, we test for homogeneity of variances. We used Levene's test to test homogeneity of variances. We see from the results below that the p-value is less than 0.05. Therefore, the assumption of homogenous variances is not met. 
```{r levene_test}
 leveneTest(time_diff2 ~ crime_group, data = test_mean_data) %>% 
  broom::tidy() %>% 
  knitr::kable()
```

We resort to using a non-parametric approach in detecting a difference of means between these three groups by utilizing the Kruskal-Wallis test. Using this test means that we do not make any underlying assumptions about the distribution of the three groups; since these groups potentially come from very different populations, a non-parametric approach seems best for this test.
```{r kruskal_wallis_test}
kruskal.test(time_diff2 ~ crime_group, data = test_mean_data) %>% 
  broom::tidy() %>% 
  knitr::kable()
```

<br>

From the test above, it is evident that the p-value is less that 0.05. We conclude that at 5% level of significance, at least one of the average lengths of reported crimes (i.e., time between when the crime started and when it ended) is different among the three felony categories.

Since the Kruskal–Wallis test is significant, we now conduct a post-hoc analysis to determine which crime groups differ from each other level. We use the Dunn test for the post-hoc analysis. 
```{r dunn}
dunnTest(time_diff2 ~ crime_group,
              data = test_mean_data,
              #Method of adjustment - Benjamini-Hochberg               
              method = "bh")  
```

<br>

### Results and Discussion
The results show that there is a statistically significant difference between the average length of reported felonies for sex-related and drug-related felonies (p < 0.005), as well as for sex-related and weapons-related felonies (p <0.005). However, there isn't a significant difference between drug-related and weapons-related felonies (p = 0.4097). All together, we have evidence suggesting that victims of sex-related felonies endure longer incidents, compared to victims of drug-related or weapons-related felonies.

The results of these tests suggest that victims of sex-related felonies endure significantly longer incidents compare to victims of drug-related or weapons-related felonies. This makes sense, considering that sex-related felonies include human trafficking, sexual abuse, and other cases in which the length of the crime is often dragged out compared to those within the other two felonies. Given this result, law enforcement, government, and public health officials should dedicate extra resources to prevent sex-related felonies and ensure that their agencies are adequately prepared and capable to resolve and manage sex-related felonies lasting longer periods of time. 

<br>
<br>

## Association Between Borough and Crime Occurrence

<br>

### Introduction
In one of our exploratory analyses, we found that felony rates differed by borough (see below).

```{r exp_tidy_felony_data, include = FALSE}
sex_drug_weapons = readRDS(file = "./datasets/sex_drug_weapons.rds")

url = "https://www.census.gov/quickfacts/fact/table/newyorkcitynewyork,bronxcountybronxboroughnewyork,kingscountybrooklynboroughnewyork,newyorkcountymanhattanboroughnewyork,queenscountyqueensboroughnewyork,richmondcountystatenislandboroughnewyork/PST045217"

nyc_population = read_html(url) %>%  
  html_nodes(css = "table") %>% .[[1]] %>% 
  html_table(header = TRUE) %>% 
  as.tibble() %>% 
  janitor::clean_names()

names(nyc_population)[1:7] = c("estimate_date", "new_york_city", "bronx", "brooklyn", "manhattan", "queens", "staten_island")

nyc_population = nyc_population %>% 
  gather(key = boro_nm, value = population, estimate_date:staten_island) %>% 
  mutate(population = if_else(population == "Population estimates, July 1, 2017, (V2017)", "2017", population),
         population = as.numeric(gsub("," , "", population)))
```

```{r exp_overall_felony_rates, echo = FALSE}
grouped_df = sex_drug_weapons %>% 
  group_by(boro_nm, year) %>% 
  summarise(number = n())

full = left_join(grouped_df, nyc_population, by = "boro_nm") %>% 
  mutate(crime_rate = (number/population)*100)

x = full %>% 
  filter(!is.na(boro_nm)) %>% 
  ggplot(aes(x = year, y = crime_rate, color = boro_nm)) + 
  geom_point() + geom_line(size = 1) +
  labs(x = "Year",
       y = "Felony Rate",
       legend) + viridis::scale_color_viridis(
      name = "Borough", 
      discrete = TRUE
    ) + theme_classic()

ggplotly(x)
```

<br>

Given this result, we decided to formally analyze the association between sex-related, drug-related, and weapons-related felonies and the borough in which the crime occurred; specifically, we looked at whether the occurrence of the felony was higher (or lower) in a particular borough compared to the other boroughs. The importance of examining this association cannot be overstated - determining whether felonies occur more often in a specific borough provides the NYPD and other officials the potential capability of directing their resources toward areas that need them the most.

<br>

### Data and Methods

<br>

##### Model Statement for the Multinomial Logistic Regression Model
We conducted a multinomial logistic regression to study the association between borough and occurrence of drug-related, sex-related, and weapons-related felonies. The model statement for our research question is included below.

<br>

$$ln (\frac{P( crimegroup = sex-related)}{P(crime group = drug-related)}) = \beta_{10} + \beta_{11}(Bronx) + \beta_{12}(Brooklyn) + \beta_{13}(Queens) + \beta_{14}(Staten Island)$$
<br>

$$ln(\frac{P(crimegroup = weapons-related)}{P(crime group = drug-related)}) = \beta_{20} + \beta_{21}(Bronx) + \beta_{22}(Brooklyn) + \beta_{23}(Queens) + \beta_{24}(Staten Island)$$

<br>

Drug-related felonies was chosen as the reference group among the 3 types of felonies because it tends to be the least serious of the 3 crime groups. Among the boroughs, Manhattan was chosen as the reference group because it is widely thought to be the borough with least crime. 
```{r reference_groups}
### Making Drug-related Felonies and Manhattan the Reference Groups
sex_drug_weapons = sex_drug_weapons %>%
  mutate(crime_group = forcats::fct_relevel(crime_group, "Drug-Related"),
         boro_nm = forcats::fct_relevel(boro_nm, "manhattan"))
```

<br>

#### Cross tabulation

We created a data table to organize the rates of drug-related, sex-related, and weapons-related felonies; we stratified the results by borough. From the table, we can see that, in terms of absolute numbers, Brooklyn has the highest incidence of weapons-related and sex-related felonies, while the Bronx has the highest incidence of drug-related felonies.
```{r, felonies_boroughs}
with(sex_drug_weapons, table(boro_nm, crime_group)) %>%
  knitr::kable()
```

<br>

Next, we fit a logistic regression model to the data. We also plot the data into a graph.
```{r table_felony_and_boroughs, results = "hide"}
fit = multinom(crime_group ~ boro_nm, data = sex_drug_weapons) %>% 
  broom::tidy() %>% 
  mutate(`Odds Ratio` = exp(estimate))
```

```{r tidy_felony_and_boroughs}
clean_fit = fit %>% 
            mutate(term = str_replace(term, "\\(", ""),
               term = str_replace(term, "\\)", ""),
               term = str_replace(term, "boro_nm", ""),
               term = toTitleCase(term)) %>% mutate(term = str_replace(term, "_", " ")) %>% 
  rename(`Crime group` = y.level, Term = term, Estimate = estimate, `Standard error` = std.error, `P value` = p.value) 

clean_fit %>% 
  knitr::kable() 
```

```{r graph_felony_and_boroughs}

ggplot(clean_fit, aes(x = Term, y = `Odds Ratio`)) +
  geom_segment( aes(x = Term, xend = Term, y = 0, yend = `Odds Ratio`), color = "grey") +
  geom_point(color = "orange", size = 3) + facet_grid(~ `Crime group`) +
  theme_light() +
  theme(axis.text.x = element_text(angle = 80, hjust = 1, size = 8),
    panel.grid.major.x = element_blank(),
    panel.border = element_blank(),
    axis.ticks.x = element_blank()) +
  labs(x = "Borough",
       y = "Odds Ratio")
```

<br>

### Results and Discussion
From the data table and the graph above, we see that:

* The odds of a sex-related felony occurring in Manhattan is 1.79 times the odds of a drug-related felony occurring in Manhattan.
* The odds of a sex-related felony occurring in the Bronx is 1.71 times the odds of a drug-related felony occurring in Manhattan.
* The odds of a sex-related felony occurring in Brooklyn is 2.37 times the odds of a drug-related felony occurring in Manhattan.
* The odds of a sex-related felony occurring in Queens is 3.06 times the odds of a drug-related felony occurring in Manhattan.
* The odds of a sex-related felony occurring in Staten Island is 2.11 times the odds of a drug-related felony occurring in Manhattan.

<br>

* The odds of a weapons-related felony occurring in Manhattan is 2.71 times the odds of a drug-related felony occuring in Manhattan.
* The odds of a weapons-related felony occurring in the Bronx is 2.06 times the odds of a drug-related felony occurring in Manhattan.
* The odds of a weapons-related felony occurring in Brooklyn is 4.14 times the odds of a drug-related felony occurring in Manhattan.
* The odds of a weapons-related felony occurring in Queens is 4.22 times the odds of a drug-related felony occurring in Manhattan.
* The odds of a weapons-related felony occurring in Staten Island is 2.62 times the odds of a drug-related felony occurring in Manhattan.

These results suggest a significantly higher odds of a sex-related or a weapons-related felony occurring in all 4 boroughs compared to a drug-related felony occurring in Manhattan. Of note is that a weapons-related felony (OR = 4.22) and a sex-related felony (OR = 3.06) is more likely to occur in Queens compared to a drug-related felony in Manhattan - these odds represent the highest among the boroughs.