```{r include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```
# Data Abstraction
At this point, we've learned the basics of working with the R language. From here we'll want to explore how to analyze data, both statistically and spatially. One part of this is abstracting information from existing data sets by selecting variables and observations and summarizing their statistics.
Some useful methods for data abstraction can be found in the various packages of "The Tidyverse" [@wickham2017tidyverse], which can be included all at once with the **`tidyverse`** package. We'll start with **`dplyr`**, which includes an array of data manipulation tools: **`select`** for selecting variables, **`filter`** for subsetting observations, **`summarize`** for reducing variables to summary statistics (typically stratified by groups), and **`mutate`** for creating new variables from mathematical expressions based on existing variables. We'll look at some other dplyr tools, such as data joins, later in the data transformation chapter.
## Background: Exploratory Data Analysis
In 1961, John Tukey proposed a new approach to data analysis, defining it as "Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data."
<img src="img/Tukey1977.png" width="161" height="230" style="float:right">He followed this up much later, in 1977, with *Exploratory Data Analysis*.
Exploratory data analysis (EDA) developed in part as an approach to analyzing data via summaries, tables, and graphics. The key word is *exploratory*, in contrast with *confirmatory* statistics. Both are important, but ignoring exploration is ignoring enlightenment.
Some purposes of EDA are:
- to suggest hypotheses
- to assess assumptions on which inference will be based
- to select appropriate inferential statistical tools
- to guide further data collection
These concepts led to the development of S at Bell Labs (John Chambers, 1976), then R, built on clear design and extensive graphics capabilities.
## The Tidyverse and what we'll explore in this chapter
The Tidyverse refers to a suite of R packages developed at RStudio (see <a href="https://rstudio.com">RStudio</a> and <a href="https://r4ds.had.co.nz">R for Data Science</a>) for facilitating data processing and analysis. While R itself is designed around EDA, the Tidyverse takes it further. Some of the most widely used packages in the Tidyverse are:
- **`dplyr`** : data manipulation like a database
- **`readr`** : better methods for reading and writing rectangular data
- **`tidyr`** : reorganization methods that extend dplyr's database capabilities
- **`purrr`** : expanded programming toolkit including enhanced "apply" methods
- **`tibble`** : improved data frame
- **`stringr`** : string manipulation library
- **`ggplot2`** : graphing system based on *the grammar of graphics*
In this chapter, we'll be mostly exploring **dplyr**, with a few other things thrown in like reading data frames with **readr**. For simplicity, we can just include `library(tidyverse)` to get everything.
## Tibbles
Tibbles are an improved type of data frame
- part of the Tidyverse
- serve the same purpose as a data frame, and all data frame operations work
Advantages
- display better
- can be composed of more complex objects like lists, etc.
- can be grouped
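For instance, here's a minimal sketch (my own, not from the text) of that "more complex objects" point: a tibble can hold a list-column, where each cell is itself a list element.
```{r}
library(tidyverse)
# Each cell of obs holds a different kind of object
tb <- tibble(id = 1:3, obs = list(1:2, letters[1:3], TRUE))
tb
```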
How created
- Reading from a CSV, using one of a variety of Tidyverse functions similarly named to base functions:
- `read_csv` creates a tibble (in general, underscores are used in the Tidyverse)
- `read.csv` creates a regular data frame
[**`air_quality` project**] [@TRI]
```{r message=F}
library(tidyverse) # includes readr, ggplot2, and dplyr which we'll use in this chapter
library(iGIScData)
csvPath <- system.file("extdata","TRI_1987_BaySites.csv", package="iGIScData")
TRI87 <- read_csv(csvPath)
```
- You can also use the `tibble()` function
```{r message=F, warning=F}
a <- rnorm(10)
b <- runif(10)
ab <- tibble(a,b)
ab
```
### `read_csv` vs. `read.csv`
You might be tempted to use `read.csv` from base R:
- They look a lot alike, so you might confuse them.
- You don't need to load `library(readr)`.
- `read.csv` "fixes" some things, and that might be desired:
   - problematic field names like `MLY-TAVG-NORMAL` become `MLY.TAVG.NORMAL`
   - numbers stored as characters are converted to numbers: "01" becomes 1, "02" becomes 2, etc.
However, there are potential problems:
- You may not want some of those changes, and would rather specify them separately.
- There are known problems that `read_csv` avoids.
Recommendation: Use `read_csv` and `write_csv`.
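To see the field-name difference directly, here's a small sketch with made-up inline data (the field name reuses the example above; the values are invented, and newer versions of readr may suggest wrapping literal text in `I()`):
```{r}
csvText <- "MLY-TAVG-NORMAL,station\n12.1,Alpha\n13.4,Beta"
names(read.csv(text = csvText))  # base R "fixes" the name to MLY.TAVG.NORMAL
names(read_csv(csvText))         # readr keeps MLY-TAVG-NORMAL intact
```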
## Statistical summary of variables
A simple statistical summary is very easy to do, and we'll use **`eucoak`** data in the `iGIScData` package from a study of comparative runoff and erosion under Eucalyptus and oak canopies [@eucoak]:
```{r}
summary(eucoakrainfallrunoffTDR)
```
In this study, we looked at the amount of runoff and erosion captured in Gerlach troughs on paired eucalyptus and oak sites in the San Francisco Bay Area.
<img src="img/eucoak.png">
```{r echo=F}
library(tidyverse); library(sf); library(leaflet)
sites <- read_csv(system.file("extdata","eucoakSites.csv", package="iGIScData"))
leaflet(data = sites) %>%
addTiles() %>%
addMarkers(~long, ~lat, popup = ~Site, label = ~Site)
```
## Visualizing data with a Tukey box plot
```{r warning=F}
ggplot(data = eucoakrainfallrunoffTDR) + geom_boxplot(mapping = aes(x=site, y=runoffL_euc))
```
## Database operations with `dplyr`
As part of exploring our data, we'll typically simplify or reduce it for our purposes.
The following methods are quickly discovered to be essential as part of exploring and analyzing data.
- **select rows** using logic, such as population > 10000, with `filter`
- **select variable columns** you want to retain with `select`
- **add** new variables and assign their values with `mutate`
- **sort** rows based on a field with `arrange`
- **summarize** by group
### Select, mutate, and the pipe
**The pipe `%>%`**: Read `%>%` as "and then..." This is bigger than it sounds and opens up a lot of possibilities. See example below, and observe how the expression becomes several lines long. In the process, we'll see examples of new variables with mutate and selecting (and in the process *ordering*) variables. [If you haven't already created it, this should be in a **`eucoak` project**]
```{r}
runoff <- eucoakrainfallrunoffTDR %>%
mutate(Date = as.Date(date,"%m/%d/%Y"),
rain_subcanopy = (rain_oak + rain_euc)/2) %>%
dplyr::select(site, Date, rain_mm, rain_subcanopy,
runoffL_oak, runoffL_euc, slope_oak, slope_euc)
library(DT)
DT::datatable(runoff,options=list(scrollX=T))
```
*Note: to just rename a variable, use `rename` instead of `mutate`. It will stay in position.*
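For instance, a hypothetical rename of `rain_mm` (my example, using the `runoff` tibble built above):
```{r}
runoff %>% rename(rainfall_mm = rain_mm) %>% names()
```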
**Helper functions for `select()`**
In the `select()` example above, we listed all of the variables, but there are a variety
of helper functions for using logic to specify which variables to select; a few of these are applied in the sketch after this list:
- `contains("_")` or any substring of interest in the variable name
- `starts_with("runoff")`
- `ends_with("euc")`
- `everything()`
- `matches()` a regular expression
- `num_range("x",1:5)` for the common situation where a series of variable names combine a string and a number
- `one_of(myList)` for when you have a group of variable names
- a range of variables: e.g. `runoffL_oak:slope_euc` could have followed `rain_subcanopy` above
- all but (`-`): preface a variable or a set of variable names with `-` to select all others
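Here's that sketch, applying a few of these helpers to the `runoff` tibble created above (my own illustrations; each could be written several ways):
```{r}
runoff %>% dplyr::select(starts_with("runoff")) %>% head(3)     # name prefix
runoff %>% dplyr::select(contains("slope")) %>% head(3)         # substring match
runoff %>% dplyr::select(site, rain_mm:runoffL_euc) %>% head(3) # a range
runoff %>% dplyr::select(-starts_with("slope")) %>% head(3)     # all but
```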
### filter
`filter` lets you select observations that meet criteria, similar to an SQL WHERE clause.
```{r}
runoff2007 <- runoff %>%
filter(Date >= as.Date("01/01/2007", "%m/%d/%Y"))
DT::datatable(runoff2007,options=list(scrollX=T))
```
**Filtering out NA with `!is.na`**
Here's an important one. There are many times you need to avoid NAs.
We commonly see summary statistics using `na.rm = TRUE` in order to *ignore* NAs when calculating a statistic like `mean`.
To simply filter out NAs from a vector or a variable, use a filter:
`feb_filt <- feb_s %>% filter(!is.na(TEMP))`
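Since `feb_s` comes from a weather example not loaded here, a runnable equivalent using the `runoff` tibble from above might be (my sketch):
```{r}
runoff_complete <- runoff %>% filter(!is.na(runoffL_oak))
nrow(runoff) - nrow(runoff_complete)  # rows dropped for having NA oak runoff
```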
### Writing a data frame to a csv
Let's say you have created a data frame, maybe with `read_csv`:
`runoff20062007 <- read_csv(csvPath)`
Then you do some processing to change it, maybe adding variables or reorganizing, and you want to write out your new `eucoak` data frame; you just need `write_csv`:
`write_csv(eucoak, "data/tidy_eucoak.csv")`
### Summarize by group
You'll find that you need to use this all the time with real data. You have a bunch of data where some categorical variable defines a grouping, like our `site` field in the eucoak data, and we'd like to create average slope, rainfall, and runoff for each site. Note that it involves two steps: first defining which field defines the group (`group_by`), then listing the various summary statistics we'd like to store (`summarize`). In this case all of the slopes under oak remain the same for a given site -- it's a *site* characteristic -- and the same applies to the euc slope, so we can just grab the first value (`mean` would have also worked, of course).
```{r warning=F, message=F}
eucoakSiteAvg <- runoff %>%
group_by(site) %>%
summarize(
rain = mean(rain_mm, na.rm = TRUE),
rain_subcanopy = mean(rain_subcanopy, na.rm = TRUE),
runoffL_oak = mean(runoffL_oak, na.rm = TRUE),
runoffL_euc = mean(runoffL_euc, na.rm = TRUE),
slope_oak = first(slope_oak),
slope_euc = first(slope_euc)
)
eucoakSiteAvg
```
**Summarizing by group with TRI data [`air_quality` project]** [@TRI]
```{r, message=FALSE, warning=F}
csvPath <- system.file("extdata","TRI_2017_CA.csv", package="iGIScData")
TRI_BySite <- read_csv(csvPath) %>%
mutate(all_air = `5.1_FUGITIVE_AIR` + `5.2_STACK_AIR`) %>%
filter(all_air > 0) %>%
group_by(FACILITY_NAME) %>%
summarize(
FACILITY_NAME = first(FACILITY_NAME),
air_releases = sum(all_air, na.rm = TRUE),
mean_fugitive = mean(`5.1_FUGITIVE_AIR`, na.rm = TRUE),
LATITUDE = first(LATITUDE), LONGITUDE = first(LONGITUDE))
```
### Count
Count is a simple variant on summarize by group, since the only statistic is the count of events. [**`eucoak` project**]
```{r message=F, warning=F}
tidy_eucoak %>% count(tree)
```
**Another way is to use n():**
```{r message=F, warning=F}
tidy_eucoak %>%
group_by(tree) %>%
summarize(n = n())
```
### Sorting after summarizing
Using the marine debris data from the *Marine Debris Monitoring and Assessment Project* [@marineDebris] [in a **new `litter` project**]
```{r message=F, warning=F}
shorelineLatLong <- ConcentrationReport %>%
group_by(`Shoreline Name`) %>%
summarize(
latitude = mean((`Latitude Start`+`Latitude End`)/2),
longitude = mean((`Longitude Start`+`Longitude End`)/2)
) %>%
arrange(latitude)
shorelineLatLong
```
## The dot operator
The dot "." operator derives from UNIX syntax, and refers to "here".
- For accessing files in the current folder, the path is "./filename"
A similar specification is used in piped sequences
- The advantage of the pipe is you don't have to keep referencing the data frame.
- The dot is then used to connect to items inside the data frame [in **`air_quality`**]
```{r message=F, warning=F}
csvPath <- system.file("extdata","TRI_1987_BaySites.csv", package="iGIScData")
TRI87 <- read_csv(csvPath)
stackrate <- TRI87 %>%
mutate(stackrate = stack_air/air_releases) %>%
.$stackrate
head(stackrate)
```
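The dot can also pass the piped data frame to a function that doesn't take data as its first argument, such as `lm` (a sketch I've added, reusing the `TRI87` variables above):
```{r}
# . supplies the piped data frame to lm's data argument
TRI87 %>% lm(stack_air ~ air_releases, data = .) %>% summary()
```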
## Exercises
1. Create a tibble with 20 rows of two variables `norm` and `unif` with `norm` created with `rnorm()` and `unif` created with `runif()`. [**`generic_methods`**]
2. Read in "TRI_2017_CA.csv" [**`air_quality`**] in two ways, as a normal data frame assigned to `df` and as a tibble assigned to `tb`. What field names result for what's listed in the CSV as `5.1_FUGITIVE_AIR`?
3. Use the summary function to investigate the variables in either the data.frame or tibble you just created. What type of field and what values are assigned to BIA_CODE?
4. Create a boxplot of `body_mass_g` by `species` from the `penguins` data frame in the palmerpenguins package [in a **`penguins` project**] [@palmer]. Access the data with data(package = 'palmerpenguins'), and also remember `library(ggplot2)` or `library(tidyverse)`.
```{r include=FALSE}
library(tidyverse)
library(palmerpenguins)
```
```{r include=FALSE}
ggplot(penguins, aes(x=species, y=body_mass_g)) + geom_boxplot()
```
5. Use select, mutate, and the pipe to create a penguinMass tibble where the only original variable retained is species, but with body_mass_kg created as $\frac{1}{1000}$ the body_mass_g. The statement should start with `penguinMass <- penguins` and use a pipe plus the other functions after that.
```{r include=FALSE}
penguinMass <- penguins %>%
mutate(body_mass_kg = body_mass_g / 1000) %>%
dplyr::select(species, body_mass_kg)
penguinMass
```
6. Now, also with penguins, create FemaleChinstraps to include only the female Chinstrap penguins. Start with `FemaleChinstraps <- penguins %>%`
```{r include=FALSE}
FemaleChinstraps <- penguins %>%
filter(sex == "female") %>%
filter(species == "Chinstrap")
FemaleChinstraps
```
7. Now, summarize by `species` groups to create mean and standard deviation variables from `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, and `body_mass_g`. Preface the variable names with either `avg.` or `sd.` Include `na.rm=T` with all statistics function calls.
```{r include=FALSE}
library(palmerpenguins); library(tidyverse)
penguins %>%
group_by(species, sex) %>%
summarize(avg.bill_length_mm = mean(bill_length_mm, na.rm=T),
avg.bill_depth_mm = mean(bill_depth_mm, na.rm=T),
avg.flipper_length_mm = mean(flipper_length_mm, na.rm=T),
avg.body_mass_g = mean(body_mass_g, na.rm=T),
sd.bill_length_mm = sd(bill_length_mm, na.rm=T),
sd.bill_depth_mm = sd(bill_depth_mm, na.rm=T),
sd.flipper_length_mm = sd(flipper_length_mm, na.rm=T),
sd.body_mass_g = sd(body_mass_g, na.rm=T))
```
8. Sort the penguins by `body_mass_g`.
```{r include=FALSE}
penguins %>%
arrange(body_mass_g)
```
## String Abstraction
Character string manipulation is surprisingly critical to data analysis, and so
the **`stringr`** package was developed to provide a wider array of string processing
tools than what is in base R, including functions for detecting matches, subsetting strings,
managing lengths, replacing substrings with other text, and joining, splitting, and sorting strings.
We'll look at some of the stringr functions, but a good way to learn about the wide array of functions
is through the cheat sheet that can be downloaded from
<a href="https://www.rstudio.com/resources/cheatsheets/">https://www.rstudio.com/resources/cheatsheets/</a>.
### Detecting matches
These functions look for patterns within existing strings, which can then be used to subset observations
based on those patterns. [These can be investigated in your **`generic_methods` project**]
- **`str_detect`** detects patterns in a string, returns true or false if detected
- **`str_locate`** detects patterns in a string, returns start and end position if detected, or NA if not
- **`str_which`** returns the indices of strings that match a pattern
- **`str_count`** counts the number of matches in each string
```{r include=F}
library(tidyverse)
library(iGIScData)
```
```{r warning=F}
str_detect(fruit,"qu")
fruit[str_detect(fruit,"qu")]
tail(str_locate(fruit, "qu"),15)
str_which(fruit, "qu")
fruit[str_which(fruit,"qu")]
str_count(fruit,"qu")
```
### Subsetting Strings
Subsetting in this case includes its normal use of abstracting the observations specified
by a match (similar to a filter for data frames), as well as extracting just part of a string, either by start and end character positions or as the part that matches an expression.
- **`str_sub`** extracts a part of a string from a start to an end character position
- **`str_subset`** returns the strings that contain a pattern match
- **`str_extract`** returns the first (or if `str_extract_all` then all matches) pattern matches
- **`str_match`** returns the first (or `_all`) pattern match as a matrix
```{r}
qfruit <- str_subset(fruit, "q")
qfruit
str_sub(qfruit,1,2)
str_sub("94132",1,2)
str_extract(qfruit,"[aeiou]")
```
### String Length
The length of strings is often useful in an analysis process, either just knowing the length
as an integer, or purposefully increasing or reducing it.
- **`str_length`** simply returns the length of the string as an integer
- **`str_pad`** adds a specified character (typically a space " ") to either end of a string
- **`str_trim`** removes whitespace from either end of a string
```{r}
qfruit <- str_subset(fruit,"q")
qfruit
str_length(qfruit)
name <- "Inigo Montoya"
str_length(name)
firstname <- str_sub(name,1,str_locate(name," ")[1]-1)
firstname
lastname <- str_sub(name,str_locate(name," ")[1]+1,str_length(name))
lastname
str_pad(qfruit,10,"both")
str_trim(str_pad(qfruit,10,"both"),"both")
```
### Replacing substrings with other text ("mutating" strings)
These methods range from converting case to replacing substrings.
- **`str_to_lower`** converts strings to lower case
- **`str_to_upper`** converts strings to upper case
- **`str_to_title`** capitalizes strings (makes the first character of each word upper case)
- **`str_sub`** a special use of this function to replace substrings with a specified string
- **`str_replace`** replaces the first matched pattern (or all with `str_replace_all`) with a specified string
```{r}
str_to_lower(name)
str_to_upper(name)
str_to_title("for whom the bell tolls")
str_sub(name,1,str_locate(name," ")[1]-1) <- "Diego"  # replace the first name
name
str_replace(qfruit,"q","z")
```
### Concatenating and splitting
One very common string operation is concatenating strings, and somewhat less common
though useful is splitting them at a key separator like a space, comma, or line end. One use of
`str_c` in the example below is to create a comparable join field by adding a leading zero to a numeric character
string that needs it.
- **`str_c`** The `paste()` function in base R will work, but you might want the default separator to be "" instead
of " "; `str_c` is just `paste` with a default "" separator, though you can still specify `sep=" "`.
- **`str_split`** splits a string into parts based upon the detection of a specified separator like a space, comma, or line end
```{r message=F}
str_split("for whom the bell tolls", " ")
str_c("for","whom","the","bell","tolls",sep=" ")
csvPath <- system.file("extdata","CA_MdInc.csv",package="iGIScData")
CA_MdInc <- read_csv(csvPath)
join_id <- str_c("0",CA_MdInc$NAME) # could also use str_pad(CA_MdInc$NAME,1,side="left",pad="0")
head(CA_MdInc)
head(join_id)
```
## Dates and times with `lubridate`
The `lubridate` package makes it easy to work with dates and times:
- It can parse many forms.
- We'll look at more with time series.
- See the cheat sheet for more information, but the following examples may
demonstrate that it's pretty easy to use, and does a good job of making your
job easier.
```{r}
library(lubridate)
dmy("20 September 2020")
dmy_hm("20 September 2020 10:45")
mdy_hm("September 20, 2020 10:48")
mdy_hm("9/20/20 10:50")
mdy("9.20.20")
start704 <- dmy_hm("24 August 2020 16:00")
end704 <- mdy_hm("12/18/2020 4:45 pm")
year(start704)
month(start704)
day(end704)
hour(end704)
end704-start704
as_date(end704)
hms::as_hms(end704)
```
Note the use of `::` after the package name.
Sometimes you need to specify the package and function name this way, for instance when more than one loaded package has a function of the same name:
`dplyr::select(...)`
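For instance (my example), the `MASS` package also provides a `select` function, so if both were loaded an unqualified `select()` call could be masked; the `::` prefix removes the ambiguity:
```{r}
dplyr::select(eucoakrainfallrunoffTDR, site, date) %>% head()
```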