```{r include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```
# Data Abstraction
At this point, we've learned the basics of working with the R language. From here we'll want to explore how to analyze data, both statistically and spatially. One part of this is abstracting information from existing data sets by selecting variables and observations and summarizing their statistics.
Some useful methods for data abstraction can be found in the various packages of "The Tidyverse" [@wickham2017tidyverse], which can be included all at once with the **`tidyverse`** package. We'll start with **`dplyr`**, which includes an array of data manipulation tools: **`select`** for selecting variables, **`filter`** for subsetting observations, **`summarize`** for reducing variables to summary statistics (typically stratified by groups), and **`mutate`** for creating new variables from mathematical expressions based on existing variables. We'll look at some other dplyr tools, such as data joins, later in the data transformation chapter.
## Background: Exploratory Data Analysis
In 1961, John Tukey proposed a new approach to data analysis, defining it as "Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data."
<img src="img/Tukey1977.png" width="161" height="230" style="float:right">He followed this up much later, in 1977, with *Exploratory Data Analysis*.
Exploratory data analysis (EDA) developed in part as an approach to analyzing data via summaries, tables, and graphics. The key word is *exploratory*, in contrast with *confirmatory* statistics. Both are important, but ignoring exploration is ignoring enlightenment.
Some purposes of EDA are:
- to suggest hypotheses
- to assess assumptions on which inference will be based
- to select appropriate inferential statistical tools
- to guide further data collection
These concepts led to the development of S at Bell Labs (John Chambers, 1976), then R, built on clear design and extensive graphics capabilities.
## The Tidyverse and what we'll explore in this chapter
The Tidyverse refers to a suite of R packages developed at RStudio (see <a href="https://rstudio.com">RStudio</a> and <a href="https://r4ds.had.co.nz">R for Data Science</a>) for facilitating data processing and analysis. While R itself is designed around EDA, the Tidyverse takes it further. Some of the most widely used packages in the Tidyverse are:
- **`dplyr`** : data manipulation like a database
- **`readr`** : better methods for reading and writing rectangular data
- **`tidyr`** : reorganization methods that extend dplyr's database capabilities
- **`purrr`** : expanded programming toolkit including enhanced "apply" methods
- **`tibble`** : improved data frame
- **`stringr`** : string manipulation library
- **`ggplot2`** : graphing system based on *the grammar of graphics*
In this chapter, we'll be mostly exploring **dplyr**, with a few other things thrown in like reading data frames with **readr**. For simplicity, we can just include `library(tidyverse)` to get everything.
## Tibbles
Tibbles are an improved type of data frame
- part of the Tidyverse
- serve the same purpose as a data frame, and all data frame operations work
Advantages
- display better
- can be composed of more complex objects like lists, etc.
- can be grouped
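For instance, here's a minimal sketch (my own, not from the text) of that "more complex objects" point: a tibble can hold a list-column, where each cell is itself a list element.
```{r}
library(tidyverse)
# Each cell of obs holds a different kind of object
tb <- tibble(id = 1:3, obs = list(1:2, letters[1:3], TRUE))
tb
```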
How created
- Reading from a CSV, using one of a variety of Tidyverse functions similarly named to base functions:
- `read_csv` creates a tibble (in general, underscores are used in the Tidyverse)
- `read.csv` creates a regular data frame
[**`air_quality` project**] [@TRI]
```{r message=F}
library(tidyverse) # includes readr, ggplot2, and dplyr which we'll use in this chapter
library(iGIScData)
csvPath <- system.file("extdata","TRI_1987_BaySites.csv", package="iGIScData")
TRI87 <- read_csv(csvPath)
```
- You can also use the `tibble()` function
```{r message=F, warning=F}
a <- rnorm(10)
b <- runif(10)
ab <- tibble(a,b)
ab
```
### `read_csv` vs. `read.csv`
You might be tempted to use `read.csv` from base R:
- They look a lot alike, so you might confuse them.
- You don't need to load `library(readr)`.
- `read.csv` "fixes" some things, and that might be desired:
   - problematic field names like `MLY-TAVG-NORMAL` become `MLY.TAVG.NORMAL`
   - numbers stored as characters are converted to numbers: "01" becomes 1, "02" becomes 2, etc.
However, there are potential problems:
- You may not want some of those changes, and would rather specify them separately.
- There are known problems that `read_csv` avoids.
Recommendation: Use `read_csv` and `write_csv`.
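To see the field-name difference directly, here's a small sketch with made-up inline data (the field name reuses the example above; the values are invented, and newer versions of readr may suggest wrapping literal text in `I()`):
```{r}
csvText <- "MLY-TAVG-NORMAL,station\n12.1,Alpha\n13.4,Beta"
names(read.csv(text = csvText))  # base R "fixes" the name to MLY.TAVG.NORMAL
names(read_csv(csvText))         # readr keeps MLY-TAVG-NORMAL intact
```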
## Statistical summary of variables
A simple statistical summary is very easy to do, and we'll use **`eucoak`** data in the `iGIScData` package from a study of comparative runoff and erosion under Eucalyptus and oak canopies [@eucoak]:
```{r}
summary(eucoakrainfallrunoffTDR)
```
In this study, we looked at the amount of runoff and erosion captured in Gerlach troughs on paired eucalyptus and oak sites in the San Francisco Bay Area.
<img src="img/eucoak.png">
```{r echo=F}
library(tidyverse); library(sf); library(leaflet)
sites <- read_csv(system.file("extdata","eucoakSites.csv", package="iGIScData"))
leaflet(data = sites) %>%
addTiles() %>%
addMarkers(~long, ~lat, popup = ~Site, label = ~Site)
```
## Visualizing data with a Tukey box plot
```{r warning=F}
ggplot(data = eucoakrainfallrunoffTDR) + geom_boxplot(mapping = aes(x=site, y=runoffL_euc))
```
## Database operations with `dplyr`
As part of exploring our data, we'll typically simplify or reduce it for our purposes.
The following methods are quickly discovered to be essential as part of exploring and analyzing data.
- **select rows** using logic, such as population > 10000, with `filter`
- **select variable columns** you want to retain with `select`
- **add** new variables and assign their values with `mutate`
- **sort** rows based on a field with `arrange`
- **summarize** by group
### Select, mutate, and the pipe
**The pipe `%>%`**: Read `%>%` as "and then..." This is bigger than it sounds and opens up a lot of possibilities. See example below, and observe how the expression becomes several lines long. In the process, we'll see examples of new variables with mutate and selecting (and in the process *ordering*) variables. [If you haven't already created it, this should be in a **`eucoak` project**]
```{r}
runoff <- eucoakrainfallrunoffTDR %>%
mutate(Date = as.Date(date,"%m/%d/%Y"),
rain_subcanopy = (rain_oak + rain_euc)/2) %>%
dplyr::select(site, Date, rain_mm, rain_subcanopy,
runoffL_oak, runoffL_euc, slope_oak, slope_euc)
library(DT)
DT::datatable(runoff,options=list(scrollX=T))
```
*Note: to just rename a variable, use `rename` instead of `mutate`. It will stay in position.*
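For instance, a hypothetical rename of `rain_mm` (my example, using the `runoff` tibble built above):
```{r}
runoff %>% rename(rainfall_mm = rain_mm) %>% names()
```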
**Helper functions for `select()`**
In the `select()` example above, we listed all of the variables, but there are a variety
of helper functions for using logic to specify which variables to select; a few of these are applied in the sketch after this list:
- `contains("_")` or any substring of interest in the variable name
- `starts_with("runoff")`
- `ends_with("euc")`
- `everything()`
- `matches()` a regular expression
- `num_range("x",1:5)` for the common situation where a series of variable names combine a string and a number
- `one_of(myList)` for when you have a group of variable names
- a range of variables: e.g. `runoffL_oak:slope_euc` could have followed `rain_subcanopy` above
- all but (`-`): preface a variable or a set of variable names with `-` to select all others
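Here's that sketch, applying a few of these helpers to the `runoff` tibble created above (my own illustrations; each could be written several ways):
```{r}
runoff %>% dplyr::select(starts_with("runoff")) %>% head(3)     # name prefix
runoff %>% dplyr::select(contains("slope")) %>% head(3)         # substring match
runoff %>% dplyr::select(site, rain_mm:runoffL_euc) %>% head(3) # a range
runoff %>% dplyr::select(-starts_with("slope")) %>% head(3)     # all but
```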
### filter
`filter` lets you select observations that meet criteria, similar to an SQL WHERE clause.
```{r}
runoff2007 <- runoff %>%
filter(Date >= as.Date("01/01/2007", "%m/%d/%Y"))
DT::datatable(runoff2007,options=list(scrollX=T))
```
**Filtering out NA with `!is.na`**
Here's an important one. There are many times you need to avoid NAs.
We commonly see summary statistics using `na.rm = TRUE` in order to *ignore* NAs when calculating a statistic like `mean`.
To simply filter out NAs from a vector or a variable, use a filter:
`feb_filt <- feb_s %>% filter(!is.na(TEMP))`
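Since `feb_s` comes from a weather example not loaded here, a runnable equivalent using the `runoff` tibble from above might be (my sketch):
```{r}
runoff_complete <- runoff %>% filter(!is.na(runoffL_oak))
nrow(runoff) - nrow(runoff_complete)  # rows dropped for having NA oak runoff
```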
### Writing a data frame to a csv
Let's say you have created a data frame, maybe with `read_csv`:
`runoff20062007 <- read_csv(csvPath)`
Then you do some processing to change it, maybe adding variables or reorganizing, and you want to write out your new `eucoak` data frame; you just need `write_csv`:
`write_csv(eucoak, "data/tidy_eucoak.csv")`
### Summarize by group
You'll find that you need to use this all the time with real data. You have a bunch of data where some categorical variable defines a grouping, like our `site` field in the eucoak data, and we'd like to create average slope, rainfall, and runoff for each site. Note that it involves two steps: first defining which field defines the group (`group_by`), then listing the various summary statistics we'd like to store (`summarize`). In this case all of the slopes under oak remain the same for a given site -- it's a *site* characteristic -- and the same applies to the euc slope, so we can just grab the first value (`mean` would have also worked, of course).
```{r warning=F, message=F}
eucoakSiteAvg <- runoff %>%
group_by(site) %>%
summarize(
rain = mean(rain_mm, na.rm = TRUE),
rain_subcanopy = mean(rain_subcanopy, na.rm = TRUE),
runoffL_oak = mean(runoffL_oak, na.rm = TRUE),
runoffL_euc = mean(runoffL_euc, na.rm = TRUE),
slope_oak = first(slope_oak),
slope_euc = first(slope_euc)
)
eucoakSiteAvg
```
**Summarizing by group with TRI data [`air_quality` project]** [@TRI]
```{r, message=FALSE, warning=F}
csvPath <- system.file("extdata","TRI_2017_CA.csv", package="iGIScData")
TRI_BySite <- read_csv(csvPath) %>%
mutate(all_air = `5.1_FUGITIVE_AIR` + `5.2_STACK_AIR`) %>%
filter(all_air > 0) %>%
group_by(FACILITY_NAME) %>%
summarize(
FACILITY_NAME = first(FACILITY_NAME),
air_releases = sum(all_air, na.rm = TRUE),
mean_fugitive = mean(`5.1_FUGITIVE_AIR`, na.rm = TRUE),
LATITUDE = first(LATITUDE), LONGITUDE = first(LONGITUDE))
```
### Count
Count is a simple variant on summarize by group, since the only statistic is the count of events. [**`eucoak` project**]
```{r message=F, warning=F}
tidy_eucoak %>% count(tree)
```
**Another way is to use n():**
```{r message=F, warning=F}
tidy_eucoak %>%
group_by(tree) %>%
summarize(n = n())
```
### Sorting after summarizing
Using the marine debris data from the *Marine Debris Monitoring and Assessment Project* [@marineDebris] [in a **new `litter` project**]
```{r message=F, warning=F}
shorelineLatLong <- ConcentrationReport %>%
group_by(`Shoreline Name`) %>%
summarize(
latitude = mean((`Latitude Start`+`Latitude End`)/2),
longitude = mean((`Longitude Start`+`Longitude End`)/2)
) %>%
arrange(latitude)
shorelineLatLong
```
## The dot operator
The dot "." operator derives from UNIX syntax, and refers to "here".
- For accessing files in the current folder, the path is "./filename"
A similar specification is used in piped sequences
- The advantage of the pipe is you don't have to keep referencing the data frame.
- The dot is then used to connect to items inside the data frame [in **`air_quality`**]
```{r message=F, warning=F}
csvPath <- system.file("extdata","TRI_1987_BaySites.csv", package="iGIScData")
TRI87 <- read_csv(csvPath)
stackrate <- TRI87 %>%
mutate(stackrate = stack_air/air_releases) %>%
.$stackrate
head(stackrate)
```
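The dot can also pass the piped data frame to a function that doesn't take data as its first argument, such as `lm` (a sketch I've added, reusing the `TRI87` variables above):
```{r}
# . supplies the piped data frame to lm's data argument
TRI87 %>% lm(stack_air ~ air_releases, data = .) %>% summary()
```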
## Exercises
1. Create a tibble with 20 rows of two variables `norm` and `unif` with `norm` created with `rnorm()` and `unif` created with `runif()`. [**`generic_methods`**]
2. Read in "TRI_2017_CA.csv" [**`air_quality`**] in two ways, as a normal data frame assigned to `df` and as a tibble assigned to `tb`. What field names result for what's listed in the CSV as `5.1_FUGITIVE_AIR`?
3. Use the summary function to investigate the variables in either the data.frame or tibble you just created. What type of field and what values are assigned to BIA_CODE?
4. Create a boxplot of `body_mass_g` by `species` from the `penguins` data frame in the palmerpenguins package [in a **`penguins` project**] [@palmer]. Access the data with data(package = 'palmerpenguins'), and also remember `library(ggplot2)` or `library(tidyverse)`.
```{r include=FALSE}
library(tidyverse)
library(palmerpenguins)
```
```{r include=FALSE}
ggplot(penguins, aes(x=species, y=body_mass_g)) + geom_boxplot()
```
5. Use select, mutate, and the pipe to create a penguinMass tibble where the only original variable retained is species, but with body_mass_kg created as $\frac{1}{1000}$ the body_mass_g. The statement should start with `penguinMass <- penguins` and use a pipe plus the other functions after that.
```{r include=FALSE}
penguinMass <- penguins %>%
mutate(body_mass_kg = body_mass_g / 1000) %>%
dplyr::select(species, body_mass_kg)
penguinMass
```
6. Now, also with penguins, create FemaleChinstraps to include only the female Chinstrap penguins. Start with `FemaleChinstraps <- penguins %>%`
```{r include=FALSE}
FemaleChinstraps <- penguins %>%
filter(sex == "female") %>%
filter(species == "Chinstrap")
FemaleChinstraps
```
7. Now, summarize by `species` groups to create mean and standard deviation variables from `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, and `body_mass_g`. Preface the variable names with either `avg.` or `sd.` Include `na.rm=T` with all statistics function calls.
```{r include=FALSE}
library(palmerpenguins); library(tidyverse)
penguins %>%
group_by(species, sex) %>%
summarize(avg.bill_length_mm = mean(bill_length_mm, na.rm=T),
avg.bill_depth_mm = mean(bill_depth_mm, na.rm=T),
avg.flipper_length_mm = mean(flipper_length_mm, na.rm=T),
avg.body_mass_g = mean(body_mass_g, na.rm=T),
sd.bill_length_mm = sd(bill_length_mm, na.rm=T),
sd.bill_depth_mm = sd(bill_depth_mm, na.rm=T),
sd.flipper_length_mm = sd(flipper_length_mm, na.rm=T),
sd.body_mass_g = sd(body_mass_g, na.rm=T))
```
8. Sort the penguins by `body_mass_g`.
```{r include=FALSE}
penguins %>%
arrange(body_mass_g)
```
## String Abstraction
Character string manipulation is surprisingly critical to data analysis, and so
the **`stringr`** package was developed to provide a wider array of string processing
tools than what is in base R, including functions for detecting matches, subsetting strings,
managing lengths, replacing substrings with other text, and joining, splitting, and sorting strings.
We'll look at some of the stringr functions, but a good way to learn about the wide array of functions
is through the cheat sheet that can be downloaded from
<a href="https://www.rstudio.com/resources/cheatsheets/">https://www.rstudio.com/resources/cheatsheets/</a>.
### Detecting matches
These functions look for patterns within existing strings, which can then be used to subset observations
based on those patterns. [These can be investigated in your **`generic_methods` project**]
- **`str_detect`** detects patterns in a string, returns true or false if detected
- **`str_locate`** detects patterns in a string, returns start and end position if detected, or NA if not
- **`str_which`** returns the indices of strings that match a pattern
- **`str_count`** counts the number of matches in each string
```{r include=F}
library(tidyverse)
library(iGIScData)
```
```{r warning=F}
str_detect(fruit,"qu")
fruit[str_detect(fruit,"qu")]
tail(str_locate(fruit, "qu"),15)
str_which(fruit, "qu")
fruit[str_which(fruit,"qu")]
str_count(fruit,"qu")
```
### Subsetting Strings
Subsetting in this case includes its normal use of abstracting the observations specified
by a match (similar to a filter for data frames), as well as extracting just part of a string, either by start and end character positions or as the part that matches an expression.
- **`str_sub`** extracts a part of a string from a start to an end character position
- **`str_subset`** returns the strings that contain a pattern match
- **`str_extract`** returns the first (or if `str_extract_all` then all matches) pattern matches
- **`str_match`** returns the first (or `_all`) pattern match as a matrix
```{r}
qfruit <- str_subset(fruit, "q")
qfruit
str_sub(qfruit,1,2)
str_sub("94132",1,2)
str_extract(qfruit,"[aeiou]")
```
### String Length
The length of strings is often useful in an analysis process, either just knowing the length
as an integer, or purposefully increasing or reducing it.
- **`str_length`** simply returns the length of the string as an integer
- **`str_pad`** adds a specified character (typically a space " ") to either end of a string
- **`str_trim`** removes whitespace from either end of a string
```{r}
qfruit <- str_subset(fruit,"q")
qfruit
str_length(qfruit)
name <- "Inigo Montoya"
str_length(name)
firstname <- str_sub(name,1,str_locate(name," ")[1]-1)
firstname
lastname <- str_sub(name,str_locate(name," ")[1]+1,str_length(name))
lastname
str_pad(qfruit,10,"both")
str_trim(str_pad(qfruit,10,"both"),"both")
```
### Replacing substrings with other text ("mutating" strings)
These methods range from converting case to replacing substrings.
- **`str_to_lower`** converts strings to lower case
- **`str_to_upper`** converts strings to upper case
- **`str_to_title`** capitalizes strings (makes the first character of each word upper case)
- **`str_sub`** a special use of this function to replace substrings with a specified string
- **`str_replace`** replaces the first matched pattern (or all with `str_replace_all`) with a specified string
```{r}
str_to_lower(name)
str_to_upper(name)
str_to_title("for whom the bell tolls")
str_sub(name,1,str_locate(name," ")[1]-1) <- "Diego"  # replace the first name
name
str_replace(qfruit,"q","z")
```
### Concatenating and splitting
One very common string operation is concatenating strings, and somewhat less common
though useful is splitting them at a key separator like a space, comma, or line end. One use of
`str_c` in the example below is to create a comparable join field by adding a leading zero to a numeric character
string that needs it.
- **`str_c`** The `paste()` function in base R will work, but you might want the default separator to be "" instead
of " "; `str_c` is just `paste` with a default "" separator, though you can still specify `sep=" "`.
- **`str_split`** splits a string into parts based upon the detection of a specified separator like a space, comma, or line end
```{r message=F}
str_split("for whom the bell tolls", " ")
str_c("for","whom","the","bell","tolls",sep=" ")
csvPath <- system.file("extdata","CA_MdInc.csv",package="iGIScData")
CA_MdInc <- read_csv(csvPath)
join_id <- str_c("0",CA_MdInc$NAME) # could also use str_pad(CA_MdInc$NAME,1,side="left",pad="0")
head(CA_MdInc)
head(join_id)
```
## Dates and times with `lubridate`
The `lubridate` package makes it easy to work with dates and times:
- It can parse many forms.
- We'll look at more with time series.
- See the cheat sheet for more information, but the following examples may
demonstrate that it's pretty easy to use, and does a good job of making your
job easier.
```{r}
library(lubridate)
dmy("20 September 2020")
dmy_hm("20 September 2020 10:45")
mdy_hm("September 20, 2020 10:48")
mdy_hm("9/20/20 10:50")
mdy("9.20.20")
start704 <- dmy_hm("24 August 2020 16:00")
end704 <- mdy_hm("12/18/2020 4:45 pm")
year(start704)
month(start704)
day(end704)
hour(end704)
end704-start704
as_date(end704)
hms::as_hms(end704)
```
Note the use of `::` after the package name.
Sometimes you need to specify the package and function name this way, for instance when more than one loaded package has a function of the same name:
`dplyr::select(...)`
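For instance (my example), the `MASS` package also provides a `select` function, so if both were loaded an unqualified `select()` call could be masked; the `::` prefix removes the ambiguity:
```{r}
dplyr::select(eucoakrainfallrunoffTDR, site, date) %>% head()
```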