---
title: "Introduction to Environmental Data Science"
author:
- name: Jerry Davis, SFSU Institute for Geographic Information Science
contributor:
- name: Anna Studwell, SFSU Institute for Geographic Information Science (cover illustration)
date: "`r Sys.Date()`"
site: bookdown::bookdown_site
output:
  bookdown::gitbook:
    cover-image: "img/eaRth_anna.png"
  bookdown::epub_book:
    cover-image: "img/eaRth_anna.png"
documentclass: book
bibliography: [book.bib, packages.bib]
biblio-style: apalike
link-citations: yes
thanks: Cover illustration by Anna Studwell
github-repo: iGISc/EnvDataSci
description: "Background, concepts and exercises for using R for environmental data science. The focus is on applying the R language and various libraries for data abstraction, transformation, data analysis, spatial data/mapping, statistical modeling, and time series, applied to environmental research. Applies exploratory data analysis methods and tidyverse approaches in R, and includes contributed chapters presenting research applications, with associated data and code packages."
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```
# Introduction {#chapter00.chapter_section .intro_section}
![Cover art by Anna Studwell](img/eaRth_anna296.png)
^[*"Dandelion fluff -- Ephemeral stalk sheds seeds to the universe"*]
Data science is *an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data* (Wikipedia). A data science approach is especially suitable for applications involving large and complex data sets, and environmental data is a prime example, with rapidly growing collections from automated sensors in space and time domains.
*Environmental* data science is data science applied to environmental science research. In general, data science can be seen as the intersection of math & statistics, computer science/IT, and some research domain; in this case, that domain is environmental science:
<img src="img/EnvDataSci50pct.png">
## Environmental Data and Methods
The methods needed for environmental research are wide-ranging, because environmental *data* are wide-ranging, including measurements in both time and space domains. Major categories include:
- Data analysis and transformation methods (see the sketch after this list)
  - importing and other methods to create rectangular data frames
  - reorganization and creation of fields
  - filtering observations
  - data joins
  - stratified statistical summaries
  - reorganizing data, including pivots
- Modeling
  - physical models
  - statistical modeling
  - models based on machine learning algorithms
- Visualization
  - graphics
- Spatial analysis & maps
  - vector and raster spatial analysis, e.g.
    - spatial joins
    - distance analysis
    - overlay analysis
  - spatial statistics
  - static and interactive maps
- Time series
  - analyzing and visualizing long-term data records (e.g. for climate change)
  - analyzing and visualizing high-frequency data from loggers
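As a taste of the first category, here's a minimal sketch of filtering, a stratified summary, and a pivot, using the tidyverse and lubridate packages we'll install shortly. The `gauge` tibble of stream-gauge readings and its fields are invented purely for illustration:
```
library(tidyverse)   # dplyr, tidyr, tibble, ggplot2, etc.
library(lubridate)   # date handling

# Hypothetical stream-gauge readings: site, date, flow (m^3/s)
gauge <- tibble(
  site = c("A", "A", "B", "B"),
  date = ymd(c("2021-01-01", "2021-01-02", "2021-01-01", "2021-01-02")),
  flow = c(1.2, 1.5, 0.8, 0.9)
)

# Filter observations, then compute a stratified statistical summary
gauge %>%
  filter(flow > 0.85) %>%
  group_by(site) %>%
  summarize(mean_flow = mean(flow))

# Reorganize the data with a pivot: one column per site
gauge %>%
  pivot_wider(names_from = site, values_from = flow)
```
We'll develop each of these methods properly in later chapters.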
## Goals of this book
While the methodological reach of data science is very great, and the spectrum of environmental data is as well, our goal is to lay the foundation and provide useful introductory methods in the areas outlined above, but, as a "live" book, to be able to extend into more advanced methods and provide a growing suite of research examples with associated data sets. We'll briefly explore some data mining methods that can be applied to so-called "big data" challenges, but our focus is on **exploratory data analysis** in general, applied to *environmental data in space and time domains*. Much of our data will in fact be quite *small*, derived from field-based environmental measurements; it's primarily in the areas of time series and imagery, where automated data capture is employed, that we'll dip our toes into what you might call big-data issues.
### Some definitions:
**Big data**: *data having a size or complexity too big to be processed effectively by traditional software*
- data with many cases or dimensions (including imagery)
- many applications in environmental science due to the great expansion of automated environmental data capture in space and time domains
- big data challenges exist across the spectrum of the environmental research process, from data capture and storage to sharing, visualization, and querying
**Data Mining**: *discovering patterns in large data sets*
- databases collected by government agencies
- imagery data from satellite, aerial (including drone) sensors
- time-series data from long-term data records or high-frequency data loggers
- methods may involve machine learning / artificial intelligence / computer vision
**Exploratory data analysis**: *procedures for analyzing data, techniques for interpreting the results of such procedures, ways of structuring data to make its analysis easier*
- summarizing
- restructuring
- visualization
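To make these three activities concrete, here's a minimal sketch using the `airquality` data frame that ships with base R; the choice of variables is arbitrary, and it assumes the tidyverse packages listed later in this chapter are installed:
```
library(tidyverse)

# Summarizing: five-number summary plus mean of daily ozone readings
summary(airquality$Ozone)

# Restructuring: gather two measurement columns into long form
aq_long <- airquality %>%
  pivot_longer(cols = c(Ozone, Temp),
               names_to = "variable", values_to = "value")

# Visualization: compare the two distributions side by side
ggplot(aq_long, aes(value)) +
  geom_histogram(bins = 20) +
  facet_wrap(~ variable, scales = "free_x")
```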
## Exploratory Data Analysis
Just as *exploration* is a part of what *National Geographic* has long covered, it's an important part of geographic and environmental science research.
<img src="img/CaveExploring.png">
**Exploratory Data Analysis** is exploration applied to data, and has grown as an alternative approach to traditional statistical analysis. This basic approach perhaps dates back to the work of Thomas Bayes in the 18th century, but @tukey1961 may have best articulated the basic goals of this approach in defining the "data analysis" methods he was promoting: "Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data."
Some years later, @tukey1977 followed up with *Exploratory Data Analysis*:
<img src="img/Tukey1977.png">
- EDA is an approach to analyzing data via summaries and graphics. The key word is *exploratory*.
  - In contrast to *confirmatory* statistics
- Objectives:
  - suggest hypotheses
  - assess assumptions on which inference will be based
  - select appropriate statistical tools
  - guide further data collection
- Led to the development of S, then R
  - Built on clear design and extensive, clear graphics, one key to exploring data
  - S developed at Bell Labs by John Chambers, 1976
<img src="img/ChambersGraphicalMethodsForDataAnalysis.png">
- R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues.
<img src="img/R.png">
## Software and data we'll need
First, we're going to use the R language. It's not the only way to do data analysis, and Python is probably the leading data science language overall, but R is clearly the leading language for academic research, especially in the environmental sciences.
For a start, you'll need to have R and RStudio installed, and at least the following packages:
- tidyverse (to include ggplot2, dplyr, tidyr, stringr, etc.)
  - ggplot2
  - dplyr
  - stringr
  - tidyr
- lubridate
- sf
- raster
- tmap
There will be other packages we'll meet along the way.
To install these packages, the following code will work:
```
install.packages(c("tidyverse", "lubridate", "sf", "raster", "tmap"))
```
You can always add more packages as needed, one at a time or in batches. But
generally *don't reinstall a package unless you actually want to reinstall it*, for instance because it's been updated. For that reason I generally don't include `install.packages()` in my scripts. Once installed, you can access a package with the `library` function, e.g.
```{r message=F}
library(tidyverse)
```
which you *will* want to include in your script.
### Data
We'll be using data from various sources, including data packages on CRAN, which you install the same way as the code packages above -- so use `install.packages("palmerpenguins")`.
We've also created a repository on GitHub that includes data we've developed in the Institute for Geographic Information Science (iGISc) at SFSU, and you'll need to install that package a slightly different way.
<img src="img/logoIGISc200200.png" align=right>
GitHub packages require a bit more work on the user's part since we need to first install `remotes`^[Note: you can also use `devtools` instead of `remotes` if you have that installed. They do the same thing; `remotes` is a subset of `devtools`. If you see a message about Rtools, you can ignore it, since that is only needed for building packages from source code, such as C++.], then use that to install the GitHub data package:
```
install.packages("remotes")
remotes::install_github("iGISc/iGIScData")
```
Then you can access it just like other built-in data by including:
```{r message=F}
library(iGIScData)
```
To see what's in it, list the various included datasets with:
```
data(package="iGIScData")
```
For instance, the following is a map of California counties using the `CA_counties` sf (simple features) data:
```{r}
library(tidyverse); library(iGIScData); library(sf)
ggplot(data=CA_counties) + geom_sf()
```
The package datasets can be used directly as sf data (if the sf library is installed) or data frames (all tibbles). Raw data can also be read from the `extdata` folder that is installed on your computer when you install the package, using code such as:
```
csvPath <- system.file("extdata", "TRI_1987_BaySites.csv", package = "iGIScData")
TRI87 <- read_csv(csvPath)  # read_csv() comes from readr, loaded with the tidyverse
```
or something similar for shapefiles, such as:
```
shpPath <- system.file("extdata", "trails.shp", package = "iGIScData")
trails <- st_read(shpPath)  # st_read() comes from the sf package
```