This repository has been archived by the owner on Apr 10, 2024. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathxx-why-R.Rmd
97 lines (80 loc) · 4.2 KB
/
xx-why-R.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
---
layout: page
title: R for reproducible scientific analysis
subtitle: Why use R?
minutes: xx
---
```{r, include=FALSE}
library(ggplot2)
theme_set(theme_bw())
source("tools/chunk-options.R")
library(dplyr)
gapminder <- tbl_df(read.csv("data/gapminder-FiveYearData.csv"))
#gapminder <- read.csv("data/gapminder-FiveYearData.csv", header=TRUE, sep=',')
```
> ### Learning Objectives {.objectives}
>
> * To get a taste of R's powerful visualisation capabilities
> * To get a taste of R's powerful statistical analysis capabilities
> * To show how interweaving those capabilities pays off
>
### Introduction to R
Welcome to the R portion of the Software Carpentry workshop. We're going to show you how R and RStudio can help you understand large data sets. We'll also guide your first steps towards using them effectively for your own work.
> ### Installation
> * Download RStudio from [http://www.rstudio.com/products/rstudio/download/](http://www.rstudio.com/products/rstudio/download/)
> * (Download gapminder data -- can we include in this repo?}
> * Once you've got RStudio installed, open it.
> * In the interactive console (left tab), type:
> * > install.packages('ggplot2', 'dplyr', 'tidyr')
> * and hit return, which will tell RStudio to find and install packages that we're going to use.
> * (what else?)
We're going to start with a simple but powerful example of how R can help you visualize, manipulate, and analyze data. In the interactive console, enter each command. Later lessons will go more deeply into what they do and how to effectively leverage R and its packages.
Let's start by loading a data set and seeing how big it is.
```{r}
gapminder <- read.csv("data/gapminder-FiveYearData.csv", header=TRUE, sep=',')
nrow(gapminder)
```
1,704 entries: that's too many to understand by reading. Let's look at the first few entries to get a better sense of what we have:
```{r}
head(gapminder)
```
Interesting: the data concerns countries, years, "pop", "lifeExp", and "gdpPercap". (The person who created the data set choose those abbreviations for "Population", "Life Expectancy", and "GDP per Capita," respectively.) Let's see if we can get a better handle on it by visualizing it. Load the `ggplot2` plotting package and construct a scatter plot.
```{r lifeExp-gdpPercap-scatter, message=FALSE}
library(ggplot2)
ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) +
geom_point()
```
In the lower right panel, you should see a graph that RStudio produced in response to your command. What can you tell about this data set from this initial graph?
This first graph suggests a relationship between life expectancy and GDP per capita. Another relationship we might be interested in is the change in life expectancy over time by country and continent.
```{r year-lifeExp}
ggplot(data = gapminder, aes(x = year, y = lifeExp, by = country, colour = continent)) +
geom_line() +
geom_point()
```
The plots above are great for visualizing data, but what if we want to figure out something quantitative about the relationships and patterns we observed? R gives you flexible and powerful tools to do manipulation and computation on your data.
Let's use the `dplyr` package to find the pairwise correlations between life expectancy, GDP per capita, and population.
```{r}
library(dplyr)
cors <- gapminder %>%
group_by(year) %>%
summarise(gdpPercap.lifeExp = cor(gdpPercap, lifeExp),
gdpPercap.pop = cor(gdpPercap, pop),
pop.lifeExp = cor(pop, lifeExp))
head(cors)
```
This is interesting, but it's now in a form that's hard to give to ggplot. We can use the `tidyr` package to put the data into tidy form.
```{r}
library(tidyr)
tidy.cors <- cors %>%
gather(variables, correlation, gdpPercap.lifeExp, gdpPercap.pop, pop.lifeExp)
head(tidy.cors)
```
Now we can visualize all of these relationships on one plot, and see how the correlations between all these variables change over time.
```{r year-cors, fig.width=12}
ggplot(tidy.cors, aes(x = year, y = correlation, colour = variables)) +
geom_point() +
geom_line() +
theme_bw()
```
Just a few minutes with R, and we have learned that our data set contains a string and interesting relationship between GDP per capita and life expectancy.
Now let's dig into the details of using R.