
Lesson 5
========================================================

### Multivariate Data
Notes:

***

### Moira Perceived Audience Size Colored by Age
Notes:

***

### Third Qualitative Variable
Notes:

```{r Third Qualitative Variable}
setwd("~/Repos/data-analysis-with-r")
library(ggplot2)
library(dplyr)
pf <- read.csv("./data/pseudo_facebook.tsv", sep ="\t")

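# Box plot of age by gender; the x marks each group's mean age.
# `fun` replaces the deprecated `fun.y` (ggplot2 >= 3.3.0).
ggplot(aes(x = gender, y = age),
       data = subset(pf, !is.na(gender))) + geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 4)

# Median friend count across age, one line per gender
ggplot(data = subset(pf, !is.na(gender)),
       aes(x = age, y = friend_count)) +
  geom_line(aes(color = gender), stat = "summary", fun = median)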

# Write code to create a new data frame,
# called 'pf.fc_by_age_gender', that contains
# information on each age AND gender group.

# The data frame should contain the following variables:

#    mean_friend_count,
#    median_friend_count,
#    n (the number of users in each age and gender grouping)


pf.fc_by_age_gender <- subset(pf, !is.na(gender)) %>%
  group_by(age,gender) %>%
  summarise(mean_friend_count = mean(friend_count),
            median_friend_count = median(as.numeric(friend_count)),
            n = n())

head(pf.fc_by_age_gender)

```
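
To see what the `group_by()` plus `summarise()` step produces, here is a minimal sketch on made-up data. The data frame `toy` and its values are hypothetical and only stand in for `pf`; the column names mirror the ones used above.

```r
library(dplyr)

# Hypothetical toy data: six users, two ages, two genders
toy <- data.frame(
  age          = c(20, 20, 20, 21, 21, 21),
  gender       = c("female", "female", "male", "female", "male", "male"),
  friend_count = c(10, 30, 5, 8, 2, 6)
)

# One output row per (age, gender) combination
toy_summary <- toy %>%
  group_by(age, gender) %>%
  summarise(mean_friend_count   = mean(friend_count),
            median_friend_count = median(friend_count),
            n = n()) %>%
  ungroup()  # drop the leftover grouping so later steps see a plain table

toy_summary
```

The result has four rows, one for each age and gender pair, which is the same shape `pf.fc_by_age_gender` has for the full data set.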

***

### Plotting Conditional Summaries
Notes:

```{r Plotting Conditional Summaries}
ggplot(pf.fc_by_age_gender, aes(x=age,y=median_friend_count)) + 
  geom_line(aes(color = gender))
```

***

### Thinking in Ratios
Notes:

***

### Wide and Long Format
Notes:

***

### Reshaping Data
Notes:

```{r}
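# Install reshape2 only if it is missing, so knitting does not reinstall it on every run
if (!requireNamespace("reshape2", quietly = TRUE)) install.packages("reshape2")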
library(reshape2)

pf.fc_by_age_gender.wider <- dcast(pf.fc_by_age_gender,
                                   age ~ gender,
                                   value.var = "median_friend_count")
```
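
A small sketch of the long-to-wide reshape may help here. The data frame `long` is hypothetical; the formula `age ~ gender` says "one row per age, one column per gender", and `value.var` names the column that fills the cells.

```r
library(reshape2)

# Hypothetical long-format data: one row per (age, gender) pair
long <- data.frame(
  age                 = c(20, 20, 21, 21),
  gender              = c("female", "male", "female", "male"),
  median_friend_count = c(20, 5, 8, 4)
)

# Wide format: one row per age, one column per gender
wide <- dcast(long, age ~ gender, value.var = "median_friend_count")

# With both genders on the same row, a ratio is a simple column division
wide$ratio <- wide$female / wide$male
wide
```

In the wide layout the female and male medians for a given age sit side by side, which is what makes a per-age ratio easy to compute.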


***

### Ratio Plot
Notes:

```{r Ratio Plot}
# Plot the ratio of the female to male median
# friend counts using the data frame
# pf.fc_by_age_gender.wider.

# Think about what geom you should use.
# Add a horizontal line to the plot with
# a y intercept of 1, which will be the
# base line. Look up the documentation
# for geom_hline to do that. Use the parameter
# linetype in geom_hline to make the
# line dashed.
pf.fc_by_age_gender.wider$ratio <- 
  pf.fc_by_age_gender.wider$female / pf.fc_by_age_gender.wider$male

ggplot(pf.fc_by_age_gender.wider, aes(x = age, y = ratio)) + 
  geom_line() +
  geom_hline(yintercept = 1, linetype = 2)  # dashed baseline at a ratio of 1

```

***

### Third Quantitative Variable
Notes:

```{r Third Quantitative Variable}
# Create a variable called year_joined
# in the pf data frame using the variable
# tenure and 2014 as the reference year.

# The variable year joined should contain the year
# that a user joined facebook.

pf$year_joined <- floor(2014 - pf$tenure/365)
summary(pf$year_joined)
table(pf$year_joined)

```
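
The arithmetic behind `year_joined` can be checked on a few hand-picked values. The vector `tenure_days` below is hypothetical; the formula is the same one applied to `pf$tenure` above.

```r
# Hypothetical tenures in days
tenure_days <- c(0, 100, 365, 366, 1000)

# Convert days to years, subtract from the reference year, round down
year_joined <- floor(2014 - tenure_days / 365)
year_joined
```

A tenure of 0 days maps to 2014, any tenure from 1 to 365 days maps to 2013, and 366 days tips over into 2012.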

***

### Cut a Variable
Notes:

```{r Cut a Variable}
# Create a new variable in the data frame
# called year_joined.bucket by using
# the cut function on the variable year_joined.

# You need to create the following buckets for the
# new variable, year_joined.bucket

#        (2004, 2009]
#        (2009, 2011]
#        (2011, 2012]
#        (2012, 2014]

pf$year_joined.bucket <- cut(pf$year_joined,
                             breaks=c(2004,2009,2011,2012,2014))
table(pf$year_joined.bucket)
```
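
As a quick check of how `cut()` assigns values to these buckets, here is a sketch with a hypothetical vector of years. By default each interval is open on the left and closed on the right, so 2009 lands in `(2004,2009]` and a value equal to the lowest break would become `NA`.

```r
# Hypothetical join years
years  <- c(2005, 2009, 2010, 2011, 2012, 2013, 2014)
breaks <- c(2004, 2009, 2011, 2012, 2014)

buckets <- cut(years, breaks = breaks)
table(buckets)

# The lowest break itself is excluded because intervals are left-open
cut(2004, breaks = breaks)
```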
***

### Plotting it All Together
Notes:

```{r Plotting it All Together}
# Create a line graph of friend_count vs. age
# so that each year_joined.bucket is a line
# tracking the median user friend_count across
# age. This means you should have four different
# lines on your plot.

# You should subset the data to exclude the users
# whose year_joined.bucket is NA.

ggplot(aes(x = age, y = friend_count),
       data = subset(pf, !is.na(year_joined.bucket))) +
  geom_line(aes(color = year_joined.bucket), stat = "summary", fun = median)
```

***

### Plot the Grand Mean
Notes:

```{r Plot the Grand Mean}
# Write code to do the following:

# (1) Add another geom_line to code below
# to plot the grand mean of the friend count vs age.

# (2) Exclude any users whose year_joined.bucket is NA.

# (3) Use a different line type for the grand mean.

ggplot(aes(x = age, y = friend_count),
       data = subset(pf, !is.na(gender) & !is.na(year_joined.bucket))) +
  # One line per cohort
  geom_line(aes(color = year_joined.bucket), stat = "summary", fun = mean) +
  # Grand mean across all cohorts, drawn dashed
  geom_line(stat = "summary", fun = mean, linetype = 2)

```

***

### Friending Rate
Notes:

```{r Friending Rate}
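# friending_rate: friends per day of tenure, for users with at least one day of tenure
pf.tenure <- subset(pf, tenure > 0)
pf.tenure$friending_rate <- pf.tenure$friend_count / pf.tenure$tenure

summary(pf.tenure$friending_rate)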
```

***

### Friendships Initiated
Notes:

What is the median friending rate?
0.2205

What is the maximum friending rate?
417

```{r Friendships Initiated}
# Create a line graph of mean of friendships_initiated per day (of tenure)
# vs. tenure colored by year_joined.bucket.

ggplot(subset(pf, tenure >= 1), aes(x = tenure, y = friendships_initiated / tenure)) +
  # Smoothed trend per cohort
  geom_smooth(aes(color = year_joined.bucket)) +
  # Raw daily means per cohort
  geom_line(aes(color = year_joined.bucket), stat = "summary", fun = mean)
```

***

### Bias-Variance Tradeoff Revisited
Notes:

```{r Bias-Variance Tradeoff Revisited}

# No binning: one mean per day of tenure (low bias, high variance)
ggplot(aes(x = tenure, y = friendships_initiated / tenure),
       data = subset(pf, tenure >= 1)) +
  geom_line(aes(color = year_joined.bucket),
            stat = "summary",
            fun = mean)

# 7-day bins
ggplot(aes(x = 7 * round(tenure / 7), y = friendships_initiated / tenure),
       data = subset(pf, tenure > 0)) +
  geom_line(aes(color = year_joined.bucket),
            stat = "summary",
            fun = mean)

# 30-day bins
ggplot(aes(x = 30 * round(tenure / 30), y = friendships_initiated / tenure),
       data = subset(pf, tenure > 0)) +
  geom_line(aes(color = year_joined.bucket),
            stat = "summary",
            fun = mean)

# 90-day bins (smoothest line, most detail lost)
ggplot(aes(x = 90 * round(tenure / 90), y = friendships_initiated / tenure),
       data = subset(pf, tenure > 0)) +
  geom_line(aes(color = year_joined.bucket),
            stat = "summary",
            fun = mean)

```
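
The expression `7 * round(tenure / 7)` snaps each tenure to the nearest multiple of 7, so all users in the same week share one x value and their rates are averaged together. A minimal sketch with hypothetical numbers (`tenure` and `rate` are made up):

```r
# Hypothetical tenures (days) and per-user friending rates
tenure <- 1:10
rate   <- c(1, 3, 2, 4, 6, 5, 7, 3, 5, 5)

# Snap each tenure to the nearest multiple of 7
bin7 <- 7 * round(tenure / 7)
bin7

# Mean rate per bin: ten noisy points collapse into two averages
binned_mean <- tapply(rate, bin7, mean)
binned_mean
```

Wider bins average over more users, which lowers the variance of each plotted point but blurs short-term changes in the rate. That is the tradeoff the four plots above step through.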

***

### Sean's NFL Fan Sentiment Study
Notes:

***

### Introducing the Yogurt Data Set
Notes:

***

### Histograms Revisited
Notes:

```{r Histograms Revisited}
yo <- read.csv("./data/yogurt.csv")
summary(yo)
yo$id <- as.factor(yo$id)

ggplot(yo, aes(x=price)) + geom_histogram()

hist(yo$price)
```

***

### Number of Purchases
Notes:

```{r Number of Purchases}
# Create a new variable called all.purchases,
# which gives the total counts of yogurt for
# each observation or household.

# The transform function produces a data frame
# so if you use it then save the result to 'yo'!

?transform
yo <- transform(yo, all.purchases = strawberry + blueberry + 
                              pina.colada + plain + mixed.berry)
```
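
A short sketch of what `transform()` does here, using a hypothetical two-row data frame `toy_yo` with the same flavor columns as `yo`:

```r
# Hypothetical purchases: one row per observation, one column per flavor
toy_yo <- data.frame(
  strawberry  = c(0, 2),
  blueberry   = c(1, 0),
  pina.colada = c(0, 0),
  plain       = c(0, 3),
  mixed.berry = c(1, 1)
)

# transform() returns a new data frame, so the result must be assigned back
toy_yo <- transform(toy_yo, all.purchases = strawberry + blueberry +
                              pina.colada + plain + mixed.berry)
toy_yo$all.purchases
```

Inside `transform()` the column names can be used directly, without the `toy_yo$` prefix.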

***

### Prices over Time
Notes:

```{r Prices over Time}
ggplot(yo, aes(x=time, y=price)) + geom_point(alpha = 1/10)
```

***

### Sampling Observations
Notes:

***

### Looking at Samples of Households

```{r Looking at Sample of Households}
set.seed(4230)
sample.ids <- sample(levels(yo$id),16)

ggplot(data = subset(yo, id %in% sample.ids),
       aes(x = time, y = price)) + 
  facet_wrap( ~ id ) + 
  geom_line() + 
  geom_point(aes(size = all.purchases), pch = 1) 

# Same plot as above, wrapped in a function so different seeds are easy to compare
plotSampleOfHouseholds <- function(yo, seed) {
  set.seed(seed)
  sample.ids <- sample(levels(yo$id), 16)

  ggplot(data = subset(yo, id %in% sample.ids),
         aes(x = time, y = price)) +
    facet_wrap( ~ id ) +
    geom_line() +
    geom_point(aes(size = all.purchases), pch = 1) +
    ggtitle(paste("Seed: ", seed))
}
plotSampleOfHouseholds(yo,1)
plotSampleOfHouseholds(yo,2)
plotSampleOfHouseholds(yo,3)
```

***

### The Limits of Cross Sectional Data
Notes:

***

### Many Variables
Notes:

***

### Scatterplot Matrix
Notes:

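```{r}
# Keep columns 2 to 15 of pf (drops the first column)
pf_subset <- subset(pf, select = 2:15)

# Install psych only if it is missing
if (!requireNamespace("psych", quietly = TRUE)) install.packages("psych")
library(psych)

# Plot a random sample so the matrix renders quickly; the seed makes it reproducible.
# The sample size of 1,000 is an assumption: the original call passed every row,
# which left set.seed() with no effect.
set.seed(1836)
pf_sample <- pf_subset[sample.int(nrow(pf_subset), 1000), ]
pairs.panels(pf_sample, pch = ".")
```

***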

### Even More Variables
Notes:

***

### Heat Maps
Notes:

```{r}
nci <- read.table("nci.tsv")
colnames(nci) <- c(1:64)
```

```{r}
nci.long.samp <- melt(as.matrix(nci[1:200,]))
names(nci.long.samp) <- c("gene", "case", "value")
head(nci.long.samp)

ggplot(aes(y = gene, x = case, fill = value),
  data = nci.long.samp) +
  geom_tile() +
  scale_fill_gradientn(colours = colorRampPalette(c("blue", "red"))(100))
```


***

### Analyzing Three or More Variables
Reflection:

***

Click **KnitHTML** to see all of your hard work and to have an html
page of this lesson, your answers, and your notes!