-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathHW2.Rmd
223 lines (129 loc) · 5.17 KB
/
HW2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
---
title: "HW2"
author: "Amin Yakubu"
date: "3/16/2019"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(warning = F)
knitr::opts_chunk$set(message = F)
```
```{r}
library(caret)
library(tidyverse)
library(gam)
library(boot)
library(mgcv)
```
Let's read in the data
```{r}
concrete_df = read_csv('data/concrete.csv')
attach(concrete_df)
```
```{r}
X = model.matrix(CompressiveStrength ~ ., concrete_df)[,-1]
y = concrete_df$CompressiveStrength
```
## Question A
Scatter plots to visualize the distribution of the variables
```{r}
theme1 <- trellis.par.get()
theme1$plot.symbol$col <- rgb(.3, .5, .2, .5)
theme1$plot.symbol$pch <- 18
theme1$plot.line$col <- rgb(.8, .1, .1, 1)
theme1$plot.line$lwd <- 2
theme1$strip.background$col <- rgb(.0, .2, .6, .2)
trellis.par.set(theme1)
featurePlot(X, y, plot = "scatter", labels = c("","Y"),
type = c("p"), layout = c(4, 2))
```
## Question B
Here I am using cross-validation to select the optimal degreedforthe polynomial. I'm doing to 5 so as to see if the test error will decrease or increase at 5.
```{r}
set.seed(1)
deltas <- rep(NA, 5)
for (i in 1:5) {
fit <- glm(CompressiveStrength ~ poly(Water, i), data = concrete_df)
deltas[i] <- cv.glm(concrete_df, fit, K = 10)$delta[1]
}
plot(1:5, deltas, xlab = "Degree", ylab = "Test MSE", type = "l")
d.min <- which.min(deltas)
points(which.min(deltas), deltas[which.min(deltas)], col = "red", cex = 2, pch = 20)
```
We see that the optimal d chosen is 4.
Now let's use ANOVA to test the subsets.
```{r}
fit1 = lm(CompressiveStrength ~ Water, data = concrete_df)
fit2 = lm(CompressiveStrength ~ poly(Water, 2), data = concrete_df)
fit3 = lm(CompressiveStrength ~ poly(Water, 3), data = concrete_df)
fit4 = lm(CompressiveStrength ~ poly(Water, 4), data = concrete_df)
fit5 = lm(CompressiveStrength ~ poly(Water, 5), data = concrete_df)
```
```{r}
anova(fit1, fit2, fit3, fit4, fit5)
```
Using ANOVA and by examining the p-values we see that degree 4 or 5 polynomial appear to provide a reasonable fit to the data. I will choose the 4 polynomial since we want a parsimonious model.
```{r}
plot(CompressiveStrength ~ Water, data = concrete_df, col = "red")
waterlims <- range(concrete_df$Water)
water.grid <- seq(from = waterlims[1], to = waterlims[2], by = 1)
fit <- lm(CompressiveStrength ~ poly(Water, 4), data = concrete_df)
preds <- predict(fit, newdata = data.frame(Water = water.grid))
lines(water.grid, preds, col = "blue", lwd = 2)
```
Plots of the fits
```{r}
plot(CompressiveStrength ~ Water, data = concrete_df, col = "blue")
preds1 <- predict(fit1, newdata = data.frame(Water = water.grid))
preds2 <- predict(fit2, newdata = data.frame(Water = water.grid))
preds3 <- predict(fit3, newdata = data.frame(Water = water.grid))
preds4 <- predict(fit4, newdata = data.frame(Water = water.grid))
lines(water.grid, preds1, col = "red", lwd = 2)
lines(water.grid, preds2, col = "blue", lwd = 2)
lines(water.grid, preds3, col = "green", lwd = 2)
lines(water.grid, preds4, col = "brown", lwd = 2)
# Add a legend
legend(200, 80, legend = c("degree 1", "degree 2", "degree 3", "degree 4"),
col = c("red", "blue", "green", "brown"), lty = 1:1, cex = 0.8)
```
## Question C
Here we are using the generalized cross-validation to choose the degrees of freedom
```{r}
fit.ss <- smooth.spline(concrete_df$Water, concrete_df$CompressiveStrength, keep.data = T)
fit.ss$df
pred.ss <- predict(fit.ss, x = water.grid)
pred.ss.df <- data.frame(pred = pred.ss$y,
water = water.grid)
p <- ggplot(data = concrete_df, aes(x = Water, y = CompressiveStrength)) +
geom_point(color = rgb(.2, .4, .2, .5))
p + geom_line(aes(x = water, y = pred), data = pred.ss.df,
color = rgb(.8, .1, .1, 1)) + theme_bw() +
labs(title = paste(round(fit.ss$df), 'degrees of freedom from GCV')) +
theme(plot.title = element_text(hjust = 0.5))
```
The chosen degrees of freedom 68.88. This is high and makes the model more wiggly.
Here we will fit for different degrees of freedom for 2 to 10.
```{r}
par(mfrow = c(3,3)) # 3 x 3 grid
all.dfs = rep(NA, 9)
for (i in 2:10) {
fit.ss = smooth.spline(concrete_df$Water, concrete_df$CompressiveStrength, df = i)
pred.ss <- predict(fit.ss, x = water.grid)
plot(concrete_df$Water, concrete_df$CompressiveStrength, cex = .5, col = "darkgrey")
title(paste("Degrees of freedom = ", round(fit.ss$df)), outer = F)
lines(water.grid, pred.ss$y, lwd = 2, col = "blue")
}
```
## Question D
Here I will find a GAM using all the predictors. From the feature plot, Water and Age doesn't look line so I'll fit a smooth spline on those variables.
```{r}
gam.m1 = mgcv::gam(CompressiveStrength ~ Cement + BlastFurnaceSlag + FlyAsh + s(Water) + Superplasticizer + CoarseAggregate + FineAggregate + s(Age), data = concrete_df)
par(mfrow = c(1,2))
plot(gam.m1)
```
```{r}
gam.m2 = mgcv::gam(CompressiveStrength ~ Cement + BlastFurnaceSlag + FlyAsh + Water + Superplasticizer + CoarseAggregate + FineAggregate + Age, data = concrete_df)
anova(gam.m1, gam.m2, test = 'F')
```
According the p-value the anova, non linear model with splines on Age and Water is better.