---
title: "Homework 1"
author: "Amin Yakubu"
date: "2/23/2019"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```
```{r}
library(tidyverse)
library(caret)
```
# Data
```{r}
train = read_csv('./data/solubility_train.csv')
test = read_csv('./data/solubility_test.csv')
```
The data has been divided into training and testing sets. The training data has 951 observations and the test data has 356 observations. There are `r ncol(train) - 1` predictors: `r sum(apply(train, 2, function(x) { all(x %in% 0:1) }))` are binary variables that indicate the presence or absence of a particular chemical substructure, 16 are count descriptors, such as the number of bonds or the number of bromine atoms, and 4 are continuous descriptors, such as molecular weight or surface area. The response is in the column `Solubility`, which is a continuous variable.
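As a quick sanity check, the split between binary fingerprints and the remaining descriptors can be tallied directly (a minimal sketch; the non-binary columns are assumed to be the count and continuous descriptors):
```{r}
# Tally binary vs. non-binary predictors (excluding the response)
predictors <- setdiff(names(train), "Solubility")
is_binary <- sapply(train[predictors], function(x) all(x %in% 0:1))
table(is_binary)
```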
```{r}
# Checking for missing values
missing_train <- sapply(train, function(x) sum(is.na(x)))
print(missing_train[missing_train > 0])
missing_test <- sapply(test, function(x) sum(is.na(x)))
print(missing_test[missing_test > 0])
```
There are no missing values in either the training or the test set.
```{r}
hist(train$Solubility)
```
## Data Processing
```{r}
# Training set
X.train = model.matrix(Solubility ~ ., train)[,-1]
y.train = train$Solubility
# Testing set
X.test = model.matrix(Solubility ~ ., test)[,-1]
y.test = test$Solubility
# Validation control
ctrl1 <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
```
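As a quick check before modeling, the two design matrices should agree on the number of predictor columns (a minimal sanity check, assuming both files share the same columns):
```{r}
# Dimensions of the design matrices (rows = observations, columns = predictors)
dim(X.train)
dim(X.test)
```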
## Question 1 -- Linear Model
```{r}
set.seed(2)
lm.fit <- train(X.train, y.train,
                method = "lm",
                trControl = ctrl1)
pred.lm <- predict(lm.fit$finalModel, newdata = data.frame(X.test))
print(mean((pred.lm - y.test)^2))
```
The test MSE for the linear model is 0.5558898.
## Question 2 -- Ridge Regression
```{r}
set.seed(2)
ridge.fit <- train(X.train, y.train,
                   method = "glmnet",
                   tuneGrid = expand.grid(alpha = 0,
                                          lambda = exp(seq(-10, 10, length = 200))),
                   trControl = ctrl1)
plot(ridge.fit)
plot(ridge.fit, xTrans = function(x) log(x)) # plot against log(lambda) so the x-axis matches the previous plots
ridge.fit$bestTune
```
```{r}
bestlam.ridge = ridge.fit$bestTune$lambda
bestlam.ridge
```
```{r}
ridge.pred = predict(ridge.fit$finalModel, s = bestlam.ridge, newx = X.test)
print(mean((ridge.pred - y.test)^2))
```
The test MSE for ridge regression is 0.51346.
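For reference, the coefficient paths of the underlying `glmnet` fit can be plotted directly (a minimal sketch using glmnet's built-in plot method; the dashed line marks the CV-selected lambda):
```{r}
# Ridge coefficient paths as lambda varies, with the selected lambda marked
plot(ridge.fit$finalModel, xvar = "lambda")
abline(v = log(bestlam.ridge), lty = 2)
```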
## Question 3 -- The Lasso
```{r}
set.seed(2)
lasso.fit <- train(X.train, y.train,
                   method = "glmnet",
                   tuneGrid = expand.grid(alpha = 1,
                                          lambda = exp(seq(-10, 10, length = 200))),
                   trControl = ctrl1)
plot(lasso.fit)
plot(lasso.fit, xTrans = function(x) log(x))
```
```{r}
bestlam.lasso = lasso.fit$bestTune$lambda
bestlam.lasso
```
```{r}
lasso.pred = predict(lasso.fit$finalModel, s = bestlam.lasso, newx = X.test)
print(mean((lasso.pred - y.test)^2))
```
The test MSE for the lasso is 0.496.
```{r}
lasso.coef = predict(lasso.fit$finalModel, type = "coefficients", s = bestlam.lasso)[1:ncol(train),]
length(lasso.coef)
```
```{r}
length(lasso.coef[lasso.coef != 0])
```
There are 144 non-zero coefficients; the lasso sets the remaining coefficients exactly to zero.
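To see which predictors survive the shrinkage, the non-zero coefficients can be inspected by name (a minimal sketch; the first entry is the intercept):
```{r}
# Peek at the first few terms retained by the lasso
head(names(lasso.coef[lasso.coef != 0]), 10)
```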
## Question 4 -- PCR
```{r}
set.seed(2)
pcr.fit <- train(X.train, y.train,
                 method = "pcr",
                 tuneLength = 228,
                 trControl = ctrl1,
                 scale = TRUE)
pred.pcr <- predict(pcr.fit$finalModel, newdata = X.test,
                    ncomp = pcr.fit$bestTune$ncomp)
mean((pred.pcr - y.test)^2)
ggplot(pcr.fit, highlight = TRUE) + theme_bw()
```
The test MSE for PCR is 0.54055.
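The number of components chosen by cross-validation (referenced in the discussion below) can be read off the tuning results:
```{r}
# Number of principal components selected by repeated CV
pcr.fit$bestTune
```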
## Question 5 -- Discussion
```{r}
resamp <- resamples(list(lasso = lasso.fit,
                         ridge = ridge.fit,
                         pcr = pcr.fit,
                         lm = lm.fit))
print(summary(resamp))
bwplot(resamp, metric = "RMSE")
```
The cross-validated RMSE is lowest for the lasso compared to ridge, PCR, and the linear model, which suggests that the coefficients for some of the predictors are truly zero. For this particular dataset, the lasso provides the best fit. Ridge regression is the next best model and the linear model is the worst, which indicates that shrinking the coefficients is helpful. PCR with 149 components yields performance comparable to the linear model.
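To complement the resampling comparison, the test MSEs computed above can be gathered into one table (a minimal sketch reusing the prediction objects from the earlier chunks):
```{r}
# Summarize the test-set MSEs of all four models
tibble(model = c("linear", "ridge", "lasso", "pcr"),
       test_mse = c(mean((pred.lm - y.test)^2),
                    mean((ridge.pred - y.test)^2),
                    mean((lasso.pred - y.test)^2),
                    mean((pred.pcr - y.test)^2)))
```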