---
title: "Homework 1"
author: "Amin Yakubu"
date: "2/23/2019"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```
```{r}
library(tidyverse)
library(caret)
```
# Data
```{r}
train = read_csv('./data/solubility_train.csv')
test = read_csv('./data/solubility_test.csv')
```
The data has been divided into training and testing sets. The training data has 951 observations and the test data has 356 observations. There are `r ncol(train) - 1` predictors: `r sum(apply(train, 2, function(x) { all(x %in% 0:1) }))` are binary variables that indicate the presence or absence of a particular chemical substructure, 16 are count descriptors, such as the number of bonds or the number of bromine atoms, and 4 are continuous descriptors, such as molecular weight or surface area. The response is in the column `Solubility`, which is a continuous variable.
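As a quick sanity check, the split between binary fingerprints and the remaining descriptors can be tallied directly (a minimal sketch; the non-binary columns are assumed to be the count and continuous descriptors):
```{r}
# Tally binary vs. non-binary predictors (excluding the response)
predictors <- setdiff(names(train), "Solubility")
is_binary <- sapply(train[predictors], function(x) all(x %in% 0:1))
table(is_binary)
```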
```{r}
# Checking for missing values
missing_train <- sapply(train, function(x) sum(is.na(x)))
print(missing_train[missing_train > 0])
missing_test <- sapply(test, function(x) sum(is.na(x)))
print(missing_test[missing_test > 0])
```
There are no missing values in either the training or the test set.
```{r}
hist(train$Solubility)
```
## Data Processing
```{r}
# Training set
X.train = model.matrix(Solubility ~ ., train)[,-1]
y.train = train$Solubility
# Testing set
X.test = model.matrix(Solubility ~ ., test)[,-1]
y.test = test$Solubility
# Validation control
ctrl1 <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
```
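As a quick check before modeling, the two design matrices should agree on the number of predictor columns (a minimal sanity check, assuming both files share the same columns):
```{r}
# Dimensions of the design matrices (rows = observations, columns = predictors)
dim(X.train)
dim(X.test)
```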
## Question 1 -- Linear Model
```{r}
set.seed(2)
lm.fit <- train(X.train, y.train,
                method = "lm",
                trControl = ctrl1)
pred.lm <- predict(lm.fit$finalModel, newdata = data.frame(X.test))
print(mean((pred.lm - y.test)^2))
```
The test MSE for the linear model is 0.5558898.
## Question 2 -- Ridge Regression
```{r}
set.seed(2)
ridge.fit <- train(X.train, y.train,
                   method = "glmnet",
                   tuneGrid = expand.grid(alpha = 0,
                                          lambda = exp(seq(-10, 10, length = 200))),
                   trControl = ctrl1)
plot(ridge.fit)
plot(ridge.fit, xTrans = function(x) log(x)) # plot against log(lambda) so the x-axis matches the previous plots
ridge.fit$bestTune
```
```{r}
bestlam.ridge = ridge.fit$bestTune$lambda
bestlam.ridge
```
```{r}
ridge.pred = predict(ridge.fit$finalModel, s = bestlam.ridge, newx = X.test)
print(mean((ridge.pred - y.test)^2))
```
The test MSE for ridge regression is 0.51346.
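For reference, the coefficient paths of the underlying `glmnet` fit can be plotted directly (a minimal sketch using glmnet's built-in plot method; the dashed line marks the CV-selected lambda):
```{r}
# Ridge coefficient paths as lambda varies, with the selected lambda marked
plot(ridge.fit$finalModel, xvar = "lambda")
abline(v = log(bestlam.ridge), lty = 2)
```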
## Question 3 -- The Lasso
```{r}
set.seed(2)
lasso.fit <- train(X.train, y.train,
                   method = "glmnet",
                   tuneGrid = expand.grid(alpha = 1,
                                          lambda = exp(seq(-10, 10, length = 200))),
                   trControl = ctrl1)
plot(lasso.fit)
plot(lasso.fit, xTrans = function(x) log(x))
```
```{r}
bestlam.lasso = lasso.fit$bestTune$lambda
bestlam.lasso
```
```{r}
lasso.pred = predict(lasso.fit$finalModel, s = bestlam.lasso, newx = X.test)
print(mean((lasso.pred - y.test)^2))
```
The test MSE for the lasso is 0.496.
```{r}
lasso.coef = predict(lasso.fit$finalModel, type = "coefficients", s = bestlam.lasso)[1:ncol(train),]
length(lasso.coef)
```
```{r}
length(lasso.coef[lasso.coef != 0])
```
There are 144 non-zero coefficients; the lasso sets the remaining coefficients exactly to zero.
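To see which predictors survive the shrinkage, the non-zero coefficients can be inspected by name (a minimal sketch; the first entry is the intercept):
```{r}
# Peek at the first few terms retained by the lasso
head(names(lasso.coef[lasso.coef != 0]), 10)
```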
## Question 4 -- PCR
```{r}
set.seed(2)
pcr.fit <- train(X.train, y.train,
                 method = "pcr",
                 tuneLength = 228,
                 trControl = ctrl1,
                 scale = TRUE)
pred.pcr <- predict(pcr.fit$finalModel, newdata = X.test,
                    ncomp = pcr.fit$bestTune$ncomp)
mean((pred.pcr - y.test)^2)
ggplot(pcr.fit, highlight = TRUE) + theme_bw()
```
The test MSE for PCR is 0.54055.
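The number of components chosen by cross-validation (referenced in the discussion below) can be read off the tuning results:
```{r}
# Number of principal components selected by repeated CV
pcr.fit$bestTune
```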
## Question 5 -- Discussion
```{r}
resamp <- resamples(list(lasso = lasso.fit,
                         ridge = ridge.fit,
                         pcr = pcr.fit,
                         lm = lm.fit))
print(summary(resamp))
bwplot(resamp, metric = "RMSE")
```
The cross-validated RMSE is lowest for the lasso compared to ridge, PCR, and the linear model, which suggests that the coefficients for some of the predictors are truly zero. For this particular dataset, the lasso provides the best fit. Ridge regression is the next best model and the linear model is the worst, which indicates that shrinking the coefficients is helpful. PCR with 149 components yields performance comparable to the linear model.
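To complement the resampling comparison, the test MSEs computed above can be gathered into one table (a minimal sketch reusing the prediction objects from the earlier chunks):
```{r}
# Summarize the test-set MSEs of all four models
tibble(model = c("linear", "ridge", "lasso", "pcr"),
       test_mse = c(mean((pred.lm - y.test)^2),
                    mean((ridge.pred - y.test)^2),
                    mean((lasso.pred - y.test)^2),
                    mean((pred.pcr - y.test)^2)))
```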