---
title: "Spike Computational cost"
author: "Gabriel Ristow Cidral / Sara Marin Lopez"
date: "11/04/2019"
output:
  rmdformats::readthedown:
    thumbnails: true
    lightbox: true
    toc_depth: 3
    gallery: true
    highlight: tango
---
<img style="float: right;" src="https://media.timtul.com/media/network22/ubiqum.png">
This spike is intended to provide ideas on how to make your code more efficient. The data used is the WIFI dataset.
It covers four approaches: smart samples, parallel processing, modeling without caret, and optimization of random forest (mtry).
## SMART SAMPLES
First, imagine you want to sample the data to try different models faster. You could use the function sample_n, but you would incur the risk of an unbalanced sample (such as all observations falling in building 0).
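As a quick illustration of that risk, here is a minimal base-R sketch on hypothetical toy data (not the WIFI dataset): a naive sample can under-represent a small group, while a stratified sample guarantees equal counts per group.

```r
# Hypothetical toy data (NOT the WIFI dataset): three buildings of very unequal size
set.seed(123)
toy <- data.frame(
  BUILDINGID = rep(c(0, 1, 2), times = c(5000, 300, 50)),
  LONGITUDE  = runif(5350)
)

# Naive sampling: the small building 2 may be barely represented, or missing
naive <- toy[sample(nrow(toy), 60), ]
table(naive$BUILDINGID)

# Stratified sampling in base R: exactly 20 rows per building
strat <- do.call(rbind, lapply(split(toy, toy$BUILDINGID),
                               function(g) g[sample(nrow(g), 20), ]))
table(strat$BUILDINGID)
```

The group_by/sample_n pattern shown below does the same stratification with dplyr.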
Load data
```{r load data, message=F}
pacman::p_load(readr, dplyr, caret, plotly, htmltools)
train <- read_csv("trainingData.csv", na = c("N/A"))
```
Sample data
```{r create sample}
sample <- train %>% group_by(FLOOR, BUILDINGID) %>% sample_n(10)
```
Check the floor frequencies
```{r table floor}
table(sample$FLOOR)
```
Check the building frequencies
```{r table building}
table(sample$BUILDINGID)
```
plot sample - Building 0, Building 1, Building 2
```{r plotly1, echo=T, eval= T}
sample$BUILDINGID <- as.character(sample$BUILDINGID)
a <- htmltools::tagList()
for(i in unique(sample$BUILDINGID)){
a[[i]] <- sample %>% dplyr::filter(BUILDINGID == i) %>% plot_ly(type = "scatter3d",
x = ~ LATITUDE,
y = ~ LONGITUDE,
z = ~ FLOOR,
mode = 'markers')
}
a[[1]] # Building 0
a[[2]] # Building 1
a[[3]] # Building 2
```
## SPECIFIC PACKAGES ##
### Random Forest: package randomForest ###
This is the most commonly used package for training a random forest. It is user-friendly and robust. If you want to learn more about other packages check this **[resource](https://www.linkedin.com/pulse/different-random-forest-packages-r-madhur-modi/)**.
Let's see which are the main parameters of the function **<span style="color:MIDNIGHTBLUE">[randomForest](https://www.rdocumentation.org/packages/randomForest/versions/4.6-14/topics/randomForest)</span>**:
<ul>
<li> ntree: number of trees to grow </li>
<li> mtry: how many random variables will be selected to grow in a single tree </li>
<li> importance: should importance of predictors be assessed? *Keep in mind that if your data includes categorical variables with different number of levels, random forests are biased in favor of those variables with more levels.* </li>
</ul>
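To see these parameters in isolation, here is a minimal, self-contained sketch using the built-in iris data (iris and the settings below are illustrative stand-ins, not part of the WIFI exercise):

```r
# Illustrative only: the built-in iris data stands in for the WIFI set
library(randomForest)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris,
                   ntree = 100,       # number of trees to grow
                   mtry  = 2,         # variables tried at each split
                   importance = TRUE) # assess predictor importance

# Per-class and overall importance measures
importance(rf)
```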
Another useful function from this package is **<span style="color:MIDNIGHTBLUE">[tuneRF()](https://www.rdocumentation.org/packages/randomForest/versions/4.6-14/topics/tuneRF)</span>**. Starting from the default value of mtry, it searches for the optimal value.
**Your turn! Try to obtain the best mtry for your data and train a random forest using this package and the caret package.**
```{r eval=FALSE, message=FALSE}
# Load package
library(randomForest)

# Save the WAP column names in a vector
WAPs <- grep("WAP", names(train), value = TRUE)

# Get the best mtry, starting from the default value
bestmtry_rf <- tuneRF(sample[WAPs], sample$LONGITUDE, ntreeTry = 100,
                      stepFactor = 2, improve = 0.05, trace = TRUE, plot = TRUE)

# Train a random forest using that mtry
system.time(rf_reg <- randomForest(y = sample$LONGITUDE, x = sample[WAPs],
                                   importance = TRUE, ntree = 100, mtry = 22))

# Train a random forest using the caret package
system.time(rf_reg_caret <- train(y = sample$LONGITUDE, x = sample[WAPs],
                                  method = "rf", ntree = 100,
                                  tuneGrid = expand.grid(.mtry = 22)))
```
### KNN: caret package ###
Explore the main parameters of these functions **<span style="color:MIDNIGHTBLUE">[knn3()](https://www.rdocumentation.org/packages/caret/versions/6.0-81/topics/knn3)</span>** for classification and **<span style="color:MIDNIGHTBLUE">[knnreg()](https://www.rdocumentation.org/packages/caret/versions/6.0-81/topics/knnreg)</span>** for regression:
**Train knn models both with these functions directly and with caret's train() function:**
```{r eval=FALSE, message=FALSE}
# Load the package
library(caret)

# Save the WAP column names in a vector
WAPs <- grep("WAP", names(sample), value = TRUE)

# Calculate the pre-processing parameters from the dataset
preprocessParams <- preProcess(sample[WAPs], method = c("center", "scale"))

# Transform the WAPs using those parameters
stand_waps <- predict(preprocessParams, sample[WAPs])

# Complete dataset (BUILDINGID as a factor, since classification needs one)
stand_dataset <- cbind(stand_waps,
                       BUILDINGID = as.factor(sample$BUILDINGID),
                       LONGITUDE = sample$LONGITUDE)

# Train two classification knn models (with knn3 and train)
system.time(knn_clasif <- knn3(as.matrix(stand_dataset[WAPs]),
                               stand_dataset$BUILDINGID))
system.time(knn_clasif_caret <- train(y = stand_dataset$BUILDINGID,
                                      x = stand_dataset[WAPs], method = "knn"))

# Train two regression knn models (with knnreg and train)
system.time(knn_reg <- knnreg(as.matrix(stand_dataset[WAPs]),
                              stand_dataset$LONGITUDE))
system.time(knn_reg_caret <- train(y = stand_dataset$LONGITUDE,
                                   x = stand_dataset[WAPs], method = "knn"))
```
### SVM: e1071 package ###
Explore the main parameters of these functions **<span style="color:MIDNIGHTBLUE">[svm()](https://cran.r-project.org/web/packages/e1071/e1071.pdf)</span>** for classification and regression.
Read this resource for more info **<span style="color:MIDNIGHTBLUE">[svm()](https://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf)</span>** for classification and regression.
**Train svm models both with the e1071 function directly and with caret's train() function:**
```{r eval=FALSE, message=FALSE}
# Load the packages
library(e1071)
library(caret)

# Save the WAP column names in a vector
WAPs <- grep("WAP", names(sample), value = TRUE)

# Train two classification svm models (with svm and train)
system.time(svm_clasif <- svm(y = stand_dataset$BUILDINGID, x = stand_dataset[WAPs]))
system.time(svm_clasif_caret <- train(y = stand_dataset$BUILDINGID,
                                      x = stand_dataset[WAPs], method = "svmLinear"))

# Train two regression svm models (with svm and train)
system.time(svm_reg <- svm(y = stand_dataset$LONGITUDE, x = stand_dataset[WAPs]))
system.time(svm_reg_caret <- train(y = stand_dataset$LONGITUDE,
                                   x = stand_dataset[WAPs], method = "svmLinear"))
```
## PARALLEL PROCESSING ##
A computer usually has multiple cores. Typically, R uses only one of them, but we can register more, allowing us to execute several computations at the same time.
**How to do it on Windows**
<ul>
<li>Install the doParallel package </li>
<li>Check how many cores you have with the function **<span style="color:MIDNIGHTBLUE">detectCores()</span>**.</li>
<li>Save the number of cores that you would like to execute with the function **<span style="color:MIDNIGHTBLUE ">makeCluster()</span>**. A good practice is to leave one for other tasks. </li>
<li>Register the cluster with the function **<span style="color:MIDNIGHTBLUE">registerDoParallel()</span>**</li>
</ul>
**How to do it on Mac/Linux**
<ul>
<li>Install the doMC package </li>
<li>Check how many cores you have with the function **<span style="color:MIDNIGHTBLUE">detectCores()</span>** (from the parallel package) </li>
<li>Register the number of cores that you would like to use with the function **<span style="color:MIDNIGHTBLUE">registerDoMC()</span>**. A good practice is to leave one for other tasks. </li>
<li>Confirm how many workers are registered with the function **<span style="color:MIDNIGHTBLUE">getDoParWorkers()</span>**</li>
</ul>
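The Mac/Linux steps above can be sketched as follows (this assumes the doMC package is installed; it is not available on Windows):

```r
# Mac/Linux sketch (assumes the doMC package is installed; not available on Windows)
library(doMC)

# Register all but one core, leaving one free for other tasks
registerDoMC(cores = parallel::detectCores() - 1)

# Confirm how many workers are registered
getDoParWorkers()
```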
Now you can apply parallel processing! For example, you can enable it in cross validation by setting "allowParallel = TRUE" in trainControl(), and caret will then train the random forest in parallel.
**Challenge: Train the same sample with parallel processing**
```{r eval=FALSE, message=FALSE}
# Load the library
library(doParallel)

# Check the number of cores
detectCores()

# Create a cluster, leaving one core free for other tasks
cluster <- makeCluster(detectCores() - 1)

# Register the cluster
registerDoParallel(cluster)

# Apply it in the cross validation
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                           allowParallel = TRUE)

# When you are done, release the cores
stopCluster(cluster)
```
## SAVING AND LOADING MODELS ##
You can save your best models to a file. This way, you will be able to load/share them later.
<ul>
<li> For saving a model you can mainly use two functions: **<span style="color:MIDNIGHTBLUE">save()</span>** (to a .rda file) or **<span style="color:MIDNIGHTBLUE">saveRDS()</span>** (to a .rds file) </li>
<li> For loading a model you will need to use **<span style="color:MIDNIGHTBLUE">load()</span>** or **<span style="color:MIDNIGHTBLUE">readRDS()</span>**, respectively </li>
</ul>
**Your turn! Try to save and load some models.**
```{r eval=FALSE, message=FALSE}
# Save a model
saveRDS(RF_Model, file = "RF_Model.rds")

# Load it back (readRDS returns the object, so assign it a name)
final_model <- readRDS("RF_Model.rds")
```