---
title: "Spike Computational cost"
author: "Gabriel Ristow Cidral / Sara Marin Lopez"
date: "11/04/2019"
output:
  rmdformats::readthedown:
    thumbnails: true
    lightbox: true
    toc_depth: 3
    gallery: true
    highlight: tango
---
<img style="float: right;" src="https://media.timtul.com/media/network22/ubiqum.png">
This spike is intended to provide ideas on how to make your code more efficient. The data used is the WIFI dataset.
It covers four approaches: smart samples, parallel processing, modeling without caret, and optimization of random forest (mtry).
## SMART SAMPLES
First, imagine you want to sample the data to try different models faster. You could use the function sample_n, but you would incur the risk of an unbalanced sample (such as all observations falling in building 0).
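As a quick illustration of that risk, here is a minimal base-R sketch on hypothetical toy data (not the WIFI dataset): a naive sample can under-represent a small group, while a stratified sample guarantees equal counts per group.

```r
# Hypothetical toy data (NOT the WIFI dataset): three buildings of very unequal size
set.seed(123)
toy <- data.frame(
  BUILDINGID = rep(c(0, 1, 2), times = c(5000, 300, 50)),
  LONGITUDE  = runif(5350)
)

# Naive sampling: the small building 2 may be barely represented, or missing
naive <- toy[sample(nrow(toy), 60), ]
table(naive$BUILDINGID)

# Stratified sampling in base R: exactly 20 rows per building
strat <- do.call(rbind, lapply(split(toy, toy$BUILDINGID),
                               function(g) g[sample(nrow(g), 20), ]))
table(strat$BUILDINGID)
```

The group_by/sample_n pattern shown below does the same stratification with dplyr.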
Load data
```{r load data, message=F}
pacman::p_load(readr, dplyr, caret, plotly, htmltools)
train <- read_csv("trainingData.csv", na = c("N/A"))
```
Sample data
```{r create sample}
sample <- train %>% group_by(FLOOR, BUILDINGID) %>% sample_n(10)
```
Check the floor frequencies
```{r table floor}
table(sample$FLOOR)
```
Check the building frequencies
```{r table building}
table(sample$BUILDINGID)
```
plot sample - Building 0, Building 1, Building 2
```{r plotly1, echo=T, eval= T}
sample$BUILDINGID <- as.character(sample$BUILDINGID)
a <- htmltools::tagList()
for(i in unique(sample$BUILDINGID)){
a[[i]] <- sample %>% dplyr::filter(BUILDINGID == i) %>% plot_ly(type = "scatter3d",
x = ~ LATITUDE,
y = ~ LONGITUDE,
z = ~ FLOOR,
mode = 'markers')
}
a[[1]] # Building 0
a[[2]] # Building 1
a[[3]] # Building 2
```
## SPECIFIC PACKAGES ##
### Random Forest: package randomForest ###
This is the most commonly used package for training a random forest. It is user-friendly and robust. If you want to learn more about other packages check this **[resource](https://www.linkedin.com/pulse/different-random-forest-packages-r-madhur-modi/)**.
Let's see which are the main parameters of the function **<span style="color:MIDNIGHTBLUE">[randomForest](https://www.rdocumentation.org/packages/randomForest/versions/4.6-14/topics/randomForest)</span>**:
<ul>
<li> ntree: number of trees to grow </li>
<li> mtry: how many random variables will be selected to grow in a single tree </li>
<li> importance: should importance of predictors be assessed? *Keep in mind that if your data includes categorical variables with different number of levels, random forests are biased in favor of those variables with more levels.* </li>
</ul>
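To see these parameters in isolation, here is a minimal, self-contained sketch using the built-in iris data (iris and the settings below are illustrative stand-ins, not part of the WIFI exercise):

```r
# Illustrative only: the built-in iris data stands in for the WIFI set
library(randomForest)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris,
                   ntree = 100,       # number of trees to grow
                   mtry  = 2,         # variables tried at each split
                   importance = TRUE) # assess predictor importance

# Per-class and overall importance measures
importance(rf)
```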
Another useful function from this package is **<span style="color:MIDNIGHTBLUE">[tuneRF()](https://www.rdocumentation.org/packages/randomForest/versions/4.6-14/topics/tuneRF)</span>**. Starting from the default value of mtry, it searches for the optimal value.
**Your turn! Try to obtain the best mtry for your data and train a random forest using this package and the caret package.**
```{r eval=FALSE, message=FALSE}
# Load package
library(randomForest)

# Save the WAP column names in a vector
WAPs <- grep("WAP", names(train), value = TRUE)

# Get the best mtry, starting from the default value
bestmtry_rf <- tuneRF(sample[WAPs], sample$LONGITUDE, ntreeTry = 100,
                      stepFactor = 2, improve = 0.05, trace = TRUE, plot = TRUE)

# Train a random forest using that mtry
system.time(rf_reg <- randomForest(y = sample$LONGITUDE, x = sample[WAPs],
                                   importance = TRUE, ntree = 100, mtry = 22))

# Train a random forest using the caret package
system.time(rf_reg_caret <- train(y = sample$LONGITUDE, x = sample[WAPs],
                                  method = "rf", ntree = 100,
                                  tuneGrid = expand.grid(.mtry = 22)))
```
### KNN: caret package ###
Explore the main parameters of these functions **<span style="color:MIDNIGHTBLUE">[knn3()](https://www.rdocumentation.org/packages/caret/versions/6.0-81/topics/knn3)</span>** for classification and **<span style="color:MIDNIGHTBLUE">[knnreg()](https://www.rdocumentation.org/packages/caret/versions/6.0-81/topics/knnreg)</span>** for regression:
**Train knn models both with these functions directly and with caret's train() function:**
```{r eval=FALSE, message=FALSE}
# Load the package
library(caret)

# Save the WAP column names in a vector
WAPs <- grep("WAP", names(sample), value = TRUE)

# Calculate the pre-processing parameters from the dataset
preprocessParams <- preProcess(sample[WAPs], method = c("center", "scale"))

# Transform the WAPs using those parameters
stand_waps <- predict(preprocessParams, sample[WAPs])

# Complete dataset (BUILDINGID as a factor, since classification needs one)
stand_dataset <- cbind(stand_waps,
                       BUILDINGID = as.factor(sample$BUILDINGID),
                       LONGITUDE = sample$LONGITUDE)

# Train two classification knn models (with knn3 and train)
system.time(knn_clasif <- knn3(as.matrix(stand_dataset[WAPs]),
                               stand_dataset$BUILDINGID))
system.time(knn_clasif_caret <- train(y = stand_dataset$BUILDINGID,
                                      x = stand_dataset[WAPs], method = "knn"))

# Train two regression knn models (with knnreg and train)
system.time(knn_reg <- knnreg(as.matrix(stand_dataset[WAPs]),
                              stand_dataset$LONGITUDE))
system.time(knn_reg_caret <- train(y = stand_dataset$LONGITUDE,
                                   x = stand_dataset[WAPs], method = "knn"))
```
### SVM: e1071 package ###
Explore the main parameters of these functions **<span style="color:MIDNIGHTBLUE">[svm()](https://cran.r-project.org/web/packages/e1071/e1071.pdf)</span>** for classification and regression.
Read this resource for more info **<span style="color:MIDNIGHTBLUE">[svm()](https://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf)</span>** for classification and regression.
**Train svm models both with the e1071 function directly and with caret's train() function:**
```{r eval=FALSE, message=FALSE}
# Load the packages
library(e1071)
library(caret)

# Save the WAP column names in a vector
WAPs <- grep("WAP", names(sample), value = TRUE)

# Train two classification svm models (with svm and train)
system.time(svm_clasif <- svm(y = stand_dataset$BUILDINGID, x = stand_dataset[WAPs]))
system.time(svm_clasif_caret <- train(y = stand_dataset$BUILDINGID,
                                      x = stand_dataset[WAPs], method = "svmLinear"))

# Train two regression svm models (with svm and train)
system.time(svm_reg <- svm(y = stand_dataset$LONGITUDE, x = stand_dataset[WAPs]))
system.time(svm_reg_caret <- train(y = stand_dataset$LONGITUDE,
                                   x = stand_dataset[WAPs], method = "svmLinear"))
```
## PARALLEL PROCESSING ##
A computer usually has multiple cores. Typically, R uses only one of them, but we can register more, allowing us to execute several computations at the same time.
**How to do it on Windows**
<ul>
<li>Install the doParallel package </li>
<li>Check how many cores you have with the function **<span style="color:MIDNIGHTBLUE">detectCores()</span>**.</li>
<li>Save the number of cores that you would like to execute with the function **<span style="color:MIDNIGHTBLUE ">makeCluster()</span>**. A good practice is to leave one for other tasks. </li>
<li>Register the cluster with the function **<span style="color:MIDNIGHTBLUE">registerDoParallel()</span>**</li>
</ul>
**How to do it on Mac/Linux**
<ul>
<li>Install the doMC package </li>
<li>Check how many cores you have with the function **<span style="color:MIDNIGHTBLUE">detectCores()</span>** (from the parallel package) </li>
<li>Register the number of cores that you would like to use with the function **<span style="color:MIDNIGHTBLUE">registerDoMC()</span>**. A good practice is to leave one for other tasks. </li>
<li>Confirm how many workers are registered with the function **<span style="color:MIDNIGHTBLUE">getDoParWorkers()</span>**</li>
</ul>
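The Mac/Linux steps above can be sketched as follows (this assumes the doMC package is installed; it is not available on Windows):

```r
# Mac/Linux sketch (assumes the doMC package is installed; not available on Windows)
library(doMC)

# Register all but one core, leaving one free for other tasks
registerDoMC(cores = parallel::detectCores() - 1)

# Confirm how many workers are registered
getDoParWorkers()
```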
Now you can apply parallel processing! For example, you can enable it in cross validation by setting "allowParallel = TRUE" in trainControl(), and caret will then train the random forest in parallel.
**Challenge: Train the same sample with parallel processing**
```{r eval=FALSE, message=FALSE}
# Load the library
library(doParallel)

# Check the number of cores
detectCores()

# Create a cluster, leaving one core free for other tasks
cluster <- makeCluster(detectCores() - 1)

# Register the cluster
registerDoParallel(cluster)

# Apply it in the cross validation
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                           allowParallel = TRUE)

# When you are done, release the cores
stopCluster(cluster)
```
## SAVING AND LOADING MODELS ##
You can save your best models to a file. This way, you will be able to load/share them later.
<ul>
<li> For saving a model you can mainly use two functions: **<span style="color:MIDNIGHTBLUE">save()</span>** (to a .rda file) or **<span style="color:MIDNIGHTBLUE">saveRDS()</span>** (to a .rds file) </li>
<li> For loading a model you will need to use **<span style="color:MIDNIGHTBLUE">load()</span>** or **<span style="color:MIDNIGHTBLUE">readRDS()</span>**, respectively </li>
</ul>
**Your turn! Try to save and load some models.**
```{r eval=FALSE, message=FALSE}
# Save a model
saveRDS(RF_Model, file = "RF_Model.rds")

# Load it back (readRDS returns the object, so assign it a name)
final_model <- readRDS("RF_Model.rds")
```