---
title: "Naive Bayes Classifier"
author: "Diego J. Bodas Sagi"
date: "15 October 2017"
output:
  html_document:
    toc: true
    toc_float:
      collapsed: false
      smooth_scroll: false
    number_sections: true
---
# Introduction
In machine learning, **Naive Bayes classifiers** are a family of simple probabilistic classifiers based on applying **Bayes' theorem** with strong (*naive*) independence assumptions between the features. Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features/predictors) in a learning problem [@naiveBayesClassifier].
Independent events are those which, upon occurrence, do not affect the probability of another event occurring. An example would be a child in Madrid scoring a goal at school and me eating pasta on the same day. Dependent events are those which, upon occurrence, do affect the probability of another event occurring. An example would be getting drunk one night and suffering a headache the following day: the more you drink, the greater your chance of a headache the next day. When two or more events are mutually exclusive, they cannot occur simultaneously, so their joint probability is zero. For example, it is impossible to be on holiday in Cantabria and in Mexico at the same time; these two outcomes are mutually exclusive.
The Naive Bayes classifier is based on the famous Bayes' theorem:
$P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A)P(B|A)}{P(B)}$
To understand this, consider A as the outcome and B as the evidence. With this formula, the Naive Bayes classifier calculates the conditional probability of a class outcome given prior information or evidence (our attributes in this case). It is termed "naive" because we assume independence between attributes when in reality they may be dependent in some way.
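As a quick worked example with hypothetical numbers: suppose $P(A) = 0.3$, $P(B \mid A) = 0.5$ and $P(B) = 0.25$. Then $P(A \mid B) = \frac{0.3 \times 0.5}{0.25} = 0.6$, so observing the evidence $B$ doubles our belief in the outcome $A$.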
# Probabilistic model
When the evidence is a feature vector $x = (x_{1}, \dots, x_{n})$, we have to apply the chain rule and do some maths:
$$
\begin{aligned}
P(C_{k}, x_{1}, \dots, x_{n}) &= P(x_{1} \mid x_{2}, \dots, x_{n}, C_{k})\,P(x_{2}, \dots, x_{n}, C_{k}) \\
&= P(x_{1} \mid x_{2}, \dots, x_{n}, C_{k})\,P(x_{2} \mid x_{3}, \dots, x_{n}, C_{k})\,P(x_{3}, \dots, x_{n}, C_{k}) \\
&\;\;\vdots \\
&= P(x_{1} \mid x_{2}, \dots, x_{n}, C_{k})\,P(x_{2} \mid x_{3}, \dots, x_{n}, C_{k}) \cdots P(x_{n-1} \mid x_{n}, C_{k})\,P(x_{n} \mid C_{k})\,P(C_{k})
\end{aligned}
$$
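The *naive* conditional independence assumption states that $P(x_{i} \mid x_{i+1}, \dots, x_{n}, C_{k}) = P(x_{i} \mid C_{k})$, so the expansion above collapses to

$P(C_{k} \mid x_{1}, \dots, x_{n}) \propto P(C_{k}) \prod_{i=1}^{n} P(x_{i} \mid C_{k})$

and the classifier simply picks the class $C_{k}$ that maximizes this product.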
# Hands-on
Let's work through a step-by-step Naive Bayes classification example (getting our hands dirty).
Get the data
```{r}
#---------- load library ----------
if("mlbench" %in% rownames(installed.packages()) == FALSE) {install.packages("mlbench", dependencies = TRUE)}
library('mlbench')
# load data
data(BostonHousing2)
# info about the data
str(BostonHousing2)
summary(BostonHousing2)
```
The dataframe BostonHousing contains the original data by @harrisonRubinfeld; the dataframe BostonHousing2 includes additional spatial information. You can find more information about this dataset in @mlbench. The target variable is *medv*, representing the median value of owner-occupied homes in USD 1000s.
We categorize the *age* variable for training purposes:
```{r}
# Using cut
BostonHousing2$age_range <- cut(BostonHousing2$age, breaks = c(-Inf, 20, 30, 40, 60, Inf),
labels = c(0, 1, 2, 3, 4),
right = FALSE)
BostonHousing2$age <- NULL
```
For a simple example, we transform the *medv* variable into a binary one indicating whether the value falls below (0) or above (1) half of the *medv* range.
```{r}
# 0 if medv falls below half of its range, 1 otherwise
BostonHousing2$homes_range <- as.factor(ifelse(BostonHousing2$medv <
  ((max(BostonHousing2$medv) - min(BostonHousing2$medv)) / 2), 0, 1))
# delete medv variable
BostonHousing2$medv <- NULL
```
Examine the new variable
```{r}
plot(BostonHousing2$homes_range)
title(main = "Home range", xlab = "Home Range", ylab = "frequency")
#by age
plot(BostonHousing2[BostonHousing2$age_range == 0, ncol(BostonHousing2)])
title(main = "Home Range for under 20", xlab = "Home Range", ylab = "frequency")
```
Do not forget quality control
```{r}
ages_na <- which(is.na(BostonHousing2$age_range))
homes_na <- which(is.na(BostonHousing2$homes_range))
# no NAs in this case
```
We have already explored the dataset. Now, compute the conditional probability of $homeRange = 0$ given $ageRange = 4$ (senior):
$P(homeRange = 0 | ageRange = 4) = \frac{P(homeRange = 0 \cap ageRange = 4)}{P(ageRange = 4)}$
```{r}
# Exercise:
# Compute: P(homeRange = 0 | ageRange = 4)
```
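One possible approach (a minimal sketch, not the only solution; it estimates both probabilities as sample frequencies):

```{r}
# P(homeRange = 0 AND ageRange = 4) estimated as a sample frequency
joint <- mean(BostonHousing2$homes_range == 0 & BostonHousing2$age_range == 4)
# P(ageRange = 4) estimated as a sample frequency
marginal <- mean(BostonHousing2$age_range == 4)
# conditional probability P(homeRange = 0 | ageRange = 4)
joint / marginal
```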
When we have to take all the variables into account, the maths gets complicated. Luckily, the *e1071* R package is available to help us. But first, prepare the training and test sets.
```{r}
# divide into test and training sets
# create new col "train" and assign 1 or 0 in 80/20 proportion via random uniform dist
BostonHousing2[, "train"] <- ifelse(runif(nrow(BostonHousing2)) < 0.80, 1, 0)
# get col number of train / test indicator column (needed later)
trainColNum <- grep("train", names(BostonHousing2))
# separate training and test sets and remove training column before modeling
trainBostonHousing2 <- BostonHousing2[BostonHousing2$train == 1, -trainColNum]
testBostonHousing2 <- BostonHousing2[BostonHousing2$train == 0, -trainColNum]
```
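Because the split is random, the actual proportions vary from run to run. A quick sanity check (a small sketch using the indicator column we just created):

```{r}
# observed train (1) / test (0) proportions for this particular run
prop.table(table(BostonHousing2$train))
```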
Now, do the maths using the *e1071* package
```{r}
if("e1071" %in% rownames(installed.packages()) == FALSE) {install.packages("e1071", dependencies = TRUE)}
library('e1071')
naive_bayes_model <- naiveBayes(homes_range ~., data = trainBostonHousing2)
naive_bayes_model
summary(naive_bayes_model)
str(naive_bayes_model)
```
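The fitted object stores, for each predictor, the class-conditional distributions the model uses. For example (assuming *age_range* was kept as a predictor), you can inspect its per-class table:

```{r}
# rows: homes_range classes; columns: age_range levels
naive_bayes_model$tables$age_range
```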
Evaluate the model
```{r}
naive_bayes_test_predict <- predict(naive_bayes_model, testBostonHousing2[, -ncol(testBostonHousing2)])
#confusion matrix
table(pred = naive_bayes_test_predict, true = testBostonHousing2$homes_range)
```
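If you want the class posterior probabilities instead of hard labels, `predict` in *e1071* supports `type = "raw"`:

```{r}
# posterior probabilities per class for the first test observations
head(predict(naive_bayes_model, testBostonHousing2[, -ncol(testBostonHousing2)],
             type = "raw"))
```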
Correct predictions
```{r}
#fraction of correct predictions
mean(naive_bayes_test_predict == testBostonHousing2$homes_range)
```
Is this a good or a bad result?
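One way to judge is to compare against a trivial baseline that always predicts the majority class of the training set (a minimal sketch):

```{r}
# accuracy of always predicting the most frequent training class
majority_class <- names(which.max(table(trainBostonHousing2$homes_range)))
mean(testBostonHousing2$homes_range == majority_class)
```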
# Assignment
1. Encapsulate the code by creating **functions**
2. Test more executions (remember that we are using a random function to create the training and test sets)
3. Test other training fractions (0.6, 0.7, 0.8, 0.9)
4. Experiment for yourself: create new ranges, convert variables to factors... How can you improve the results? A tournament is open
5. Try this technique with the dataset from the @kaggle challenge

Use the RMarkdown format.
# References
---
references:
- id: naiveBayesClassifier
  title: Naive Bayes Classifier
  author:
  - family: Wikipedia
  URL: 'https://en.wikipedia.org/wiki/Naive_Bayes_classifier'
  issued:
    year: 2017
- id: mlbench
  title: Package 'mlbench'
  author:
  - family: Leisch
    given: Friedrich
  - family: Dimitriadou
    given: Evgenia
  URL: 'https://cran.r-project.org/web/packages/mlbench/mlbench.pdf'
  issued:
    year: 2015
- id: harrisonRubinfeld
  title: Hedonic prices and the demand for clean air
  author:
  - family: Harrison
    given: D.
  - family: Rubinfeld
    given: D.L.
  container-title: Journal of Environmental Economics and Management
  volume: 5
  publisher: Elsevier
  page: 81-102
  type: article-journal
  issued:
    year: 1978
- id: kaggle
  title: Give Me Some Credit
  author:
  - family: Kaggle
  URL: 'https://www.kaggle.com/c/GiveMeSomeCredit'
  issued:
    year: 2017
---