---
title: "Logistic regression"
output:
  learnr::tutorial:
    theme: cerulean
    progressive: yes
    allow_skip: yes
runtime: shiny_prerendered
resource_files:
- tutorial-logistic.Rmd
- pokemon.csv
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```
```{r packages-data, echo=FALSE, message=FALSE, warning=FALSE}
library(tidyverse)
library(knitr)
library(broom)
library(learnr)
pokemon <- read_csv("pokemon.csv")
```
```{css, echo=FALSE}
.spoiler {
visibility: hidden;
}
.spoiler::before {
visibility: visible;
content: "ANSWER: Hover over me!"
}
.spoiler:hover {
visibility: visible;
}
.spoiler:hover::before {
display: none;
}
# Credit to RLesur on stack overflow (question 51020181) for CSS idea
```
## Today's data
The `pokemon` dataset contains data on all currently extant Pokemon (as of
August 1, 2020), including alternate forms and mega evolutions. For each
Pokemon, the dataset records its name, baseline battle statistics (base
stats), types, and whether it is "legendary" (broadly defined).

Pokemon are fantastic creatures with unique characteristics that may be used to
battle one another. A Pokemon's battle-effectiveness is in part determined by
its "base stats," which consist of HP ("hit points"), attack, defense,
special attack, special defense, and speed. A Pokemon is considered
"legendary" if it is exceptionally rare (in some sense). Legendary Pokemon
are often, but not always, more powerful than their non-legendary counterparts.

*Pokemon and Pokemon names are trademarks of Nintendo.*

Let's read in the dataset and create a new variable, `legendary`.
```{r, message = F, warning = F}
pokemon <- read_csv("pokemon.csv")
pokemon <- pokemon %>%
mutate(legendary = ifelse(leg_status %in% c("Legendary", "Mythical"), 1, 0))
```
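As a quick check, we can tabulate the new variable to see how many Pokemon are
coded as legendary:

```{r}
# Count how many Pokemon are coded as legendary (1) vs. non-legendary (0)
pokemon %>%
  count(legendary)
```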
Some important variables in the dataset are listed below:
- `pokedex_number`: a catalogue number of each Pokemon
- `name`: the name of the Pokemon
- `hp`, `atk`, `def`, `spa`, `spd`, `spe`: a Pokemon's battle statistics (HP,
attack, defense, special attack, special defense, and speed, respectively)
- `bst`: the total base stats of the Pokemon
- `legendary`: whether the Pokemon is legendary
- `type_1`: the primary type of the Pokemon
## Non-continuous outcomes
In previous tutorials, we've focused on using linear regression as a tool to
- Make predictions about new observations
- Describe relationships
- Perform statistical inference
These examples have all had *continuous* response variables. But can we do
similar tasks with **binary categorical** response variables too? Today's goals
are to predict whether a Pokemon is legendary based on its base stats and to
describe the relationship between those stats and the *probability* of being legendary.
Suppose we have some hypothetical Pokemon with the following base stats. Would
we classify them as legendary Pokemon based on these characteristics?
```{r, echo = F}
new_pokes <- tibble(Pokemon = c("Crapistat", "Mediocra", "Literally Dragonite", "Broaken"),
HP = c(55, 90, 91, 104),
ATK = c(25, 110, 134, 125),
DEF = c(30, 130, 95, 105),
SPA = c(60, 75, 100, 148),
SPD = c(50, 80, 100, 102),
SPE = c(102, 45, 80, 136))
kable(new_pokes)
```
## Problems with linear models
Suppose we consider the following model for $p$, the probability of being a legendary Pokemon:
$$p = \beta_0 + \beta_1\times HP + \beta_2\times ATK + \cdots + \beta_6\times SPE$$
Take a moment and think about what might go wrong when we use this model.
Let's fit this model and look at some diagnostic plots. Click `Run Code` in
each of the code fields below to see the residual plot and a histogram of the predicted values.
```{r resid-plot, exercise = TRUE, exercise.eval = FALSE}
m2 <- lm(legendary ~ hp + atk + def + spa + spd + spe, data = pokemon)
ggplot(data = augment(m2), aes(x = .fitted, y = .resid)) +
geom_point() +
labs(x = "Predicted", y = "Residual", title = "Residual plot")
```
```{r resid-hist, exercise = TRUE, exercise.eval = FALSE}
# Refit the model here so this exercise chunk runs on its own
m2 <- lm(legendary ~ hp + atk + def + spa + spd + spe, data = pokemon)
ggplot(data = augment(m2), aes(x = .fitted)) +
  geom_histogram() +
  labs(x = "Predicted Values", y = "Count")
```
**What problems do you see?**
[We see that the linear model assumptions are definitely not satisfied. Also, in
looking at the predicted values, many of the Pokemon are predicted to have
*negative* probabilities of being legendary (which doesn't make sense at all!).
Similarly, it's possible to get predicted probabilities greater than 1 in such
a linear probability model.]{.spoiler}
## Introducing log-odds
Linear models can predict any value from $-\infty$ to $\infty$, but
probabilities are restricted to lie between 0 and 1. Is there a way to model a
*transformation* of the probability in a principled way, so that it's not a
problem if our predictions can take on any value?

Suppose the probability of an event is $p$. Then the **odds** that the event
occurs are $\frac{p}{1-p}$. Taking the natural log of the odds, we have the
**logit** (or log-odds) of $p$:

$$logit(p) = \log\left(\frac{p}{1-p}\right)$$

Note that in statistics, when we say "log" we always mean the natural
logarithm (base $e$). Although $p$ is constrained to lie between 0 and 1, the
**logit transformation** $logit(p)$ can range from $-\infty$ to $\infty$.
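To get a feel for this, here's a small sketch of the odds and log-odds for a
few probabilities; note how probabilities near 0 give large negative log-odds
and probabilities near 1 give large positive log-odds:

```{r}
# Odds and log-odds (logit) for a handful of probabilities;
# qlogis() is base R's built-in logit function
p <- c(0.01, 0.25, 0.50, 0.75, 0.99)
tibble(p = p,
       odds = p / (1 - p),
       log_odds = log(p / (1 - p)),
       qlogis = qlogis(p))
```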
## Logistic regression
Let's instead consider the following linear model for the log-odds of
$p$:
$${logit(p)} = \beta_0 + \beta_1\times HP + \beta_2\times ATK + \cdots + \beta_6\times SPE$$
Since there is a one-to-one relationship between probabilities and log-odds, we
can invert this transformation. If we fit a linear model on the **log-odds**,
we can "work backwards" to obtain predicted probabilities that are guaranteed to
lie between 0 and 1. To "work backwards," we use the **logistic function**:
$$f(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{1 + e^x}$$
So, our *linear* model for $logit(p)$ is equivalent to
$$p = \frac{e^{\beta_0 + \beta_1\times HP + \beta_2\times ATK + \cdots + \beta_6\times SPE}}{1 + e^{\beta_0 + \beta_1\times HP + \beta_2\times ATK + \cdots + \beta_6\times SPE}}$$
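Both algebraic forms of the logistic function are identical, and each squeezes
any real number into the interval $(0, 1)$. A quick numerical sketch:

```{r}
# Both forms of the logistic function agree and map any real x into (0, 1);
# plogis() is base R's built-in logistic function
x <- c(-10, -2, 0, 2, 10)
tibble(x = x,
       form_1 = 1 / (1 + exp(-x)),
       form_2 = exp(x) / (1 + exp(x)),
       plogis = plogis(x))
```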
## Model fitting
We fit the logistic regression model using the `glm` function, which stands for
generalized linear model. We need to additionally tell R that we want a logistic
model, and so we include the argument `family = "binomial"`.
In general, the syntax is:
```{r, eval = F}
model <- glm(outcome ~ pred1 + pred2 + ..., data = ___, family = "binomial")
```
where `outcome` is a binary numeric outcome with values `0` and `1`, and
`pred1`, `pred2`, ... are the predictor variables. We can also use the `tidy`
function to see a tidy summary of the fitted model object.
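For instance, a minimal sketch using just one predictor (base speed, with a
model object we'll call `speed_mod`) would look like this; the full model with
all six stats is left for the exercise below.

```{r}
# A one-predictor sketch: log-odds of being legendary as a function of base speed
speed_mod <- glm(legendary ~ spe, data = pokemon, family = "binomial")
tidy(speed_mod)
```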
### Exercise 1
Fit a logistic regression model predicting `legendary` status using the
predictors `hp`, `atk`, `def`, `spa`, `spd`, and `spe` using the `pokemon`
dataset, and examine the tidy output of this model.
```{r ex-1, exercise = TRUE}
logit_mod <- glm(___ ~ ___, data = ___,
family = "binomial")
tidy(___)
```
We have a *linear* model on a transformation of the response. Thus, we can
interpret estimated coefficients analogously to what we've learned before for
linear models.
## Interpreting coefficients
Let's take a look at the output from the model in the previous exercise:
```{r echo = F}
logit_mod <- glm(legendary ~ hp + atk + def + spa + spd + spe, data = pokemon,
family = "binomial")
tidy(logit_mod) %>%
mutate_if(is.numeric, round, 4)
```
Holding all other predictors constant, for each unit increase in base
speed, we expect the log-odds of being legendary to increase by 0.0366.
If we exponentiate this, then we have the multiplicative effect on the **odds**
scale (instead of the additive effect on the log-odds scale). Thus, an
equivalent interpretation might be that a Pokemon that has a base speed one unit
larger than another would have exp(0.0366) $\approx$ 1.0373 times the *odds* of
being legendary (holding all other predictors constant).
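As a sketch of how you might compute these odds-scale quantities for every
coefficient at once, we can exponentiate the estimates in the tidy output:

```{r}
# Exponentiate the estimates to move from the log-odds scale to the odds scale
tidy(logit_mod) %>%
  mutate(odds_ratio = exp(estimate)) %>%
  select(term, estimate, odds_ratio)
```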
**Note**: categorical predictors are converted into binary dummy variables just
as in regular linear models, and their coefficients are interpreted as the
difference in log-odds compared to the baseline level (holding all else
constant). We can also exponentiate a dummy coefficient estimate, which gives
the odds ratio comparing observations in that category to the baseline (holding
all else constant).
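For example, here is a sketch adding the categorical predictor `type_1`
alongside base speed (some types may contain few or no legendary Pokemon, so
some estimates can be unstable; the point here is only to illustrate the dummy
coding):

```{r}
# R creates a dummy variable for each type other than the baseline
# (the alphabetically first type); coefficients compare each type's
# log-odds of being legendary to that baseline, holding speed constant
type_mod <- glm(legendary ~ spe + type_1, data = pokemon, family = "binomial")
tidy(type_mod)
```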
## Predicting new values
Like in regular linear regression, we can predict outcomes given any set of
predictor values. In this case, our outcome is the log-odds of being legendary
given a Pokemon's base stats.
We can use the `augment` function (part of the `broom` package) to compute
predicted values. The following code creates a new dataset containing the
predictor values for our four hypothetical Pokemon from before. We then pass
this new dataset to `augment`, which plugs it into an existing logistic
regression model (in this case, the one called `logit_mod`). Finally, we `pull`
the `.fitted` values as a vector.
```{r}
# Base stats for the four hypothetical Pokemon introduced earlier
new_pokemon <- tibble(hp  = c(55, 90, 91, 104),
                      atk = c(25, 110, 134, 125),
                      def = c(30, 130, 95, 105),
                      spa = c(60, 75, 100, 148),
                      spd = c(50, 80, 100, 102),
                      spe = c(102, 45, 80, 136))
# Predicted log-odds of being legendary for each hypothetical Pokemon
pred_log_odds <- augment(logit_mod, newdata = new_pokemon) %>%
  pull(.fitted)
pred_log_odds
```
These are the predicted log-odds of being legendary, given the observed base
stats of our four hypothetical Pokemon.

How might we go from log-odds back to probabilities? We can "undo" the earlier
transformation! Letting $y$ be the log-odds and $p$ the probability, we have

$$p = \frac{e^y}{1 + e^y}$$
### Exercise 2
Using this transformation and your predicted log-odds, what are the predicted
*probabilities* of being legendary? How might you classify these Pokemon given
their predicted probabilities?
```{r ex-2, exercise = TRUE}
newdata <- tibble(___,
                  ___,
                  ...)
pred_probs <- augment(___, newdata = ___) %>%
  mutate(p = exp(___) / (1 + exp(___))) %>%
  pull(p)
```
## Return to Learn R Tutorials Page
[Click here.](https://duke-learning-r.netlify.app)