-
Notifications
You must be signed in to change notification settings - Fork 1
/
tmwr-ch6-parsnip-classification.Rmd
154 lines (124 loc) · 4.37 KB
/
tmwr-ch6-parsnip-classification.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
---
title: "Tidy Modeling with R: Chapter 6 (Parsnip Classification)"
output:
html_document: default
pdf_document: default
---
```{r performance-setup, include = FALSE}
library(tidymodels)
library(tidyverse)
library(modeldata)
tidymodels_prefer()
```
Chapter 6 of the book TMWR focuses on fitting regression models with Parsnip. In this notebook, we examine how to fit a classification model.
The data used is the Watson job attrition dataset, which is available from the `modeldata` package.
```{r data}
data(attrition)
glimpse(attrition)
```
We split the data 80/20 into training and test sets, stratified by the outcome variable `Attrition`.
```{r split}
set.seed(502)
a_split <- initial_split(attrition, prop = 0.80, strata = Attrition)
a_train <- training(a_split)
a_test <- testing(a_split)
```
# Logistic Regression with Parsnip
We construct a logistic regression model object.
```{r}
a_lm <-
logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
```
We can view the call to the underlying object using the `translate()` method.
```{r}
translate(a_lm)
```
We can fit it using a formula via the `fit()` method.
```{r}
a_form_fit <-
a_lm %>%
fit(Attrition ~ ., data = a_train)
tidy(a_form_fit)
```
We can also fit the model using data via the `fit_xy()` method, where `x` is a tibble contain the independent variables (features) and `y` is a vector containing the dependent variable (response).
```{r}
a_xy_fit <-
a_lm %>%
fit_xy(
x = a_train %>% select(-Attrition),
y = a_train %>% pull(Attrition)
)
tidy(a_xy_fit)
```
The two methods produce exactly the same coefficients and associated statistics.
```{r}
all( tidy(a_xy_fit) == tidy(a_form_fit) )
```
# Random Forest Classification Example
We can construct a random forest classification model object using the same approach as we used for the logistic regression model.
We specify two hyperparameters, the number of trees, and the number of data points required to make a split in a tree. Parsnip creates standard names for hyperparameters, so we would use the same names for `randomForest` as we do for `ranger` below.
```{r}
a_rf <-
rand_forest(trees = 1000, min_n = 5) %>%
set_engine("ranger") %>%
set_mode("classification")
```
We can view the call to the underlying object using the `translate()` method.
```{r}
translate(a_rf)
```
We can fit the random forest model using a formula via the `fit()` method.
```{r}
a_form_fit <-
a_rf %>%
fit(Attrition ~ ., data = a_train)
tidy(a_form_fit)
```
# Extracting the Underlying Model
We can use `extract_fit_engine()` to obtain the fitted model object from the Parsnip object.
```{r}
a_form_fit %>% extract_fit_engine()
```
This allows us to call methods on the underlying object like the linear model's `summary()` function.
```{r}
a_form_fit %>% extract_fit_engine() %>% summary()
```
We can get just the coefficients too.
```{r}
a_form_fit %>% extract_fit_engine() %>% summary() %>% coef()
```
However, the coefficients are stored in a matrix and the methods to get the fitted model parameters vary from model to model. To obtain model paramet `broom` package's `tidy()` method provides a standard way to get model parameters as a tibble from any type of model.
```{r}
tidy(a_form_fit)
```
# Making Predictions with Parsnip
We use `predict()` function to use the fitted model to make predictions on new data. The output is a tibble with the predictions.
```{r}
a_test_small <- a_test %>% slice(1:5)
predict(a_form_fit, new_data = a_test_small)
```
As the order of rows is the same as the original dataset, it's easy to add columns with additional data, such as the actual classification and predicted probability for each class.
```{r}
a_test_small %>%
select(Attrition) %>%
bind_cols(predict(a_form_fit, a_test_small)) %>%
bind_cols(predict(a_form_fit, a_test_small, type = "prob"))
```
Predictions work the same way for all model types, as we can see with this regression tree model.
```{r}
tree_model <-
decision_tree(min_n = 2) %>%
set_engine("rpart") %>%
set_mode("classification")
tree_fit <-
tree_model %>%
fit(Attrition ~ ., data = a_train)
a_test_small %>%
select(Attrition) %>%
bind_cols(predict(tree_fit, a_test_small))
```
# References
1. Max Kuhn and Julia Silge, _Tidy Modeling with R_, chapter 6. https://www.tmwr.org/models.html (2022)
2. https://parsnip.tidymodels.org/reference/logistic_reg.html