-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathManual.Rmd
258 lines (215 loc) · 9.69 KB
/
Manual.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
---
title: "Statistical Model Manual"
output:
pdf_document:
latex_engine: xelatex
html_document:
df_print: paged
number_sections: yes
toc: yes
toc_float: yes
---
\pagenumbering{gobble}
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(tinytex.verbose = TRUE)
```
Part I is the usage of the model, including arguments and coding details. Part II gives an example.
# Part I Model
We use lasso and group lasso to do the variable selection, and fit a logistic regression model for binary classification. Can deal with all shapes of data, including very large sparse data matrices. Return output for test data, including probabilities and classifications.\
## Input
***reportStats Function***\
reportStats(data, y_name, missing_data, is_numeric, nominal_col, ordinal_col, group_lasso, a)
* **data**
\
A data set with n rows (observations) and (p+1) columns (p variables and 1 binary response variable). An ordinal variable with >= 10 levels will be treated as continues one. In this case,
please make sure the levels are created in increasing alphabetical order.
* **y_name**
\
The name of response variables, i.e."missingAgain". Response variable must be binary.
* **missing_data**
\
Does the given data set contain any missing values? (TRUE or FALSE)
If TRUE, the observations with missing values will be ignored.
* **nominal_col**
\
A vector of column indexes of nominal variables should be provided if is_numeric = TRUE. Otherwise give 0 to this argument then the function will detect all categorical variables automatically. \
ex.1 if column 1 is the only nominal variable, nominal_col = 1; \
ex.2 if column 2, 3, 4 are nominal variables, nominal_col = c(2, 3, 4). \
*_Nominal Variable_:*
\
A categorical variable has several values but the order does not matter. For instance, male or female categorical variable do not have ordering. Recode males as 1 and females as 2.\
* **ordinal_col**
\
A vector of column indexes for ordinal variables.\
ex.1 if column 1 is the only categorical variable, ordinal_col = 1;\
ex.2 if column 2, 3, 4 are ordinal variables, ordinal_col = c(2, 3, 4). \
*_Ordinal Variable_:*
\
Ordinal variables have a natural ordering. For example, speed of a moving object - slow, medium, fast and super fast. There is a natural ordering in it Slow < Medium < Fast < Super Fast.
* **group_lasso**
\
If TRUE, use group lasso model, otherwise use plain lasso. \
If the given data have at least one categorical variable with 3 or more levels, group lasso is suggested.
* **train_pct**
\
Percentage of data for training, the rest would be used for testing.
## Output
The following output will be returned from ***reportStats Function***:
* ROC curve
\
ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. It is created by plotting the true positive rate (sensitivity) against the false positive rate (1 − specificity) at various threshold settings. We can get the "best" threshold which has the highest accuracy among all threshold settings, in the meanwhile, the specificity and sensitivity will be returned at this threshold.
* Prediction of test data
\
Plot each person's probabilities (risks) of missing again & classification with the "best" threshold.
* Selected variables
All selected variables and their coefficients will be returned.
* Missing proportion.
The proportion of missing values are returned, including the observations with missing values' percentage, the proportion of missing y(outcome variable) and missing x(predictor variables).
* Accuracy
The accuracy and threshold of the classifier are returned
## R code
### Load packages
```{r}
# loading required packages: glmnet, pROC, readxl
suppressPackageStartupMessages({
library(glmnet);
library(gglasso);
library(pROC);
library(readxl)
})
```
### Function
```{r}
## mp_data is a data frame for the target data set
## a is the proportion of training data
reportStats <- function(data, y_name, missing_data, nominal_col,
ordinal_col, group_lasso, train_pct) {
###################################Data Pre Processing#################################
############# delete obs including missing values
set.seed(33)
if(missing_data==TRUE){
data <- data[complete.cases(data), ]
}else{
data <- data
}
############# factor categorical variables
data[, as.vector(nominal_col)] <-
data.frame(sapply(mp_data[, as.vector(nominal_col)], factor))
data[, as.vector(ordinal_col)] <-
data.frame(sapply(mp_data[, as.vector(ordinal_col)], factor))
############# ordinal>10 levels, transfer to continuous
for(i in ordinal_col){
if(length(levels(data[,i]))>=10){
data[, ordinal_col] <- sapply(mp_data[, ordinal_col], as.numeric)
}
}
############# recode y
names(data)[names(data)==y_name] <- "y"
data$y <- as.factor(data$y)
data$y <- ifelse(data$y != levels(data$y)[1], 1, -1)
############# training & test data
trainDataIndex <- sample(1:nrow(data), train_pct*nrow(data))
trainData <- data[trainDataIndex, ]
testData <- data[-trainDataIndex, ]
Y_train <- data$y[trainDataIndex]
Y_test <- data$y[-trainDataIndex]
#######################################################################################
#######################Logistic Model & Lasso Variables Selection######################
############# data & group
X1 <- model.matrix(y ~., data = data.frame(trainData))
X <- X1[,-1]
group <- attributes(X1)$assign[-1]
X1_test <- model.matrix(y ~., data = data.frame(testData))
X_test <- X1_test[,-1]
if(group_lasso==TRUE){
#################################### group lasso
############# fit group lasso penalized logistic regression
# finds the best lambda parameter by cross validation
# and returns the corresponding model
cv <- cv.gglasso(x=X ,y=Y_train,group=group,loss="logit",pred.loss = "loss")
model <- gglasso(x=X ,y=Y_train,group=group,loss="logit", lambda = cv$lambda.min)
############# test
l <- predict(model, newx=X_test, type="link")
prob <- exp(l) / (1 + exp(l))
} else{
#################################### lasso
# finds the best lambda parameter by cross validation
# and returns the corresponding model
cv <- cv.glmnet(X, y=Y_train, alpha=1, family="binomial",type.measure = "mse")
model <- glmnet(X, y=Y_train, alpha=1, family="binomial", lambda=cv$lambda.min)
# test data
prob <- predict(model,newx = X_test, s=cv$lambda.1se, type="response")
}
#######################################################################################
#######################################Prediction######################################
# selected variables
coefs <- as.matrix(coef(cv))
ix <- which(abs(coefs[,1]) > 0)
selected_coefs_n <- length(ix) - 1
selected_coefs <- coefs[ix,1, drop=FALSE]
# coef transformation
Raw_Coef <- round(selected_coefs, digits=6);
Exponentiated_Coef <- round(exp(selected_coefs),digits=6);
Change_in_Odds <- paste0(format((Exponentiated_Coef-1)*100,digits=6), "%")
table <- data.frame(cbind(Raw_Coef, Exponentiated_Coef, Change_in_Odds))[-1,]
colnames(table)<- c("Raw Coef", "Exponentiated Coef", "Change in Odds")
# choose the best cut off
modelroc1 <- roc(Y_test, as.vector(prob))
threshold <- coords(modelroc1, "best", "threshold")
# probabilities(risks) to predictions
predict1 <- ifelse(as.vector(prob)>threshold[1],1,-1)
# accuracy, confusion matrix
accuracy <- mean(predict1==Y_test)
confusion_matrix <- table(pred=predict1,true=Y_test)
# missing data proportion
p_miss_obs <- 1-sum(complete.cases(data))/nrow(data)
p_miss_y <- sum(is.na(data$y))/nrow(data)
XX <- data[ ,!(names(data) %in% "y")]
p_miss_x <- sum(is.na(XX))/prod(dim(XX))
#######################################################################################
#########################################Output########################################
print("******************************************************************************")
print("****************************Statistical Report********************************")
# ROC curve
print("ROC curve")
plot(modelroc1, print.auc=TRUE, auc.polygon=TRUE, grid=c(0.1,0.2),
grid.col=c("green","red"), max.auc.polygon=TRUE,
auc.polygon.col="skyblue",print.thres=TRUE)
cat("Prediction of test data", "\n")
cat("Plot each person's probabilities of missing again", "\n")
# predict test data
plot(prob, pch = 16, col="grey", ylab = "Probability")
points(which(predict1!=Y_test),
as.vector(prob)[which(predict1!=Y_test)],
pch = 13,col="red", cex = 1.2)
abline(h=threshold[1], col="purple")
print(threshold)
print(paste('Selected variables:', selected_coefs_n))
print(table)
cat("Confusion Matrix", "\n")
print(confusion_matrix)
print(paste('Accuracy:', format(accuracy,digits=2)))
# missing proportion
print(paste("The proportion of observations with missing values:", p_miss_obs))
print(paste("The proportion of missing y(outcome variable):", p_miss_y))
print(paste("The proportion of missing x(predictor variables):", p_miss_x))
print("******************************************************************************")
#######################################################################################
}
```
# Part II Example
## Read in data
```{r}
# read data
mp_data1 <- readxl::read_xlsx("data/test.xlsx")
mp_data2 <- readxl::read_xlsx("data/train.xlsx")
mp_data <- rbind(mp_data1, mp_data2)
mp_data <- data.frame(mp_data)
mp_data <- mp_data[,-1]
```
## Output
```{r}
reportStats(data=mp_data, y_name="missingAgain", missing_data=FALSE,
nominal_col=c(1, 71, 75, 78, 74), ordinal_col=2, group_lasso=TRUE, train_pct=0.7)
```