-
Notifications
You must be signed in to change notification settings - Fork 0
/
irt.Rmd
218 lines (136 loc) · 6.75 KB
/
irt.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
---
output:
pdf_document:
toc: yes
toc_depth: '4'
html_document:
toc: yes
toc_float: yes
toc_depth: 4
---
## Classical Test Theory
Classical test theory, also known as True Score Theory, is mainly to recognize and develop the reliability of psychological tests and assessments in psychological testing
### Assumption
* The correlation between the error and the true score is 0 $\quad \rho(T,E)=0$
* If a person's certain psychological trait can be repeatedly measured several times with parallel tests, the average of their observed scores will be close to the true score $\quad$ E(X)=T
* The measurement error is random and obeys a normal random variable with the mean value of zero
* No correlation between errors on each parallel test $\quad \rho(E_1,E_2)=0$
### CTT Model
There is a linear relationship between the observed and true fractions
$$ X = T + E $$
where X is observed score, T is Ture score and E is error score. Besides
*Variance*:
$$ S^2_X= S^2_T+S^2_E $$
### Limitation
* The sample dependence of statistics, the estimation of validity, reliability, difficulty, discrimination, and other parameters is very dependent on the sample, and the representativeness of the sample to the whole must be emphasized
* Test dependence of measurement scores, because it is difficult to establish "parallel papers", the scores on two different tests measuring the same ability are not comparable.
* Imprecision of reliability estimation. CTT assumes that the measurement error is the same for subjects of different ability levels, but in fact, a test can only be administered to subjects of comparable ability level and test difficulty to obtain relatively high measurement accuracy.
## Item Response Theory
Item Response Theory (IRT), also known as Latent Trait Theory or Latent Trait Model, is a modern psychometric theory that uses item characteristics to estimate latent traits in response to the limitations of CTT.
### Assumption
* Unidimensionality of ability hypothesis - meaning that all items comprising a given test measure the same latent trait.
* Local independence hypothesis - meaning that no correlation exists between items for a given subject.
* Item characteristic curve hypothesis - refers to a model of the functional relationship between the probability of a subject's correct response to an item and his or her ability.
### Item Difficulty and Item Characteristic Cure
Logistic Cure or Item characteristic cure (ICC), is a function relating the probability of correct response to the ability scale.
$$P_i(\theta)=P(U_i=1|\theta)$$
To include the covariates
$$P(U_i|\theta,b_i)=\frac{e^{\theta-b_i}}{1+e^{\theta-b_i}}$$
where $b_i$ is the item difficulty parameter.
Example:
$$P_1(\theta)=\frac{e^{\theta+1}}{1+e^{\theta+1}}, \quad P_2(\theta)=\frac{e^{\theta-1}}{1+e^{\theta-1}}$$
一条逻辑曲线就是一套试题
When $\theta=b_i$, we say that the ability match the difficulty of the tests and we say $P_i=0.5$.
In practice, we set $b_i \in (-3,3)$. The larger $b_i$, the harder item.
### Outfit
Does the model fit the data well?
*Residual-based fit statistics*
$$P_i(\theta)=P(U_i=1|\theta), \quad \mathbb{E}(U_{ji})=P(U_i=1|\theta=\theta_j)$$
*Variance* (Bernulli Distribution)
$$Var(U_{ji})=P(U_i=1|\theta_j)(1-P(U_i=1|\theta_j))=0.05\times 0.95=0.0475$$
假设A每次答对的概率是0.05,所以variance很小
如果B能力更强,答对的概率是0.43,那么对应的variance会是
$$Var(U_{ji})=P(U_i=1|\theta_j)(1-P(U_i=1|\theta_j))=0.43\times 0.57=0.2451$$
如果C是个学霸,答对的概率是0.91,variance也很小
$$Var(U_{ji})=P(U_i=1|\theta_j)(1-P(U_i=1|\theta_j))=0.91\times 0.09=0.0819$$
可以相对应的算标准差, 随着能力($\theta$)的变化,两边小中间大
*Error Bar*
When a item is 'off-target' (too easy or too hard for a person), the variance is relatively small.
*Standardized Residual*
$$Z_{ji}=\frac{U_{ji}-\mathbb{E}(U_{ji})}{\sqrt{Var(U_{ji})}}$$
*Outfit Mean-square Value (MNSQ)*
$$outfit_i=\frac{1}{N}\sum_{j=1}^Nz^2_{ji}$$
Under-fit (bad qustions)
There is a standard in Linacre (2006)
+ $outfit > 2$, bad question, remove
+ $1.5 - 2.0$, not too bad, acceptable
+ $0.5 - 1.5$, good question
+ $<0.5$, not very much influential, acceptable
### Two-parameter logistic model
$$P_i(\theta)=\frac{e^{\theta-b_i}}{1+e^{\theta-bi}}$$
$$P_i(\theta)=\frac{e^{a_i(\theta-b_i)}}{1+e^{a_i(\theta-b_i)}}$$
+ $a_i$ is the discrimination parameter.
+ $b_i$ is the item difficulty parameter.
### Rasch Model
In the Rasch model, the discrinimation parameters of two-parameter logistic models are set to 1, which are equal to the one-parameter logistic models, or they are set to the same value $a>0$
$$P_i(\theta)=\frac{e^{a_i(\theta-b_i)}}{1+e^{a_i(\theta-b_i)}}$$
### A drawback to CTT/UIRT
Classical Test Theory and Item Response Theory
Remedial Course
In traditional method such as classical test theory (CTT)/unidimensional item response theory (UIRT), teachers only use total scores abilities to compare students' learning performance.
Moreover, some students may have the same total score but their mastering skills are differet.
In recent remedial teaching, most teachers use the same learning materials for the students whose total score below than a threshold. 缺点总结,如何改进呢?
### Three-parameter Logistic Model
$$[0,1] \rightarrow [0.2,1]$$
$$P_i(\theta)=c_i+(1-c_i)\frac{e^{a_i(\theta-b_i)}}{1+e^{a_i(\theta-b_i)}}$$
+ $a_i$ discrimination parameter
+ $b_i$ item difficulty parameter
+ $c_i$ guessing parameter
### MTH017 Midterm Example with R
```{r echo=TRUE,message=FALSE,eval=TRUE}
library(CTT)
library(TAM)
library(stringi)
library(tidyverse)
library(readxl)
######setwd("~/Item_Repsone_Theoryu")
#=======
#setwd("~/Item_Repsone_Theoryu")
data1<-read_excel('mth017_mid.xlsx')
data1<-data1 %>% select(2:29)
data2<- apply(data1,2,function(x) str_sub(x,-1,-1))
answer_key<-str_sub(c('1A','2B','3B','4A','5A','6B','7D','8A','9D','10B','11A','12B','13D','14A','15C','16B','17B','18C','19C','20E','21A','22B','23C','24C','25B','26C','27D','28C'),-1,-1)
CTT_score<-score(data2,answer_key,output.scored=T)
#CTT_score$scored
```
```{r echo=TRUE,message=FALSE,eval=TRUE,results='hide'}
library(TAM)
logit1<-tam(CTT_score$scored)
```
Then, we can extract the results
difficulty level
```{r echo=TRUE,message=FALSE,eval=TRUE}
logit1$xsi
```
Mean Sum of Squares
```{r echo=TRUE,message=FALSE,eval=TRUE}
msq.itemfit(logit1)
```
```{r}
plot(logit1,item=1)
```
*Two Parameter IRT*
```{r echo=TRUE,message=FALSE,eval=TRUE,results='hide'}
logit2<-tam.mml.2pl(CTT_score$scored)
```
```{r echo=TRUE,message=FALSE,eval=TRUE}
logit2$item
```
+ AXsi_.Cat1 is the difficulty level
+ B.Cat1.Dim1 is the discrimination level
```{r,results='hide'}
par(mfrow=c(2,2))
for(i in 1:4){
plot(logit2,items=i)
}
```