# Introduction to Mister P
```{r packages-1, warning=FALSE, message=FALSE, echo = FALSE, cache=FALSE}
library(rstan)
library(rstanarm)
library(data.table)
library(dplyr)
library(forcats)
library(tidyr)
library(reshape2)
library(stringr)
library(readr)
library(ggplot2)
library(scales)
library(bayesplot)
library(gridExtra)
library(ggalt)
library(usmap)
library(kableExtra)
library(formatR)
theme_set(bayesplot::theme_default())
# Improves performance in some systems, but can be removed if it causes problems
# Sys.setenv(LOCAL_CPPFLAGS = '-march=corei7 -mtune=corei7')
# Use all available cores
options(mc.cores = parallel::detectCores(logical = FALSE))
```
Multilevel regression and poststratification (MRP, also called MrP or Mister P) has become widely used in two closely related applications:
1. Small-area estimation: Subnational surveys are not always available, and even when they are, finding comparable surveys across subnational units is rare. However, public views at the subnational level are often central, as many policies are decided by local governments or by subnational-area representatives in national assemblies. MRP allows us to use national surveys to generate reliable estimates of subnational opinion (@park2004bayesian, @lax2009gay, @lax2009states, @kiewiet2018predicting).
2. Using nonrepresentative surveys: Many surveys face serious difficulties in recruiting representative samples of participants (e.g. because of non-response bias). However, with proper statistical adjustment, nonrepresentative surveys can be used to generate accurate opinion estimates (@wang2015xbox, @downes2018multilevel).
This initial chapter introduces MRP in the context of public opinion research. Following a brief introduction to the data, we will describe the two essential stages of MRP: building an individual-response model and using poststratification. First, we take individual responses to a national survey and use multilevel modeling in order to predict opinion estimates based on demographic-geographic subgroups (e.g. middle-aged white female with postgraduate education in California). Secondly, these opinion estimates by subgroups are weighted by the frequency of these subgroups at the (national or subnational) unit of interest. With these two steps, MRP emerged (@gelman1997poststratification) as an approach that brought together the advantages of regularized estimation and poststratification, two techniques that had shown promising results in the field of survey research (see @fay1979estimates and @little1993post). After presenting how MRP can be used for obtaining subregion or subgroup estimates and for adjusting for nonresponse bias, we will conclude with some practical considerations.
## Data {#data}
### Survey data {.unnumbered #survey-data}
The first step is to gather and recode raw survey data. These surveys should include some respondent demographic information and some type of geographic indicator (e.g. state, congressional district). In this case, we will use data from the 2018 Cooperative Congressional Election Study (@2018CCES), a US nationwide survey designed by a consortium of 60 research teams and administered by YouGov. The outcome of interest in this introduction is a dichotomous question:
> Allow employers to decline coverage of abortions in insurance plans (Support / Oppose)
Apart from the outcome measure, we will consider a set of geographic-demographic factors that will be used as predictors in the first stage and that define the geographic-demographic subgroups for the second stage. Even though some of these variables may be continuous (e.g. age, income), we must split them into intervals to create a factor with different levels (a brief sketch of this binning step follows the list below). As we will see in a moment, these factors and their corresponding levels need to match the ones in the poststratification table. In this case, we will use the following factors with the indicated levels:
* State: 50 US states ($S = 50$).
* Age: 18-29, 30-39, 40-49, 50-59, 60-69, 70+ ($A = 6$).
* Gender: Female, Male ($G = 2$).
* Ethnicity: (Non-hispanic) White, Black, Hispanic, Other (which also includes Mixed) ($R = 4$).
* Education: No HS, HS, Some college, 4-year college, Post-grad ($E = 5$).
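As a rough illustration of the binning step mentioned above (a minimal sketch with made-up ages; the actual recoding is done inside the `clean_cces()` function shown in the appendix), a continuous age variable can be cut into the intervals listed here as follows:
```{r, eval=FALSE}
# Hypothetical ages, binned into the same intervals used in this chapter
age_example <- c(21, 34, 67, 85)
cut(age_example,
    breaks = c(0, 29, 39, 49, 59, 69, 120),
    labels = c("18-29", "30-39", "40-49", "50-59", "60-69", "70+"),
    ordered_result = TRUE)
```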
```{r, echo = FALSE, cache=FALSE}
# The US census and CCES data use FIPS codes to identify states. For better
# interpretability, we label these FIPS codes with their corresponding abbreviation.
# Note that the FIPS codes include the District of Columbia and US territories, which
# are not considered in this study, creating some gaps in the numbering system.
state_abb <- datasets::state.abb
state_fips <- c(1,2,4,5,6,8,9,10,12,13,15,16,17,18,19,20,21,22,23,24,
25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,
44,45,46,47,48,49,50,51,53,54,55,56)
recode_fips <- function(column) {
factor(column, levels = state_fips, labels = state_abb)
}
# Recode CCES
clean_cces <- function(df, remove_nas = TRUE){
## Abortion -- dichotomous (0 - Oppose / 1 - Support)
df$abortion <- abs(df$CC18_321d-2)
## State -- factor
df$state <- recode_fips(df$inputstate)
## Gender -- dichotomous (coded as -0.5 Female, +0.5 Male)
df$male <- abs(df$gender-2)-0.5
## ethnicity -- factor
df$eth <- factor(df$race,
levels = 1:8,
labels = c("White", "Black", "Hispanic", "Asian", "Native American", "Mixed", "Other", "Middle Eastern"))
df$eth <- fct_collapse(df$eth, "Other" = c("Asian", "Other", "Middle Eastern", "Mixed", "Native American"))
## Age -- cut into factor
df$age <- 2018 - df$birthyr
df$age <- cut(as.integer(df$age), breaks = c(0, 29, 39, 49, 59, 69, 120),
labels = c("18-29","30-39","40-49","50-59","60-69","70+"),
ordered_result = TRUE)
## Education -- factor
df$educ <- factor(as.integer(df$educ),
levels = 1:6,
labels = c("No HS", "HS", "Some college", "Associates", "4-Year College", "Post-grad"), ordered = TRUE)
df$educ <- fct_collapse(df$educ, "Some college" = c("Some college", "Associates"))
# Filter out unnecessary columns and remove NAs
df <- df %>% select(abortion, state, eth, male, age, educ)
if (remove_nas){
df <- df %>% drop_na()
}
return(df)
}
```
```{r, results = 'asis', cache=FALSE, warning=FALSE, message=FALSE}
cces_all_df <- read_csv("data_public/chapter1/data/cces18_common_vv.csv.gz")
# Preprocessing
cces_all_df <- clean_cces(cces_all_df, remove_nas = TRUE)
```
Details about how we preprocess the CCES data using the `clean_cces()` function can be found in the appendix.
The full 2018 CCES consists of almost 60,000 respondents. However, most studies work with a smaller national survey. To show how MRP works in these cases, we take a random sample of 5,000 participants and work with this sample instead of the full CCES. Obviously, in a more realistic setting we would always use all the available data.
```{r, cache=FALSE}
# We set the seed to an arbitrary number for reproducibility.
set.seed(1010)
# For clarity, we will call the full survey with 60,000 respondents cces_all_df,
# while the 5,000 person sample will be called cces_df. 'df' stands for data frame,
# the most frequently used two dimensional data structure in R.
cces_df <- cces_all_df %>% sample_n(5000)
```
```{r, echo=FALSE}
kable(head(cces_df), format = 'markdown')
```
### Poststratification table {.unnumbered #poststratification-table}
The poststratification table reflects the number of people in the population of interest that, according to a large survey, corresponds to each combination of the demographic-geographic factors. In the US context it is typical to use Decennial Census data or the American Community Survey, although we can of course use any other large-scale survey that reflects the frequency of the different demographic types within any geographic area of interest. The poststratification table will be used in the second stage to poststratify the estimates obtained for each subgroup. For this, it is central that the factors (and their levels) used in the survey match the factors obtained in the census. Therefore, MRP is in principle limited to using individual-level variables that are present in both the survey and the census. For instance, the CCES includes information on respondents' religion, but as this information is not available in the census we are not able to use this variable. Chapter 13 will cover different approaches to incorporating noncensus variables into the analysis. Similarly, the levels of the factors in the survey of interest are required to match the ones in the large survey used to build the poststratification table. For instance, the CCES included 'Middle Eastern' as an option for ethnicity, while the census data we used did not include it. Therefore, people who identified as 'Middle Eastern' in the CCES had to be included in the 'Other' category.
In this case, we will base our poststratification table on the 2014-2018 American Community Survey (ACS), a set of yearly surveys conducted by the US Census Bureau that provides estimates of the number of US residents according to a series of variables that include our poststratification variables. Given how we defined the levels for these variables, the poststratification table must have $50 \times 6 \times 2 \times 4 \times 5 = 12,000$ rows. This means we actually have more rows in the poststratification table than observed units, which necessarily implies that there are some combinations in the poststratification table that we do not observe in the CCES sample.
```{r, results = 'asis', cache=FALSE, warning=FALSE, message=FALSE}
# Load data frame created in the appendix. The data frame that contains the poststratification
# table is called poststrat_df
poststrat_df <- read_csv("data_public/chapter1/data/poststrat_df.csv")
```
```{r, echo=FALSE}
kable(head(poststrat_df), format = 'markdown')
```
For instance, the first row in the poststratification table indicates that there are 23,948 Alabamians that are white, male, between 18 and 29 years old, and without a high school degree.
Every MRP study requires some degree of data wrangling in order to make the factors in the survey of interest match the factors available in the census. The code shown in the appendix can be used as a template to download the ACS data and make it match with a given survey of interest.
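As a quick sanity check on this data wrangling (a minimal sketch, assuming the `cces_df` and `poststrat_df` objects created above), we can confirm that the poststratification table has the expected $50 \times 6 \times 2 \times 4 \times 5 = 12,000$ rows and that every factor level used in the survey also appears in the table:
```{r, eval=FALSE}
# Number of rows in the poststratification table
nrow(poststrat_df)
# Every level used in the survey should be present in the poststratification table
all(unique(cces_df$state) %in% unique(poststrat_df$state))
all(levels(cces_df$eth)   %in% unique(poststrat_df$eth))
all(levels(cces_df$age)   %in% unique(poststrat_df$age))
all(levels(cces_df$educ)  %in% unique(poststrat_df$educ))
```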
### Group-level predictors {.unnumbered #group-level-predictors}
The individual-response model used in the first stage can include group-level predictors, which are particularly useful for reducing unexplained group-level variation by accounting for structured differences among the states. For instance, most national-level surveys in the US tend to include many participants from a state such as New York, but few from a small state like Vermont. This can result in noisy estimates for the effect of being from Vermont. The intuition is that by including state-level predictors, such as the Republican vote share in a previous election or the percentage of Evangelicals in each state, the model is able to account for how similar Vermont is to New York and other more populous states, and therefore to produce more precise estimates. These group-level predictors do not need to be available in the census, nor do they have to be converted to factors, and in many cases they are readily available. A more detailed discussion of the importance of building a reasonable model for predicting opinion, and of how state-level predictors can be a key element in this regard, can be found in @lax2009states and @buttice2013mrp.
In our example, we will include two state-level predictors: the geographical region (Northeast, North Central, South, and West) and the Republican vote share in the 2016 presidential election.
```{r, results = 'asis', cache=FALSE, warning=FALSE, message=FALSE}
# Read statelevel_predictors.csv into a data frame called statelevel_predictors_df
statelevel_predictors_df <- read_csv('data_public/chapter1/data/statelevel_predictors.csv')
```
```{r, echo=FALSE}
kable(head(statelevel_predictors_df), format = 'markdown', digits = 2)
```
### Exploratory data analysis {.unnumbered #EDA}
In the previous steps we obtained a 5,000-person sample from the CCES survey and generated a poststratification table using census data. As a first exploratory step, we check whether the frequencies of the different factor levels in the CCES data are similar to the frequencies reported in the census. If this were not the case, we would start to suspect some degree of nonresponse bias in the CCES survey.
For clarity, the levels in the plots follow their natural order in the case of age and education, ordering the others by the approximate proportion of Republican support.
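The dot plots below are generated with `ggplot2` code that is not displayed. As a minimal sketch of the underlying comparison (assuming the `cces_df` and `poststrat_df` objects created above), the sample and population proportions for a single factor, say education, can be computed like this:
```{r, eval=FALSE}
# Proportion of each education level in the 5,000-person sample
educ_sample <- cces_df %>%
  count(educ) %>%
  mutate(Sample = n / sum(n))
# Proportion of each education level in the population, according to the ACS
educ_population <- poststrat_df %>%
  group_by(educ) %>%
  summarise(n_post = sum(n)) %>%
  mutate(Population = n_post / sum(n_post))
# Side-by-side comparison
left_join(mutate(educ_sample, educ = as.character(educ)),
          educ_population, by = "educ") %>%
  select(educ, Sample, Population)
```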
```{r, fig.width=14, fig.height=3.5, echo=FALSE, fig.align = "center", warning=FALSE, cache=FALSE}
# Age
age_sample <- cces_df %>% mutate(age = factor(age, ordered = FALSE)) %>% group_by(age) %>% summarise(n = n()) %>% mutate(Sample = n/sum(n))
age_post <- poststrat_df %>% mutate(age = factor(age, ordered = FALSE)) %>% group_by(age) %>% summarise(n_post = sum(n)) %>% mutate(Population = n_post/sum(n_post))
age <- inner_join(age_sample, age_post, by = "age") %>% select(age, Sample, Population)
age_plot <- ggplot() +
ylab("") + xlab("Proportion") + theme_bw() + coord_flip() +
geom_dumbbell(data = age, aes(y = age, x = Sample, xend = Population)) +
geom_point(data = melt(age, id = "age"), aes(y = age, x = value, color = variable), size = 2) +
scale_x_continuous(limits = c(0, 0.35), breaks = c(0, .1, .2, .3)) + theme(legend.position = "none") + ggtitle("Age")
# Gender
male_sample <- cces_df %>% group_by(male) %>% summarise(n = n()) %>% mutate(Sample = n/sum(n))
male_post <- poststrat_df %>% group_by(male) %>% summarise(n_post = sum(n)) %>% mutate(Population = n_post/sum(n_post))
male <- inner_join(male_sample, male_post, by = "male") %>% select(male, Sample, Population) %>%
mutate(male = factor(male, levels = c(-0.5, 0.5), labels = c("Female", "Male")))
male_plot <- ggplot() +
ylab("") + xlab("") + theme_bw() + coord_flip() +
geom_dumbbell(data = male, aes(y = male, x = Sample, xend = Population)) +
geom_point(data = melt(male, id = "male"), aes(y = male, x = value, color = variable), size = 2) +
scale_x_continuous(limits = c(0, 0.6), breaks = c(0, .2, .4, .6)) + theme(legend.position = "none") + ggtitle("Gender")
# Ethnicity
ethnicity_sample <- cces_df %>% group_by(eth) %>% summarise(n = n()) %>% mutate(Sample = n/sum(n))
ethnicity_post <- poststrat_df %>% group_by(eth) %>% summarise(n_post = sum(n)) %>% mutate(Population = n_post/sum(n_post))
ethnicity <- inner_join(ethnicity_sample, ethnicity_post, by = "eth") %>% select(eth, Sample, Population)
ethnicity$eth <- factor(ethnicity$eth,
levels = c("Black", "Hispanic", "Other", "White"),
labels = c("Black", "Hispanic", "Other", "White"))
ethnicity_plot <- ggplot() +
ylab("") + xlab("") + theme_bw() + coord_flip() +
geom_dumbbell(data = ethnicity, aes(y = eth, x = Sample, xend = Population)) +
geom_point(data = melt(ethnicity, id = "eth"), aes(y = eth, x = value, color = variable), size = 2) +
scale_x_continuous(limits = c(0, 0.8), breaks = c(0, .2, .4, .6, 0.8)) + theme(legend.position = "none") + ggtitle("Ethnicity")
# Education
educ_sample <- cces_df %>% mutate(educ = factor(educ, ordered = FALSE)) %>% group_by(educ) %>% summarise(n = n()) %>% mutate(Sample = n/sum(n))
educ_post <- poststrat_df %>% mutate(educ = factor(educ, ordered = FALSE)) %>% group_by(educ) %>% summarise(n_post = sum(n)) %>% mutate(Population = n_post/sum(n_post))
educ <- inner_join(educ_sample, educ_post, by = "educ") %>% select(educ, Sample, Population)
educ$educ <- factor(educ$educ,
levels = c("No HS", "HS", "Some college", "4-Year College", "Post-grad"),
labels = c("No HS", "HS", "Some\nCollege", "4-year\nCollege", "Post-grad"))
educ_plot <- ggplot() +
ylab("") + xlab("") + theme_bw() + coord_flip() +
geom_dumbbell(data = educ, aes(y = educ, x = Sample, xend = Population)) +
geom_point(data = melt(educ, id = "educ"), aes(y = educ, x = value, color = variable), size = 2) +
scale_x_continuous(limits = c(0, 0.33), breaks = c(0, .1, .2, .3)) + theme(legend.position = "none") + ggtitle("Education")
grid.arrange(age_plot, male_plot, ethnicity_plot, educ_plot,
widths = c(1.5, 0.75, 1.5, 1.5))
```
```{r, fig.width=14, fig.height=3.5, echo=FALSE, fig.align = "center", warning=FALSE, cache=FALSE}
state_sample <- cces_df %>% group_by(state) %>% summarise(n = n()) %>% mutate(Sample = n/sum(n))
state_post <- poststrat_df %>% group_by(state) %>% summarise(n_post = sum(n)) %>% mutate(Population = n_post/sum(n_post))
state <- left_join(state_sample, state_post, by = "state") %>% select(state, Sample, Population) %>% left_join(statelevel_predictors_df, by = "state")
states_order <- state$repvote
state$state <- fct_reorder(state$state, states_order)
state <- state %>% select(state, Sample, Population)
ggplot() +
ylab("") + xlab("Proportion") + theme_bw() + coord_flip() +
geom_dumbbell(data = state, aes(y = state, x = Sample, xend = Population)) +
geom_point(data = melt(state, id = "state"), aes(y = state, x = value, color = variable), size = 2) +
scale_x_continuous(limits = c(0, 0.13), breaks = c(0, .025, .05, .075, .1, .125)) + ggtitle("State") +
theme(legend.position = "bottom", legend.title=element_blank())
```
We see that our 5,000-participant CCES sample does not differ too much from the target population according to the American Community Survey. This should not be surprising, as the CCES intends to use a representative sample.
In general, we recommend checking the differences between the sample and the target population. In this case, the comparison has been based on the factors that are going to be used in MRP. However, even if some nonresponse bias existed for any of these factors, MRP would be able to adjust for it, as we will see in more detail in section 4. Therefore, it may be especially important to compare the sample and target population with respect to the variables that are *not* going to be used in MRP, since we will not be able to correct bias in the outcome measure that is due to differential nonresponse on these non-MRP variables.
## First stage: Estimating the Individual-Response Model {#first-stage}
The first stage is to use a multilevel logistic regression model to predict the outcome measure based on a set of factors. Having a plausible model to predict opinion is central for MRP to work well.
The model we use in this example is described below. It includes varying intercepts for age, ethnicity, education, and state, where the variation for the state intercepts is in turn influenced by the region effects (coded as indicator variables) and the Republican vote share in the 2016 election. As there are only two levels for gender, it is preferable to model it as a predictor for computational efficiency. Additionally, we include varying intercepts for the interaction between gender and ethnicity, education and age, and education and ethnicity (see @ghitza2013deep for an in-depth discussion on interactions in the context of MRP).
$$
Pr(y_i = 1) = logit^{-1}(
\alpha_{\rm s[i]}^{\rm state}
+ \alpha_{\rm a[i]}^{\rm age}
+ \alpha_{\rm r[i]}^{\rm eth}
+ \alpha_{\rm e[i]}^{\rm educ}
+ \beta^{\rm male} \cdot {\rm Male}_{\rm i}
+ \alpha_{\rm g[i], r[i]}^{\rm male.eth}
+ \alpha_{\rm e[i], a[i]}^{\rm educ.age}
+ \alpha_{\rm e[i], r[i]}^{\rm educ.eth}
)
$$
where:
$$
\begin{aligned}
\alpha_{\rm s}^{\rm state} &\sim {\rm normal}(\gamma^0 + \gamma^{\rm south} \cdot {\rm South}_{\rm s} + \gamma^{\rm northcentral} \cdot {\rm NorthCentral}_{\rm s} + \gamma^{\rm west} \cdot {\rm West}_{\rm s} \\ & \quad + \gamma^{\rm repvote} \cdot {\rm RepVote}_{\rm s}, \sigma^{\rm state}) \textrm{ for s = 1,...,50}\\
\alpha_{\rm a}^{\rm age} & \sim {\rm normal}(0,\sigma^{\rm age}) \textrm{ for a = 1,...,6}\\
\alpha_{\rm r}^{\rm eth} & \sim {\rm normal}(0,\sigma^{\rm eth}) \textrm{ for r = 1,...,4}\\
\alpha_{\rm e}^{\rm educ} & \sim {\rm normal}(0,\sigma^{\rm educ}) \textrm{ for e = 1,...,5}\\
\alpha_{\rm g,r}^{\rm male.eth} & \sim {\rm normal}(0,\sigma^{\rm male.eth}) \textrm{ for g = 1,2 and r = 1,...,4}\\
\alpha_{\rm e,a}^{\rm educ.age} & \sim {\rm normal}(0,\sigma^{\rm educ.age}) \textrm{ for e = 1,...,5 and a = 1,...,6}\\
\alpha_{\rm e,r}^{\rm educ.eth} & \sim {\rm normal}(0,\sigma^{\rm educ.eth}) \textrm{ for e = 1,...,5 and r = 1,...,4}\\
\end{aligned}
$$
In this model:
* $\alpha_{\rm a}^{\rm age}$: The effect of subject $i$'s age on the probability of supporting the statement.
* $\alpha_{\rm r}^{\rm eth}$: The effect of subject $i$'s ethnicity on the probability of supporting the statement.
* $\alpha_{\rm e}^{\rm educ}$: The effect of subject $i$'s education on the probability of supporting the statement.
* $\alpha_{\rm s}^{\rm state}$: The effect of subject $i$'s state on the probability of supporting the statement. As we have a state-level predictor (the Republican vote share in the 2016 election), we need to build another model in which $\alpha_{\rm s}^{\rm state}$ is the outcome of a linear regression with an expected value determined by an intercept $\gamma^0$, the effect of the region coded as indicator variables (with Northeast as the baseline level), and the effect of the Republican vote share $\gamma^{\rm repvote}$.
* $\beta^{\rm male}$: The average effect of being male on the probability of supporting the statement. We could have used a similar formulation as in the previous cases (i.e. $\alpha_{\rm g}^{\rm gender} \sim N(0, \sigma^{\rm gender})$), but having only two levels (i.e. male and female) can create estimation problems.
* $\alpha_{\rm g,r}^{\rm male.eth}$ and $\alpha_{\rm e,a}^{\rm educ.age}$: In the survey literature it is common practice to include these two interactions.
* $\alpha_{\rm e,r}^{\rm educ.eth}$: In the next section we will explore opinion on the abortion-coverage statement at the different levels of education and ethnicity. It is, therefore, a good idea to also include this interaction.
Readers without a background in multilevel modeling may be surprised to see this formulation. Why are we using terms such as $\alpha_{\rm r}^{\rm eth}$ instead of the much more common approach of creating an indicator variable for each level (e.g. $\beta^{\rm white} \cdot {\rm White}_{i} + \beta^{\rm black} \cdot {\rm Black}_{i} + ...$)? The answer is that this approach allows us to share information between the levels of each variable (e.g. different ethnicities), preventing levels with less data from being too sensitive to the few observed values. For instance, it could happen that we only surveyed ten Hispanics, and that none of them turned out to agree that employers should be able to decline abortion coverage in insurance plans. Under the typical approach, the model would take this data too seriously and conclude that Hispanics necessarily oppose this statement (i.e. $\beta^{\rm hispanic} = - \infty$). We know, however, that this is not the case. It may be that Hispanics are less likely to support the statement, but from such a small sample size it is impossible to know. What the multilevel model does is partially pool the varying intercept for Hispanics towards the average across all ethnicities (which in our model is fixed at zero), making it negative but far from the unrealistic negative infinity. This pooling is data-dependent: the smaller the sample size at a given level, the more strongly its varying intercept is pulled towards the average. In fact, if the sample size for a certain level were zero, the estimated varying intercept would simply be the average across the other levels. We recommend @gelman2006data for an introduction to multilevel modeling.
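To give some intuition for this data-dependent pooling, it is useful to recall the classical approximation for the simpler normal (rather than logistic) multilevel model (@gelman2006data): the multilevel estimate for level $j$ is approximately a precision-weighted average of the no-pooling estimate for that level and the overall mean,
$$
\hat{\alpha}_j \approx \frac{\frac{n_j}{\sigma_y^2} \bar{y}_j + \frac{1}{\sigma_\alpha^2} \mu_\alpha}{\frac{n_j}{\sigma_y^2} + \frac{1}{\sigma_\alpha^2}},
$$
where $n_j$ and $\bar{y}_j$ are the sample size and sample mean in level $j$, and $\mu_\alpha$ is the mean of the varying intercepts (fixed at zero in our model). The logistic case has no such simple closed form, but the intuition carries over: when $n_j$ is large the estimate stays close to the level's own data, and when $n_j$ is small (or zero) it is pulled towards $\mu_\alpha$.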
The `rstanarm` package allows the user to conduct complicated regression analyses in Stan with the simplicity of standard formula notation in R. `stan_glmer()`, the function that allows us to fit generalized linear multilevel models, uses the same notation as the `lme4` package (see the documentation [here](https://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf)). That is, we specify the varying intercepts as `(1 | group)` and the interactions as `(1 | group1:group2)`, where the `:` operator creates a new grouping factor that consists of the combined levels of the two groups (i.e. this is the same as pasting together the levels of both factors). However, this syntax only accepts predictors at the individual level, and thus the two state-level predictors must be expanded to the individual level (see @gelman2006data, pp. 265-266). Notice that:
$$
\begin{aligned}
\alpha_{\rm s}^{\rm state} &\sim {\rm normal}(\gamma^0 + \gamma^{\rm south} \cdot {\rm South}_{\rm s} + \gamma^{\rm northcentral} \cdot {\rm NorthCentral}_{\rm s} + \gamma^{\rm west} \cdot {\rm West}_{\rm s} + \gamma^{\rm repvote} \cdot {\rm RepVote}_{\rm s}, \sigma^{\rm state}) \\
&= \underbrace{\gamma^0}_\text{Intercept} +
\underbrace{{\rm normal}(0, \sigma^{\rm state})}_\text{State varying intercept} +
\underbrace{\gamma^{\rm south} \cdot {\rm South}_{\rm s} + \gamma^{\rm northcentral} \cdot {\rm NorthCentral}_{\rm s} + \gamma^{\rm west} \cdot {\rm West}_{\rm s} + \gamma^{\rm repvote} \cdot {\rm RepVote}_{\rm s}}_\text{State-level predictors expanded to the individual level}
\end{aligned}
$$
Consequently, we can then reexpress the model as:
$$
\begin{aligned}
Pr(y_i = 1) =& logit^{-1}(
\gamma^0
+ \alpha_{\rm s[i]}^{\rm state}
+ \alpha_{\rm a[i]}^{\rm age}
+ \alpha_{\rm r[i]}^{\rm eth}
+ \alpha_{\rm e[i]}^{\rm educ}
+ \beta^{\rm male} \cdot {\rm Male}_{\rm i}
+ \alpha_{\rm g[i], r[i]}^{\rm male.eth}
+ \alpha_{\rm e[i], a[i]}^{\rm educ.age}
+ \alpha_{\rm e[i], r[i]}^{\rm educ.eth}
+ \gamma^{\rm south} \cdot {\rm South}_{\rm s} \\
&+ \gamma^{\rm northcentral} \cdot {\rm NorthCentral}_{\rm s}
+ \gamma^{\rm west} \cdot {\rm West}_{\rm s}
+ \gamma^{\rm repvote} \cdot {\rm RepVote}_{\rm s})
\end{aligned}
$$
In the previous version of the model, $\alpha_{\rm s[i]}^{\rm state}$ was informed by several state-level predictors. This reparameterization expands the state-level predictors to the individual level, and thus $\alpha_{\rm s[i]}^{\rm state}$ now represents the variation introduced by the state after adjusting for region and the 2016 Republican vote share. Similarly, $\gamma^0$, which previously represented the state-level intercept, now becomes the individual-level intercept. The two parameterizations of the multilevel model are mathematically equivalent, and using one or the other is simply a matter of preference. The former highlights the role that state-level predictors have in accounting for structured differences among the states, while the latter is closer to the `rstanarm` syntax.
```{r, cache=FALSE, warning=FALSE, message=FALSE, eval=TRUE}
# Expand state-level predictors to the individual level
cces_df <- left_join(cces_df, statelevel_predictors_df, by = "state")
```
```{r, cache=FALSE, warning=FALSE, message=FALSE, eval=FALSE}
# Fit in stan_glmer
fit <- stan_glmer(abortion ~ (1 | state) + (1 | eth) + (1 | educ) + (1 | age) + male +
(1 | male:eth) + (1 | educ:age) + (1 | educ:eth) +
repvote + factor(region),
family = binomial(link = "logit"),
data = cces_df,
prior = normal(0, 1, autoscale = TRUE),
prior_covariance = decov(scale = 0.50),
adapt_delta = 0.99,
refresh = 0,
seed = 1010)
```
```{r, echo = FALSE, eval=TRUE}
# We save the model for future use. By default we do not retrain the model; instead
# we load a previously saved version. To retrain it, set eval=TRUE in the previous
# chunk, uncomment the saveRDS() call below, and comment out the readRDS() call.
#saveRDS(fit, file = "data_public/chapter1/models/fit_mrp_1.rds")
fit <- readRDS("data_public/chapter1/models/fit_mrp_1.rds")
```
As a first pass to check whether the model is performing well, we must check that there are no warnings about divergences, failure to converge, or exceeding the maximum tree depth. Fitting the model with the default settings produced a few divergent transitions, and thus we decided to increase `adapt_delta` to 0.99 and introduce stronger priors than the `rstanarm` defaults. After doing this, the divergences disappeared. In the [Computational Issues]() subsection we provide more details about divergent transitions and potential solutions.
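As a minimal sketch of how these checks can be run by hand (these particular commands are illustrative and not part of the original workflow), we can count the post-warmup divergent transitions and inspect the split-$\hat{R}$ convergence diagnostic directly from the fitted object:
```{r, eval=FALSE}
# Post-warmup divergent transitions, summed across chains
sampler_params <- rstan::get_sampler_params(fit$stanfit, inc_warmup = FALSE)
sum(sapply(sampler_params, function(chain) sum(chain[, "divergent__"])))
# Largest split-Rhat across all parameters; values close to 1 indicate convergence
max(fit$stan_summary[, "Rhat"], na.rm = TRUE)
```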
```{r}
print(fit)
```
We can interpret the resulting model as follows:
* `Intercept` ($\gamma^0$): The global intercept corresponds to the expected outcome, on the logit scale, when all the predictors are equal to zero. In this case it does not have a clear interpretation, as it is also influenced by the varying intercepts for state, age, ethnicity, education, and their interactions. Furthermore, it corresponds to the impractical scenario of someone in a state with zero Republican vote share.
* `male` ($\beta^{\rm male}$): The median estimate for this coefficient is `r round(fit$coefficients["male"], 1)`, with a standard error (measured using the Mean Absolute Deviation) of `r round(fit$ses["male"], 1)`. Using the divide-by-four rule (@gelman2020raos, Chapter 13), we see that, adjusting for the other covariates, males present up to a `r round(fit$coefficients["male"], 1)/4*100`% $\pm$ `r round(fit$ses["male"], 1)/4*100`% higher probability of supporting the right of employers to decline coverage of abortions relative to females.
* `repvote` ($\gamma^{\rm repvote}$): As `repvote` is scaled between 0 and 1, this coefficient corresponds to the difference, on the logit scale, between someone in a state in which no one voted Republican and someone in a state in which everyone did. This comparison is not particularly meaningful, so we instead divide the coefficient by 10 and consider a 10-percentage-point difference in Republican vote share. Doing this, we expect that someone from a state with a 55% Republican vote share has approximately $\frac{`r round(fit$coefficients["repvote"], 1)`}{10}/4 = 4\%$ ($\pm `r round(fit$ses["repvote"], 1)/4*100`\%$) higher probability of supporting the statement relative to another individual with similar characteristics from a state in which Republicans received 45% of the vote.
* `regionSouth` ($\gamma^{\rm south}$): According to the model, we expect that someone from a state in the south has, adjusting for the other covariates, up to a `r round(fit$coefficients["factor(region)South"], 1)`/4 = `r round(round(fit$coefficients["factor(region)South"], 1)/4*100, 0)`% ($\pm$ `r round(round(fit$ses["factor(region)South"], 1)/4*100, 0)`%) higher probability of supporting the statement relative to someone from the Northeast, which was the baseline category. The interpretation for `regionNorthCentral` and `regionWest` is similar.
* `Error terms` ($\sigma^{\rm state}$, $\sigma^{\rm age}$, $\sigma^{\rm eth}$, $\sigma^{\rm educ}$, $\sigma^{\rm male.eth}$, $\sigma^{\rm educ.age}$, $\sigma^{\rm educ.eth}$): Remember that the intercepts for the different levels of state, age, ethnicity, education, and the specified interactions follow normal distributions centered at zero and with standard deviations that are estimated from the data. The `Error terms` section gives us the estimates for these group-level standard deviations. For instance, $\alpha_{\rm r}^{\rm eth} \sim {\rm normal}(0, \sigma^{\rm eth})$, where the estimate for $\sigma^{\rm eth}$ is `r round(sqrt(fit$stan_summary["Sigma[eth:(Intercept),(Intercept)]","mean"]), 3)`. In other words, the varying intercepts for the different ethnicity groups have a standard deviation that is estimated to be `r round(sqrt(fit$stan_summary["Sigma[eth:(Intercept),(Intercept)]","mean"]), 3)` on the logit scale, or `r round(sqrt(fit$stan_summary["Sigma[eth:(Intercept),(Intercept)]","mean"]), 3)`/4 = `r round(round(sqrt(fit$stan_summary["Sigma[eth:(Intercept),(Intercept)]","mean"]), 3)/4, 3)` on the probability scale. In some cases, we may also want to check the intercepts corresponding to the individual levels of a factor. In `rstanarm`, this can be done using `fit$coefficients`. For instance, the median values for the varying intercepts of ethnicity are $\alpha^{\rm eth}_{r = {\rm White}}$ = `r round(fit$coefficients["b[(Intercept) eth:White]"], 2)`, $\alpha^{\rm eth}_{r = {\rm Black}}$ = `r round(fit$coefficients["b[(Intercept) eth:Black]"], 2)`, $\alpha^{\rm eth}_{r = {\rm Hispanic}}$ = `r round(fit$coefficients["b[(Intercept) eth:Hispanic]"], 2)`, $\alpha^{\rm eth}_{r = {\rm Other}}$ = `r round(fit$coefficients["b[(Intercept) eth:Other]"], 2)`. A short sketch of how to summarize these varying intercepts together with their uncertainty is shown after this list.
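As a minimal sketch (relying on the `rstanarm` parameter names visible in `print(fit)`, e.g. `b[(Intercept) eth:White]`), the full posterior for the ethnicity varying intercepts can be extracted and summarized with posterior medians and 95% intervals:
```{r, eval=FALSE}
# Posterior draws for the four ethnicity varying intercepts
eth_draws <- as.matrix(fit, regex_pars = "^b\\[\\(Intercept\\) eth:")
# Posterior median and 95% credible interval for each intercept
t(apply(eth_draws, 2, quantile, probs = c(0.025, 0.5, 0.975)))
```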
## Second Stage: Poststratification {#second-stage}
### Estimation at the national level {.unnumbered #estimational-national-level}
So far, all we have achieved is a model that predicts support for allowing employers to decline abortion coverage based on a set of factor-type predictors. To go from this model to a national or subnational estimate, we need to weight the model predictions for the different subgroups by the actual frequency of these subgroups. This idea can be expressed as:
$$
\theta^{MRP} = \frac{\sum N_j \theta_j}{\sum N_j}
$$
where $\theta^{MRP}$ is the MRP estimate, $\theta_{\rm j}$ corresponds to the model estimate for a specific subgroup defined in a cell of the poststratification table (e.g. young Hispanic men with a High School diploma in Arkansas), and $N_{\rm j}$ corresponds to the number of people in that subgroup. For a more in-depth review of poststratification, see Chapter 13 of @gelman2020raos.
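As a toy illustration of this weighted average (with made-up numbers, not from our data), imagine only two cells, with model estimates 0.2 and 0.6 and population counts 100 and 300:
```{r, eval=FALSE}
theta_j <- c(0.2, 0.6)  # hypothetical cell-level estimates
N_j <- c(100, 300)      # hypothetical cell counts
sum(N_j * theta_j) / sum(N_j)  # 0.5, pulled towards the estimate of the larger cell
```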
The values of $\theta_{j}$ for the different subgroups can be obtained with the `posterior_epred()` function. Of course, as `stan_glmer()` performs Bayesian inference, $\theta_{j}$ for any given subgroup will not be a single point estimate but a vector of posterior draws.
```{r, message=FALSE, cache=FALSE, echo=TRUE}
# Expand state level predictors to the individual level
poststrat_df <- left_join(poststrat_df, statelevel_predictors_df, by = "state")
# Posterior_epred returns the posterior estimates for the different subgroups stored in the
# poststrat_df dataframe.
epred_mat <- posterior_epred(fit, newdata = poststrat_df, draws = 1000)
mrp_estimates_vector <- epred_mat %*% poststrat_df$n / sum(poststrat_df$n)
mrp_estimate <- c(mean = mean(mrp_estimates_vector), sd = sd(mrp_estimates_vector))
cat("MRP estimate mean, sd: ", round(mrp_estimate, 3))
```
`posterior_epred()` returns a matrix $P$ with $D$ rows and $J$ columns, where $D$ corresponds to the number of draws from the posterior distribution (in this case 1000, as we specified `draws = 1000`) and $J$ is the number of subgroups in the poststratification table (i.e. 12,000). This matrix, which was called `epred_mat` in our code, is multiplied by a vector $k$ of length $J$ that contains the number of people in each subgroup of the poststratification table. This results in a vector of length $D$ that is then divided by the total number of people considered in the poststratification table, a scalar which is calculated by adding all the values in $k$.
$$\theta^{MRP} = \frac{P \times k}{\sum_j^J k_j}$$
The end result is a vector that we call $\theta^{MRP}$, and which contains $D$ estimates for the national-level statement support.
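As a minimal sanity check on this matrix algebra (assuming the `epred_mat` and `mrp_estimates_vector` objects computed above), note that the poststratified value for any single posterior draw is simply a weighted mean of that draw's cell-level estimates:
```{r, eval=FALSE}
# P is D x J and k has length J
dim(epred_mat)
length(poststrat_df$n)
# For the first posterior draw, the weighted mean of the cell estimates...
weighted.mean(epred_mat[1, ], w = poststrat_df$n)
# ...matches the first element of the matrix-product version computed above
mrp_estimates_vector[1]
```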
We can compare these results to the 5,000-person unadjusted sample estimate:
```{r, message=FALSE, cache=FALSE, echo=FALSE}
# Remember that the standard error of a proportion is sqrt(p(1-p)/n). We define
# a function called get_se_bernoulli to obtain the SE based on p and n
get_se_bernoulli <- function(p, n){
return(sqrt(p*(1-p)/n))
}
sample_cces_estimate <- c(mean = mean(cces_df$abortion),
se = get_se_bernoulli(mean(cces_df$abortion), nrow(cces_df)))
cat("Unadjusted 5000-respondent survey mean, sd: ", (round(sample_cces_estimate, 3)))
```
Additionally, we compare with the population support estimated by the full CCES with close to 60,000 participants:
```{r, message=FALSE, cache=FALSE, echo=FALSE}
full_cces_estimate <- c(mean = mean(cces_all_df$abortion),
se = get_se_bernoulli(mean(cces_all_df$abortion), nrow(cces_all_df)))
cat("Unadjusted 60,000-respondent survey mean, sd: ", (round(full_cces_estimate, 3)))
```
In general, we see that both the unadjusted sample estimate and the MRP estimate are quite close to the results of the full survey. In other words, MRP is not providing a notable advantage over the unadjusted national sample estimate. However, it is important to clarify that we were somewhat lucky to obtain this result: it is a product of using data from the CCES, a high-quality survey that intends to be representative (and appears to be, at least with respect to the variables considered in our poststratification table). Many real-world surveys are not as representative with respect to the variables considered in the poststratification step, and in these cases MRP will help correct the biased estimates from the unadjusted survey. We will see an example of this in section 4, where we show how MRP adjusts a clearly biased sample.
### Estimation for subnational units {.unnumbered #estimational-subnational-level}
As we mentioned, small area estimation is one of the main applications of MRP. In this case, we will get an estimate of the support for employer's right to decline coverage of abortions per state:
$$
\theta_s^{MRP} = \frac{\sum_{j \in s} N_j \theta_j}{\sum_{j \in s} N_j}
$$
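As a minimal sketch of this state-level poststratification (assuming the `epred_mat` and `poststrat_df` objects from above), the estimate for a single state, say California, is obtained by restricting the prediction matrix and the cell counts to that state's rows; the same computation is then repeated for all 50 states to build the comparisons that follow.
```{r, eval=FALSE}
# Columns of epred_mat (and rows of poststrat_df) corresponding to California
ca_cells <- which(poststrat_df$state == "CA")
epred_mat_ca <- epred_mat[, ca_cells]
k_ca <- poststrat_df$n[ca_cells]
# Poststratified posterior for California: one value per posterior draw
ca_estimates <- epred_mat_ca %*% k_ca / sum(k_ca)
c(mrp_estimate = mean(ca_estimates), mrp_estimate_se = sd(ca_estimates))
```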
```{r, warning=FALSE, echo=FALSE, message=FALSE, fig.height=3.5, fig.width=10, fig.align = "center"}
# Create empty dataframe
states_df <- data.frame(
state = state_abb,
mrp_estimate = NA,
mrp_estimate_se = NA,
sample_cces_estimate = NA,
sample_cces_estimate_se = NA,
full_cces_estimate = NA,
full_cces_estimate_se = NA,
n_sample = NA,
n_full = NA
)
# Loop to populate the dataframe
for(i in 1:nrow(states_df)) {
# Currently, the matrix epred_mat and the poststratification table contain 12,000
# rows. We need to filter the ones that correspond to state in row i. We do so
# by defining the following condition:
filtering_condition <- which(poststrat_df$state == states_df$state[i])
# Filtering matrix epred_mat with filtering_condition
state_epred_mat <- epred_mat[ ,filtering_condition]
# Filtering poststratification table with filtering_condition
k_filtered <- poststrat_df[filtering_condition, ]$n
# Poststratification step
mrp_estimates_vector_sub <- state_epred_mat %*% k_filtered / sum(k_filtered)
# MRP estimate for state in row i
states_df$mrp_estimate[i] <- mean(mrp_estimates_vector_sub)
states_df$mrp_estimate_se[i] <- sd(mrp_estimates_vector_sub)
# 5,000-sample survey unadjusted estimate for state in row i
states_df$sample_cces_estimate[i] <- mean(
filter(cces_df, state==states_df$state[i])$abortion)
states_df$n_sample[i] <- nrow(filter(cces_df, state==states_df$state[i]))
states_df$sample_cces_estimate_se[i] <- get_se_bernoulli(
states_df$sample_cces_estimate[i], states_df$n_sample[i])
# 60,000-person survey unadjusted estimate for state in row i
states_df$full_cces_estimate[i] <- mean(
filter(cces_all_df, state==states_df$state[i])$abortion)
states_df$n_full[i] <- nrow(filter(cces_all_df,
state==states_df$state[i]))
states_df$full_cces_estimate_se[i] <- get_se_bernoulli(
states_df$full_cces_estimate[i], states_df$n_full[i])
}
```
We start by visualizing the estimates by state from the unadjusted 5,000-person sample. Again, the states are ordered by Republican vote share in the 2016 election, and therefore we expect statement support to follow an increasing trend.
```{r, warning=FALSE, echo=FALSE, message=FALSE, fig.height=3.5, fig.width=10, fig.align = "center"}
# Order states by republican voteshare
states_df$state <- fct_reorder(states_df$state, states_order)
compare1 <- ggplot(data=states_df) +
geom_point(aes(x=state, y=sample_cces_estimate), color = "#E37B1C") +
geom_errorbar(aes(ymin=sample_cces_estimate - 2*sample_cces_estimate_se,
ymax=sample_cces_estimate + 2*sample_cces_estimate_se,
x=state), alpha=.5, width = 0, color = "#E37B1C") +
scale_y_continuous(breaks=c(0,.25,.5,.75,1),
labels=c("0%","25%","50%","75%","100%"),
expand=c(0,0)) +
coord_cartesian(ylim=c(0, 1)) +
theme_bw() +
labs(x="States",y="Support")+
theme(legend.position="none",
axis.title=element_text(size=10),
axis.text.y=element_text(size=10),
axis.text.x=element_text(angle=90,size=8, vjust=0.3),
legend.title=element_text(size=10),
legend.text=element_text(size=10))
compare2 <- ggplot(data = data.frame())+
geom_point(aes(y=sample_cces_estimate[1], x = .25), color = "#E37B1C") +
geom_errorbar(data=data.frame(), aes(y = sample_cces_estimate[1],
x = .25,
ymin = sample_cces_estimate[1] - 2*sample_cces_estimate[2],
ymax = sample_cces_estimate[1] + 2*sample_cces_estimate[2]),
width = 0, color = "#E37B1C") +
geom_text(aes(x = Inf, y = sample_cces_estimate[1] + 0.06, label = "Unadjusted Sample"),
hjust = -.05, size = 4, color = "#E37B1C") +
scale_y_continuous(breaks=c(0,.25,.5,.75,1),
labels=c("0%","25%","50%","75%","100%"),
limits=c(0,1),expand=c(0,0))+
scale_x_continuous(limits=c(0,1),expand=c(0,0), breaks=c(.25, .75)) +
coord_cartesian(clip = 'off') +
theme_bw()+
labs(x="Population",y="")+
theme(legend.position="none",
axis.title.y=element_blank(),
axis.title.x=element_text(size=10, margin = margin(t = 19, r = 0, b = , l = 0)),
axis.text=element_blank(),
axis.ticks=element_blank(),
legend.title=element_text(size=10),
legend.text=element_text(size=10),
plot.margin = margin(5.5, 105, 5.5, 5.5, "pt")
)
bayesplot_grid(compare1,compare2,
grid_args = list(nrow=1, widths = c(5,1.4)))
```
In states with small samples, we see considerably wide 95\% confidence intervals. We can add the MRP-adjusted estimates to this plot.
```{r, warning=FALSE, echo=FALSE, message=FALSE, fig.height=3.5, fig.width=10, fig.align = "center"}
compare1 <- ggplot(data=states_df) +
geom_point(aes(x=state, y=sample_cces_estimate), color = "#E37B1C") +
geom_errorbar(aes(ymin=sample_cces_estimate - 2*sample_cces_estimate_se,
ymax=sample_cces_estimate + 2*sample_cces_estimate_se,
x=state), alpha=.5, width = 0, color = "#E37B1C") +
geom_point(data=states_df, aes(x=state, y=mrp_estimate), color = "#7B1CE3") +
geom_errorbar(data=states_df, aes(ymin=mrp_estimate - 2*mrp_estimate_se,
ymax=mrp_estimate + 2*mrp_estimate_se,
x=state), alpha=.5, width = 0, color = "#7B1CE3") +
scale_y_continuous(breaks=c(0,.25,.5,.75,1),
labels=c("0%","25%","50%","75%","100%"),
expand=c(0,0))+
coord_cartesian(ylim=c(0, 1)) +
theme_bw()+
labs(x="States",y="Support")+
theme(legend.position="none",
axis.title=element_text(size=10),
axis.text.y=element_text(size=10),
axis.text.x=element_text(angle=90,size=8, vjust=0.3),
legend.title=element_text(size=10),
legend.text=element_text(size=10))
compare2 <- ggplot(data = data.frame())+
geom_point(aes(y=sample_cces_estimate[1], x = .25), color = "#E37B1C") +
geom_errorbar(data=data.frame(), aes(y = sample_cces_estimate[1],
x = .25,
ymin = sample_cces_estimate[1] - 2*sample_cces_estimate[2],
ymax = sample_cces_estimate[1] + 2*sample_cces_estimate[2]),
width = 0, color = "#E37B1C") +
geom_text(aes(x = Inf, y = sample_cces_estimate[1]+0.06, label = "Unadjusted Sample"),
hjust = -.05, size = 4, color = "#E37B1C") +
geom_point(aes(y = mrp_estimate[1], x = .75), color = "#7B1CE3") +
geom_errorbar(aes(y = mrp_estimate[1],
x = .75,
ymin = mrp_estimate[1] - 2*mrp_estimate[2],
ymax = mrp_estimate[1] + 2*mrp_estimate[2]),
width = 0, color = "#7B1CE3") +
geom_text(data = data.frame(), aes(x = Inf, y = mrp_estimate[1]+.0, label = "Sample with MRP"),
hjust = -.05, size = 4, color = "#7B1CE3") +
scale_y_continuous(breaks=c(0,.25,.5,.75,1),
labels=c("0%","25%","50%","75%","100%"),
limits=c(0,1),expand=c(0,0))+
scale_x_continuous(limits=c(0,1),expand=c(0,0), breaks=c(.25, .75)) +
coord_cartesian(clip = 'off') +
theme_bw()+
labs(x="Population",y="")+
theme(legend.position="none",
axis.title.y=element_blank(),
axis.title.x=element_text(size=10, margin = margin(t = 19, r = 0, b = , l = 0)),
axis.text=element_blank(),
axis.ticks=element_blank(),
legend.title=element_text(size=10),
legend.text=element_text(size=10),
plot.margin = margin(5.5, 105, 5.5, 5.5, "pt")
)
bayesplot_grid(compare1,compare2,
grid_args = list(nrow=1, widths = c(5,1.4)))
```
In general, MRP produces less extreme values by partially pooling information across the factor levels. To illustrate this, we can compare the sample and MRP estimates with the results from the full 60,000-respondent CCES. Of course, in any applied situation we would be using the data from all the participants, but as we took a 5,000-person sample the full 60,000-respondent survey serves as a reference point.
```{r, warning=FALSE, echo=FALSE, message=FALSE, fig.height=3.5, fig.width=10, fig.align = "center", results = 'hide'}
compare1 <- ggplot(data=states_df) +
geom_point(aes(x=state, y=sample_cces_estimate), color = "#E37B1C") +
geom_errorbar(aes(ymin=sample_cces_estimate - 2*sample_cces_estimate_se,
ymax=sample_cces_estimate + 2*sample_cces_estimate_se,
x=state), alpha=.5, width = 0, color = "#E37B1C") +
geom_point(data=states_df, aes(x=state, y=mrp_estimate), color = "#7B1CE3") +
geom_errorbar(data=states_df, aes(ymin=mrp_estimate - 2*mrp_estimate_se,
ymax=mrp_estimate + 2*mrp_estimate_se,
x=state), alpha=.5, width = 0, color = "#7B1CE3") +
geom_point(aes(x=state, y=full_cces_estimate), color = "#1CE37B") +
geom_errorbar(data=states_df, aes(ymin=full_cces_estimate - 2*full_cces_estimate_se,
ymax=full_cces_estimate + 2*full_cces_estimate_se,
x=state), alpha=.5, width = 0, color = "#1CE37B") +
scale_y_continuous(breaks=c(0,.25,.5,.75,1),
labels=c("0%","25%","50%","75%","100%"),
expand=c(0,0))+
coord_cartesian(ylim=c(0, 1)) +
theme_bw()+
labs(x="States",y="Support")+
theme(legend.position="none",
axis.title=element_text(size=10),
axis.text.y=element_text(size=10),
axis.text.x=element_text(angle=90,size=8, vjust=0.3),
legend.title=element_text(size=10),
legend.text=element_text(size=10))
compare2 <- ggplot(data = data.frame()) +
geom_point(aes(y=sample_cces_estimate[1], x = .25), color = "#E37B1C") +
geom_errorbar(data=data.frame(), aes(y = sample_cces_estimate[1],
x = .25,
ymin = sample_cces_estimate[1] - 2*sample_cces_estimate[2],
ymax = sample_cces_estimate[1] + 2*sample_cces_estimate[2]),
width = 0, color = "#E37B1C") +
geom_text(aes(x = Inf, y = sample_cces_estimate[1]+0.06, label = "Unadjusted Sample"),
hjust = -.05, size = 4, color = "#E37B1C") +
geom_point(aes(y = mrp_estimate[1], x = .75), color = "#7B1CE3") +
geom_errorbar(aes(y = mrp_estimate[1],
x = .75,
ymin = mrp_estimate[1] - 2*mrp_estimate[2],
ymax = mrp_estimate[1] + 2*mrp_estimate[2]),
width = 0, color = "#7B1CE3") +
geom_text(data = data.frame(), aes(x = Inf, y = mrp_estimate[1]+.0, label = "Sample with MRP"),
hjust = -.05, size = 4, color = "#7B1CE3") +
scale_y_continuous(breaks=c(0,.25,.5,.75,1),
labels=c("0%","25%","50%","75%","100%"),
limits=c(0,1),expand=c(0,0)) +
geom_point(data = data.frame(), aes(y=full_cces_estimate[1], x = .5), color = "#1CE37B") +
geom_errorbar(data = data.frame(), aes(y = full_cces_estimate[1],
x = .5,
ymin = full_cces_estimate[1] - 2*full_cces_estimate[2],
ymax = full_cces_estimate[1] + 2*full_cces_estimate[2]),
width = 0, color = "#1CE37B") +
geom_text(data = data.frame(), aes(x = Inf, y = full_cces_estimate-0.06, label = "Complete Survey"),
hjust = -.06, size = 4, color = "#1CE37B") +
scale_y_continuous(breaks=c(0,.25,.5,.75,1),
labels=c("0%","25%","50%","75%","100%"),
limits=c(0,1),expand=c(0,0))+
scale_x_continuous(limits=c(0,1),expand=c(0,0), breaks=c(.25, .75)) +
coord_cartesian(clip = 'off') +
theme_bw() +
labs(x="Population",y="")+
theme(legend.position="none",
axis.title.y=element_blank(),
axis.title.x=element_text(size=10, margin = margin(t = 19, r = 0, b = , l = 0)),
axis.text=element_blank(),
axis.ticks=element_blank(),
legend.title=element_text(size=10),
legend.text=element_text(size=10),
plot.margin = margin(5.5, 105, 5.5, 5.5, "pt")
)
bayesplot_grid(compare1,compare2,
grid_args = list(nrow=1, widths = c(5,1.4)))
```
Overall, the MRP estimates are closer to the full survey estimates. This is particularly clear for the states with a smaller sample size.
As a final way of presenting the MRP estimates, we can plot them on a US map. The symmetric color range goes from 10% to 90% support, as this scale allows for comparison with the other maps. However, the MRP estimates for statement support are concentrated in a relatively small range, which makes the colors appear muted.
```{r, fig.height=4, fig.width=6, echo=FALSE, warning=FALSE, message=FALSE, fig.align = "center"}
library(usmap)
# Load map and merge data
states_map <- us_map(regions = "states")
states_df_melted <- states_df %>% select(state, mrp_estimate)
states_map <- left_join(states_map, states_df_melted, by = c("abbr" = "state")) %>% drop_na()
# Plot
ggplot(states_map, aes(x = x, y = y, group = group)) +
geom_polygon(colour = "lightgray") +
geom_polygon(aes(fill = mrp_estimate)) + theme_void() +
scale_fill_gradient2(midpoint = 0.5, limits = c(0.1, .9), breaks = c(.1, .5, .9),
name = "Support", low = muted("blue"), high = muted("red")) +
theme(legend.margin=margin(l = 0.5, unit='cm'))
```
### Estimation for subgroups within subnational units {.unnumbered #estimational-subgroup-level}
MRP can also be used to obtain estimates for more complex cases, such as subgroups within states. For instance, we can study support for declining coverage of abortions by ethnicity within each state. For clarity, we order the ethnicity groups according to their support for the statement.
```{r, message=FALSE, warning=FALSE, echo=FALSE, cache=FALSE}
# Create dataframe for all combinations of state and ethnicity
subgroups_df <- cces_df %>% expand(state, eth) %>%
mutate(mrp_subgroup_estimate = NA,
mrp_subgroup_estimate_se = NA)
# Loop to populate the dataframe
for(i in 1:nrow(subgroups_df)) {
# Filtering and poststratification
filtering_condition <- which(poststrat_df$state == subgroups_df$state[i] &
poststrat_df$eth == subgroups_df$eth[i])
epred_mat_filtered <- epred_mat[, filtering_condition]
k_filtered <- poststrat_df[filtering_condition, ]$n
mrp_subgroup_estimates_vector <- epred_mat_filtered %*% k_filtered / sum(k_filtered)
# Estimates for MRP
subgroups_df$mrp_subgroup_estimate[i] <- mean(mrp_subgroup_estimates_vector)
subgroups_df$mrp_subgroup_estimate_se[i] <- sd(mrp_subgroup_estimates_vector)
}
```
```{r, message=FALSE, warning=FALSE, echo=FALSE, cache=FALSE, fig.height=1.9, fig.width=9, fig.align = "center"}
# Load map and merge data
states_map <- us_map(regions = "states")
subgroups_df_melted <- subgroups_df %>% select(state, mrp_subgroup_estimate, eth)
states_map <- left_join(states_map, subgroups_df_melted, by = c("abbr" = "state")) %>% drop_na()
# Declare order for ethnicity
states_map$eth <- factor(states_map$eth,
levels = c("Black", "Hispanic", "Other", "White"),
labels = c("Black", "Hispanic", "Other", "White"))
# Plot
ggplot(states_map, aes(x = x, y = y, group = group)) +
geom_polygon(colour = "lightgray") +
geom_polygon(aes(fill = mrp_subgroup_estimate)) + theme_void() + facet_grid(cols = vars(eth)) +
scale_fill_gradient2(midpoint = 0.5, limits = c(0.1, .9), breaks = c(.1, .5, .9),
name = "Support", low = muted("blue"), high = muted("red")) +
theme(legend.margin=margin(l = 0.5, unit='cm'), legend.position = "none")
```
Similarly, we can look at the outcome in ethnicity-education subgroups by state.
```{r, message=FALSE, warning=FALSE, echo=FALSE}
# Make the educ column in the poststrat_df data frame an ordered factor. This allows a comparison
# with the educ column in the subgroups_df data frame, which is already an ordered factor. We also
# relabel the levels for prettier plotting
poststrat_df$educ <- factor(poststrat_df$educ,
levels = c("No HS", "HS", "Some college", "4-Year College", "Post-grad"),
labels = c("No HS", "HS", "Some\ncollege", "4-Year\ncollege", "Post-grad"),
ordered = TRUE)
# Create dataframe for all combinations of state, ethnicity, and education
subgroups_df <- poststrat_df %>% expand(state, eth, educ) %>%
mutate(mrp_subgroup_estimate = NA,
mrp_subgroup_estimate_se = NA)
# Loop to populate the dataframe
for(i in 1:nrow(subgroups_df)) {
# Filtering and poststratification
filtering_condition <- which(poststrat_df$state == subgroups_df$state[i] &
poststrat_df$eth == subgroups_df$eth[i] &
poststrat_df$educ == subgroups_df$educ[i])
epred_mat_filtered <- epred_mat[, filtering_condition]
k_filtered <- poststrat_df[filtering_condition, ]$n
mrp_subgroup_estimates_vector <- epred_mat_filtered %*% k_filtered / sum(k_filtered)
# Estimates for MRP
subgroups_df$mrp_subgroup_estimate[i] <- mean(mrp_subgroup_estimates_vector)
subgroups_df$mrp_subgroup_estimate_se[i] <- sd(mrp_subgroup_estimates_vector)
}
```
```{r, message=FALSE, warning=FALSE, echo=FALSE, cache=FALSE, fig.height=11, fig.width=12, fig.align = "center"}
# Load map and merge data
states_map <- us_map(regions = "states")
subgroups_df_melted <- subgroups_df %>% select(state, mrp_subgroup_estimate, eth, educ)
states_map <- left_join(states_map, subgroups_df_melted, by = c("abbr" = "state")) %>% drop_na()
# Declare order for ethnicity
states_map$eth <- factor(states_map$eth,
levels = c("Black", "Hispanic", "Other", "White"),
labels = c("Black", "Hispanic", "Other", "White"))
ggplot(states_map, aes(x = x, y = y, group = group)) +
geom_polygon(colour = "lightgray") +
geom_polygon(aes(fill = mrp_subgroup_estimate)) + theme_void() + facet_grid(vars(educ), vars(eth)) +
scale_fill_gradient2(midpoint = 0.5, limits = c(0.1, .9), breaks = c(.1, .5, .9),
name = "Support", low = muted("blue"), high = muted("red")) +
theme(legend.margin=margin(l = 0.5, unit='cm'), legend.position = "none")
```
## Adjusting for Nonrepresentative Surveys {#nonrepresentative-survey}
We have already seen that MRP is an effective statistical adjustment method for correcting differences between the sample and the target population on a set of key variables. We start this second example by drawing an artificially nonrepresentative sample that gives more weight to respondents who are older, male, and from more Republican states.
```{r, warning=FALSE}
set.seed(1010)
# We add the state-level predictors to cces_all_df
cces_all_df <- left_join(cces_all_df, statelevel_predictors_df, by = "state")
# We take a sample from cces_all_df giving extra weight to respondents that are older, male,
# and from Republican states.
cces_biased_df <- cces_all_df %>%
sample_n(5000, weight = I(5*repvote + (age=="18-29")*0.5 + (age=="30-39")*1 +
(age=="40-49")*2 + (age=="50-59")*4 +
(age=="60-69")*6 + (age=="70+")*8 + (male==1)*20 +
(eth=="White")*1.05))
```
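Before adjusting with MRP, it is useful to check how far this biased sample departs from the population on the variables we will adjust for. A minimal sketch for age, assuming the objects created above (the dumbbell plots that follow make this comparison for all of the adjustment variables):
```{r, eval=FALSE}
# Compare the age distribution in the biased sample with the population
# distribution implied by the poststratification table
cces_biased_df %>%
  group_by(age) %>%
  summarise(n_sample = n()) %>%
  mutate(sample_prop = n_sample / sum(n_sample)) %>%
  left_join(poststrat_df %>%
              group_by(age) %>%
              summarise(n_pop = sum(n)) %>%
              mutate(population_prop = n_pop / sum(n_pop)),
            by = "age") %>%
  select(age, sample_prop, population_prop)
```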
```{r, fig.width=14, fig.height=3.5, echo=FALSE, fig.align = "center", warning=FALSE, cache=FALSE}
# Age
age_sample <- cces_biased_df %>% mutate(age = factor(age, ordered = FALSE)) %>% group_by(age) %>% summarise(n = n()) %>% mutate(Sample = n/sum(n))
age_post <- poststrat_df %>% mutate(age = factor(age, ordered = FALSE)) %>% group_by(age) %>% summarise(n_post = sum(n)) %>% mutate(Population = n_post/sum(n_post))
age <- inner_join(age_sample, age_post, by = "age") %>% select(age, Sample, Population)
age_plot <- ggplot() +
ylab("") + xlab("Proportion") + theme_bw() + coord_flip() +
geom_dumbbell(data = age, aes(y = age, x = Sample, xend = Population)) +
geom_point(data = melt(age, id = "age"), aes(y = age, x = value, color = variable), size = 2) +
scale_x_continuous(limits = c(0, 0.35), breaks = c(0, .1, .2, .3)) + theme(legend.position = "none") + ggtitle("Age")
# Gender
male_sample <- cces_biased_df %>% group_by(male) %>% summarise(n = n()) %>% mutate(Sample = n/sum(n))
male_post <- poststrat_df %>% group_by(male) %>% summarise(n_post = sum(n)) %>% mutate(Population = n_post/sum(n_post))
male <- inner_join(male_sample, male_post, by = "male") %>% select(male, Sample, Population) %>%
mutate(male = factor(male, levels = c(-0.5, +0.5), labels = c("Female", "Male")))
male_plot <- ggplot() +
ylab("") + xlab("") + theme_bw() + coord_flip() +
geom_dumbbell(data = male, aes(y = male, x = Sample, xend = Population)) +
geom_point(data = melt(male, id = "male"), aes(y = male, x = value, color = variable), size = 2) +
scale_x_continuous(limits = c(0, 0.6), breaks = c(0, .2, .4, .6)) + theme(legend.position = "none") + ggtitle("Gender")
# Ethnicity
ethnicity_sample <- cces_biased_df %>% group_by(eth) %>% summarise(n = n()) %>% mutate(Sample = n/sum(n))
ethnicity_post <- poststrat_df %>% group_by(eth) %>% summarise(n_post = sum(n)) %>% mutate(Population = n_post/sum(n_post))
ethnicity <- inner_join(ethnicity_sample, ethnicity_post, by = "eth") %>% select(eth, Sample, Population)
ethnicity$eth <- factor(ethnicity$eth,
levels = c("Black", "Hispanic", "Other", "White"),
labels = c("Black", "Hispanic", "Other", "White"))
ethnicity_plot <- ggplot() +
ylab("") + xlab("") + theme_bw() + coord_flip() +
geom_dumbbell(data = ethnicity, aes(y = eth, x = Sample, xend = Population)) +
geom_point(data = melt(ethnicity, id = "eth"), aes(y = eth, x = value, color = variable), size = 2) +
scale_x_continuous(limits = c(0, 0.9), breaks = c(0, .2, .4, .6, 0.8)) + theme(legend.position = "none") + ggtitle("Ethnicity")
# Education
educ_sample <- cces_biased_df %>% mutate(educ = factor(educ, ordered = FALSE)) %>% group_by(educ) %>% summarise(n = n()) %>% mutate(Sample = n/sum(n))
educ_post <- poststrat_df %>% mutate(educ = factor(educ, ordered = FALSE)) %>% group_by(educ) %>% summarise(n_post = sum(n)) %>% mutate(Population = n_post/sum(n_post))
educ <- inner_join(educ_sample, educ_post, by = "educ") %>% select(educ, Sample, Population)
educ$educ <- factor(educ$educ,
levels = c("No HS", "HS", "Some college", "4-Year College", "Post-grad"),
labels = c("No HS", "HS", "Some\nCollege", "4-year\nCollege", "Post-grad"))
educ_plot <- ggplot() +
ylab("") + xlab("") + theme_bw() + coord_flip() +
geom_dumbbell(data = educ, aes(y = educ, x = Sample, xend = Population)) +
geom_point(data = melt(educ, id = "educ"), aes(y = educ, x = value, color = variable), size = 2) +
scale_x_continuous(limits = c(0, 0.33), breaks = c(0, .1, .2, .3)) + theme(legend.position = "none") + ggtitle("Education")
grid.arrange(age_plot, male_plot, ethnicity_plot, educ_plot,
widths = c(1.5, 0.75, 1.5, 1.5))
```
```{r, fig.width=14, fig.height=3.5, echo=FALSE, fig.align = "center", warning=FALSE}
state_sample <- cces_biased_df %>% group_by(state) %>% summarise(n = n()) %>% mutate(Sample = n/sum(n))
state_post <- poststrat_df %>% group_by(state) %>% summarise(n_post = sum(n)) %>% mutate(Population = n_post/sum(n_post))
state <- full_join(state_sample, state_post, by = "state") %>% select(state, Sample, Population)
state$state <- fct_reorder(state$state, states_order)
ggplot() +
ylab("") + xlab("Proportion") + theme_bw() + coord_flip() +
geom_dumbbell(data = state, aes(y = state, x = Sample, xend = Population)) +
geom_point(data = melt(state, id = "state"), aes(y = state, x = value, color = variable), size = 2) +
scale_x_continuous(limits = c(0, 0.13), breaks = c(0, .025, .05, .075, .1, .125)) + ggtitle("State") +
theme(legend.position = "bottom", legend.title=element_blank())
```
```{r, cache=FALSE, warning=FALSE, message=FALSE, eval=FALSE}
fit <- stan_glmer(abortion ~ (1 | state) + (1 | eth) + (1 | educ) + (1 | age) + male +
(1 | male:eth) + (1 | educ:age) + (1 | educ:eth) +
repvote + factor(region),
family = binomial(link = "logit"),
data = cces_biased_df,
prior = normal(0, 1, autoscale = TRUE),
prior_covariance = decov(scale = 0.50),
adapt_delta = 0.99,
refresh = 0,
seed = 1010)
```
```{r, echo = FALSE, eval=TRUE}
# We save the model for future use. By default we do not retrain the model; we
# simply load the previously saved version. To retrain it, set eval=TRUE in the
# previous chunk and, in this chunk, uncomment the saveRDS() line and comment
# out the readRDS() line.
#saveRDS(fit, file = "data_public/chapter1/models/fit_mrp_2.rds")
fit <- readRDS("data_public/chapter1/models/fit_mrp_2.rds")
```
```{r, cache = TRUE, echo = FALSE, warning=FALSE, message=FALSE}
# National
epred_mat <- posterior_epred(fit, newdata = poststrat_df, draws = 4000)
mrp_estimates_vector <- epred_mat %*% poststrat_df$n / sum(poststrat_df$n)
mrp_estimate <- c(mean = mean(mrp_estimates_vector), sd = sd(mrp_estimates_vector))
cces_estimate <- c(mean = mean(cces_biased_df$abortion), se = get_se_bernoulli(mean(cces_biased_df$abortion), nrow(cces_biased_df)))
full_cces_estimate <- c(mean = mean(cces_all_df$abortion), se = get_se_bernoulli(mean(cces_all_df$abortion), nrow(cces_all_df)))
# By state
states_df <- data.frame(
state = state_abb,
mrp_estimate = NA,
mrp_estimate_se = NA,
cces_estimate = NA,
cces_estimate_se = NA,
full_cces_estimate = NA,
full_cces_estimate_se = NA,
n_sample = NA,
n_full = NA
)
for(i in 1:nrow(states_df)) {
filtering_condition <- which(poststrat_df$state == states_df$state[i])
state_epred_mat <- epred_mat[ ,filtering_condition]
k_filtered <- poststrat_df[filtering_condition, ]$n
mrp_estimates_vector <- state_epred_mat %*% k_filtered / sum(k_filtered)
# MRP estimate
states_df$mrp_estimate[i] <- mean(mrp_estimates_vector)
states_df$mrp_estimate_se[i] <- sd(mrp_estimates_vector)
  # Unadjusted estimate from the biased 5,000-person sample
states_df$cces_estimate[i] <- mean(filter(cces_biased_df, state==states_df$state[i])$abortion)
states_df$n_sample[i] <- nrow(filter(cces_biased_df, state==states_df$state[i]))
states_df$cces_estimate_se[i] <- get_se_bernoulli(states_df$cces_estimate[i], states_df$n_sample[i])
# Full 60,000-person survey unadjusted estimate
states_df$full_cces_estimate[i] <- mean(filter(cces_all_df, state==states_df$state[i])$abortion)
states_df$n_full[i] <- nrow(filter(cces_all_df, state==states_df$state[i]))
states_df$full_cces_estimate_se[i] <- get_se_bernoulli(states_df$full_cces_estimate[i], states_df$n_full[i])
}
```
As expected, our deliberately nonrepresentative sample overrepresents groups that are more supportive of the statement, and consequently its unadjusted estimates are higher than the ones we obtained from the random sample in the previous section.
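A quick numerical check is to place the three national estimates computed above side by side. A sketch, not run by default, using the `mrp_estimate`, `cces_estimate`, and `full_cces_estimate` objects created earlier:
```{r, eval=FALSE}
# National estimates: MRP, unadjusted biased sample, and unadjusted full survey,
# each with its corresponding uncertainty (posterior sd or binomial standard error)
data.frame(
  estimate = c(mrp_estimate["mean"], cces_estimate["mean"], full_cces_estimate["mean"]),
  error    = c(mrp_estimate["sd"],   cces_estimate["se"],   full_cces_estimate["se"]),
  row.names = c("MRP", "Biased sample (unadjusted)", "Full survey (unadjusted)")
)
```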
```{r, warning=FALSE, echo=FALSE, message=FALSE, fig.height=3.5, fig.width=10, fig.align = "center", cache=FALSE}
# Order states by republican voteshare
states_df$state <- fct_reorder(states_df$state, states_order)
compare1 <- ggplot(data=states_df) +
geom_point(aes(x = state, y = cces_estimate), color = "#E37B1C") +
geom_errorbar(aes(ymin=cces_estimate - 2*cces_estimate_se,
ymax=cces_estimate + 2*cces_estimate_se,
x=state), alpha=.5, width = 0, color = "#E37B1C") +
scale_y_continuous(breaks=c(0,.25,.5,.75,1),
labels=c("0%","25%","50%","75%","100%"),
expand=c(0,0)) +
coord_cartesian(ylim=c(0, 1)) +
theme_bw() +
labs(x="States",y="Support")+
theme(legend.position="none",
axis.title=element_text(size=10),
axis.text.y=element_text(size=10),
axis.text.x=element_text(angle=90,size=8, vjust=0.3),
legend.title=element_text(size=10),
legend.text=element_text(size=10))
compare2 <- ggplot(data = data.frame())+
geom_point(aes(y = cces_estimate[1], x = .25), color = "#E37B1C") +
geom_errorbar(data = data.frame(), aes(y = cces_estimate[1],
x = .25,
ymin = cces_estimate[1] - 2*cces_estimate[2],
ymax = cces_estimate[1] + 2*cces_estimate[2]),
width = 0, color = "#E37B1C") +
geom_text(aes(x = Inf, y = cces_estimate[1] + 0.06, label = "Unadjusted Sample"),
hjust = -.05, size = 4, color = "#E37B1C") +
scale_y_continuous(breaks=c(0,.25,.5,.75,1),
labels=c("0%","25%","50%","75%","100%"),
limits=c(0,1),expand=c(0,0))+
scale_x_continuous(limits=c(0,1),expand=c(0,0), breaks=c(.25, .75)) +
coord_cartesian(clip = 'off') +
theme_bw()+