-
Notifications
You must be signed in to change notification settings - Fork 43
/
Copy pathICI3D_RTutorial_5_DataCleaning.R
356 lines (290 loc) · 13.4 KB
/
ICI3D_RTutorial_5_DataCleaning.R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
## Data management and cleaning in R
## Clinic on the Meaningful Modeling of Epidemiological Data
## International Clinics on Infectious Disease Dynamics and Data (ICI3D) Program
## African Institute for Mathematical Sciences, Muizenberg, RSA
##
## (C) Juliet Pulliam, 2011, 2012, 2017
##
## The goal of this tutorial is to acquaint you with ways of
## manipulating and cleaning infectious disease data in R. By
## the end of the tutorial you should be able to:
##
## * import CSV files into R
## * ensure imported data is interpreted correctly by R
## * identify errors and inconsistencies in a data frame
## * correct identified data errors
##
## NOTE: The comments will guide you through the tutorial but you
## should make sure you understand what the code is doing. Some
## function arguments are assigned to ?. This will give you an
## error. You should try out values for these arguments as
## suggested in the comments or find them yourselves in a help file.
######################################################################
## First, you need to import the data you will be working with.
## You should make sure you are in the same
## working directory as your data.
## You presumably downloaded the data and this script to the same place
## You can make sure R studio is "there" by clicking on Session/Set Working
## Directory/To Source File Location
## If you have files in a different place than the script, you can use
##Session/Set Working Directory (or a keyboard shortcut) to navigate there.
## We generally don't recommend using setwd() in scripts, because it causes
## difficulties in sharing and collaboration.
## For this tutorial, we will be using an example dataset that
## resembles (but is not) real data from a measles epidemic in
## Democratic Republic of Congo. The "data" are made available only
## to ICI3D participants and are for use only during the MMED and
## DAIDD clinics, and only for the specified purposes.
## We will be using the 'tidyverse' family of R functions to read in
## and clean the data. Let's get started.
## Data are available at https://github.com/ICI3D/datasets under the dataCleaning folder
library(tidyverse) # Load the tidyverse package
dat <- read_csv("tutorial5.csv") # read in the "data"
## The read_csv() function reads data in as a tibble, a data
## structure that is very useful because it can contain different
## data types in different columns (similar to a data frame, but unlike
## a matrix or array).
######################################################################
## Let's look at the data to get a feeling for the size of the
## data set and what variables we have to work with.
dim(dat) # Determine the number of rows and columns
dat # Look at the beginning of the dataset
## For each individual in our data set, we have the following
## information:
##
## * Province - the province where the case occurred
## * ZS - the health zone where the case occurred
## * Age_annes - the person's age in years
## * Age_Mois - the person's age in months
## * Sexe - the person's sex
## * Date_dern_vacci - the date when the person was vaccinated against
## measles
## * DDS - the date of symptom onset
## * test - the result of a diagnostic test for IgM antibodies against
## measles
## The read_csv() function has read in most of our data as characters.
## Note that this is a different behavior from what would have happened
## using the read.csv() function demonstrated in Thumbi's lecture, and
## this fixes the problem he demonstrated where factors are sometimes
## treated as numbers when they shouldn't be. It also gives us some
## useful information about problems we're likely to discover with
## the dataset as we explore further - for example, the fact that the
## Age_annes column has been read in as characters indicates there are
## likely to be some problematic entries.
## We should always go through our data sets to ensure that
## R treats each variable as the appropriate data type. To do this,
## it will help to know the different ways in which R can store and
## interpret data. To see a list of the options for data types:
?typeof
#########
## Hands off the keyboard! Pick up a writing pad and...
#########
##
## Decide what data type each variable in the data set should be.
## Refer to the lectures on study design or data cleaning, if necessary.
##
#########
## Before we start working with the data itself, let's rename the columns
## so they're easier to interpret.
dat <- rename(dat
, province = Province
, healthZone = ZS
, ageYears = Age_annes
, ageMonths = Age_Mois
, sex = Sexe
, dateVaccinated = Date_dern_vacci
, dateOnset = DDS
, testResultIgM = test
)
## Here is a cleaned up version of the data key, with the new
## column names and information about the data types we will end up
## with after cleaning the data:
##
## * province - the province where the case occurred (character)
## * healthZone - the health zone where the case occurred (character)
## * ageYears - the person's age in years (integer)
## * ageMonths - the person's age in months (integer)
## * sex - the person's sex (character)
## * dateVaccinated - the date when the person was vaccinated against
## measles (character)
## * dateOnset - the date of symptom onset (character)
## * testResult - the result of a diagnostic test for IgM antibodies against
## measles
##
## We will also create some new variables as we clean the data:
##
## * ageYearsContinuous - the person's (approximate) age in years, as a
## continuous
## variable (double)
## * vaccinationStatus - whether the person had been previously
## vaccinated (logical)
## * testStatus - additional information about the person's diagnostic
## status (character)
######################################################################
##
## Let's begin by looking at ageYears, which should be stored as a
## numeric value.
mode(dat$ageYears) # determine the storage mode of a variable
## As noted above, ageYears is being treated as a character, which
## likely indicates a problem with some of the entries; let's take
## a closer look:
distinct(dat,ageYears)
## The distinct() function lists the unique values for a variable or
## or set of variables. So far, everything looks fine, but the
## function only shows us as much of the data as will fit in the
## console window. We need to look at all of the values to indentify
## what's going on:
## CONSOLE (you can try View() instead of print())
print(
dat
%>% distinct(ageYears)
)
## Scrolling down, we can see the problem: some of the entries have a
## value of 'ND' which of course is not a number, so read_csv()
## decided to treat the entire column as characters. In this dataset,
## 'ND' is a way of denoting missing values, and we can correct
## this easily by telling R that everything in this column should be
## a number. To see what this will do, let's look at the unique values
## again after we cast ageYears as an integer:
## CONSOLE
View(
dat
%>% distinct(as.integer(ageYears))
)
## We see that 'ND' is no longer in the list of values and has been
## replaced by NA, which stands for 'not applicable' and is R's way
## of treating missing information. We can see that R is treating NA
## differently from 'ND' because the NA row is now greyed out. Let's
## update the dataset to reflect this correction:
dat <- (
dat
%>% mutate(ageYears = as.integer(ageYears)) # replace
)
## Now let's look more closely at the data in this column to verify
## that the values make sense:
## CONSOLE
print(
dat
%>% distinct(ageYears)
%>% arrange(ageYears)
)
## This time, instead of just looking at all the values, we've arranged
## them in ascending order, so we can easily see whether all of the
## numeric values are plausible. We can also look explicitly at the range
## of values given:
range(dat$ageYears,na.rm = T)
## Note that the na.rm argument tells the function whether to drop the
## missing values before showing us the range. We have set it to TRUE,
## so we see the range of values after the NAs have been removed.
## Now let's look at the column giving age in months.
mode(dat$ageMonths) # determine the storage mode of a variable
## The variable is stored as a numeric value, as seems appropriate.
## Let's also verify that this numeric value is not representing a
## factor:
class(dat$ageMonths) # determine the object class of a variable
## Great! The ageMonths column seems to be in better shape that the
## ageYears column was. Fill in the missing information in the command
## below to take a closer look:
## CONSOLE
print(
dat
%>% distinct(???) # View distinct values for month ## FIXME
%>% arrange(???) # Arrange values in ascending order ## FIXME
)
## All of the values are numeric, but at least one row has a value of 25,
## which is hard to interpret. In particualr, we need
## to be careful that we understand how the two age columns relate
## to each other, or we could run into trouble at the data analysis phase.
## Let's start by looking at the data with missing values for ageMonths. We
## can also look at the rows that have missing values for ageYears at the
## same time:
print(
dat
%>% filter(is.na(ageMonths)|is.na(ageYears)) # subset the rows that have NA for either age variable
)
## There is one individual with missing values, and they are missing both year
## and month information. While it's not ideal to have missing data, this is
## a reality in most datasets, and in this case it's not that bad.
## Now let's look at the ageMonths values for the subset of the
## data that has an ageYears value of 0:
print(
dat
%>% filter(ageYears==0) # subset the rows that have 0 for ageYears
%>% group_by(ageMonths) # group by ageMonths
%>% summarize(count = n()) # count occurrences of each ageYears value
)
## All ageMonths values for children with ageYears of 0 are between 0
## and 11 (inclusive). This seems like a good start. Now let's look the
## ageYears values for the rest of the data:
print(
dat
%>% filter(???) # subset the rows for ages of at least 1 year ## FIXME
%>% group_by(ageMonths) # group by ageMonths
%>% summarize(count = n()) # count occurrences of each ageMonths value
)
## There is only one row in the dataset where ageMonths is greater than 11,
## so let's take a closer look at that particular row:
print(
dat
%>% filter(ageMonths==25)
)
## How confusing! Both the ageYears and ageMonths values are 25.
## At this stage, we would like to go back to the source of the data
## and double-check the recorded age for this individual. Let's say we
## go back to the data source and learn that the individual had just
## turned 25, meaning that the months entry should be 0 and there was a data
## entry error where the age in years was accidentally entered in both
## columns. We can easily correct this.
## Let's first set up a test case on a simplified version of the dataset:
print(
dat
%>% select(ageYears,ageMonths) # reduce the number of columns for test case
%>% filter(ageYears==25) # subset to look only at the data for 25 year olds
%>% mutate( ageMonthsNEW = ifelse(ageYears == 25 & ageMonths == 25, 0, ageMonths)) # correct the typo
)
## It looks like the mutate command is appropriately correcting the typo, so we
## can now be comfortable replacing the values in the actual data frame:
dat <- (
dat
%>% mutate(ageMonths = ifelse(ageYears == 25 & ageMonths == 25, 0, ageMonths))
)
## Note that it is very difficult to pick up errors in the age variables if they
## don't fall outside the range expected for age or years. To reduce errors that
## are difficult to spot, data are often entered twice by different people
## and then values are compared to look for inconsistencies.
## Now that we have cleaned versions of both ageYears and ageMonths, let's put
## them together to create a new variable ageYearsContinuous that treats age as continuous.
## We can do this using the mutate() function that we saw above. Again, we'll
## first look at the output for a test case and the update the actual data frame:
## CONSOLE
print(
dat
%>% select(ageYears,ageMonths) # reduce the number of columns for test case
%>% mutate( ageYearsContinuous = ageYears + ageMonths/12)
)
## Scan the values of the columns to make sure the conversion has been done right.
## If you were doing a more complicated transformation, you would probably want to
## build in additional checks to identify potential errors.
## The transformation looks good, so let's update our data frame to add the new variable.
## Now that we have a new variable that contains all of the age information, we can also
## remove the original age variables, which are difficult to interpret.
dat <- (
dat
%>% mutate( ageYearsContinuous = ageYears + ageMonths/12) # create new variable
%>% select(-ageYears,-ageMonths) # remove original age variables
)
summary(dat)
ggplot(dat, aes(x=ageYearsContinuous)) + geom_histogram()
######################################################################
## PROBLEM
######################################################################
##
## So far we've been correcting errors in a somewhat adhoc manner. In
## the lecture, we saw how to do this is a more robust way
## using a correction table. Follow the examples from the lecture to
## correct the province names for Kasai Oriental and Kasai Occidental
## in the using the correction table kasaiCorrectionTable.csv
##
######################################################################
corTab <- read_csv('kasaiCorrectionTable.csv')
dat <- ... ## FIXME