## Tutorial 2 - More on Vectors, Data Frames, and Functions
## David M. Goehring 2004
## Juliet R.C. Pulliam 2008,2009
## Steve Bellan 2010, 2012
## Meaningful Modeling of Epidemiologic Data, 2012 AIMS, Muizenberg
######################################################################
## SECTION A. Accessing Vector Elements
######################################################################
## By the end of today you should…
## * Be able to retrieve useful subsets of your data
## * Understand more about data frames
## * Know the methods and uses of logical values in R
## * Be able to generate and use factors
## * Know how to write your own generic functions
####################
## Beyond Numbers: Relational and Logical Operations in R
####################
## So far everything you have done in R has involved numbers or
## vectors of numbers. To properly exploit R’s complexity, you need to
## become familiar with relational and logical operations in R.
## Relational operations work just like numerical operations, in terms
## of how they are processed. Return for a moment to our first
## calculation from the last tutorial, an addition problem:
3 + 4
## The analogous calculation of a single relational operation is
## something like
5 > 4
## "Is 5 greater than 4?" Yes. And R tells you that this is a TRUE
## statement. Or,
1 + 1 < 1
## Makes sense, right?
## The greater-than, >, and less-than, <, symbols are
## straightforward. Similarly, R has greater-than-or-equal-to and
## less-than-or-equal-to symbols, >= and <=, respectively.
## Slightly less intuitive are the relational operators for equality,
## ==, and inequality, !=. Try
x <- 4
x == 1 + 3
y <- x != 4
## This last example demonstrates that variables can hold logical
## values. These relational operators also operate on logical values,
## as in,
y == FALSE
## Logical operations are operations that only make sense when
## performed on TRUE and FALSE values. These will likely be familiar
## to you, the central operations being AND, OR, and NOT.
## The operators used in R are standard: &, |, and !,
## respectively. Let’s see them in action:
!TRUE
to.be <- FALSE
to.be | !to.be
FALSE & (TRUE | FALSE)
## By combining logical and relational operations, we can make complex
## inquiries about values.
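## A minimal sketch of such a combined inquiry. The variable name
## body.temp and the cutoff values are invented for illustration only.
body.temp <- 38.2
is.fever <- (body.temp > 37.5) & (body.temp <= 40)   # both conditions must hold
is.fever
out.of.range <- (body.temp < 35) | (body.temp > 42)  # either condition suffices
out.of.range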
## Note that variable names can be words and they can be as long as
## you want. Also note that the assignment operator can be used in the
## other direction, ->. While x, y, and z are easy to type, more
## memorable names such as weight or total, might be more useful for
## reminding you what values they hold.
## Hands off the keyboard! Pick up a writing implement…
## a <- TRUE != (4 > 3)
## b <- a | 1 + 1 == 4 - 2
## c <- !FALSE & (log(Inf) == Inf + 1)
## What do a, b & c equal? Now execute the commands and compare
## your answers.
## I have briefly mentioned that R has special values for infinity,
## Inf, not-a-number, NaN, and not-available, NA. I said these
## generally behave very sensibly: a mathematical operation on
## not-a-number is obviously not a number and so is returned as NaN.
## Things are less simple when using logical and relational
## operators. Consider 4 != NaN. In one respect, the answer perhaps
## should be TRUE; that is, 4 definitely isn’t equal to
## not-a-number. But, striving for consistency, R returns NA, much as
## it would for a mathematical operation. Even worse is the situation
x <- NaN
x == NaN
## You might think that this is a reasonable
## test for whether x has a numerical value, but it won’t work for the
## same reason mentioned above: the comparison returns NA. In general,
## keep this trickiness in mind and remember there is a special
## function is.na() for determining whether x is NA or NaN:
is.na(x)
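## A short sketch of how these special values behave in comparisons
## (the name special.vals is made up for this example). is.na() flags
## both NA and NaN, while is.nan() flags only NaN.
special.vals <- c(4, NA, NaN, Inf)
special.vals == 4        # comparisons involving NA or NaN give NA
is.na(special.vals)
is.nan(special.vals)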
## This is all getting thrown at you in very quick succession,
## especially if you do not have much experience programming in other
## languages. It is worth noting that information about these
## operations can be pulled up at any time by typing help("&") or
## help(">") or using the help() function with any of the other
## symbols used in these operations.
####################
## Vectors of Logical Values
####################
## As a shorthand, TRUE and FALSE can be entered as T and F. This
## allows for rapid entry of vectors of logical values, for example,
logical.vec <- c(T,T,F,T)
logical.vec
## Unfortunately, and rather inexplicably, T and F can be reassigned
## to any arbitrary values. This will render most code utterly
## unpredictable. So, never, never, never do this:
T <- 4 # REALLY BAD, BUT NO ERROR PRODUCED
## And, if you ever do something like this (though you shouldn’t!),
## make sure you quickly do this:
rm(T) ## which will set T (or F) back to its default logical value.
## Relational or logical operations also act on vectors to produce
## vectors of logical values, as in,
x <- rnorm(10)
x < 0
y <- (x > -.5) & (x < .5)
!y
## This will be especially handy when we look at the concept of
## indexing, below.
####################
## Generating Sequences
####################
## There are many occasions in R when you need a patterned sequence of
## numbers. As mentioned in the last tutorial, most counting can
## be accomplished by use of the seq() function. If you haven’t
## already done so, it is worth taking a look at the help-file on
## seq() because it has a few arguments that can make your life
## easier.
?seq
## For example, seq() can generate a vector of a certain length
## between certain endpoints by typing
x <- seq(0,1,length.out=20)
## giving you a vector of length 20 between 0 and 1, confirmable by
## typing
length(x)
## A very common need in R is to generate vectors with an interval of
## 1 between each element. R has a shorthand for this using the colon
## notation, as follows,
y <- 5:10
## generating a vector that counts from 5 to 10, inclusive. Note that
## : is applied before the arithmetic operators +, -, *, and / (though
## after ^ and a leading minus sign) in the order of operations.
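## A quick check of this precedence, as a sketch: the colon is applied
## before addition, so parentheses are needed to count up to a sum.
1:3 + 1      # same as (1:3) + 1
1:(3 + 1)    # counts from 1 to 4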
## Don’t underestimate the value of the colon notation. Even for
## typing a vector of length 2, like "(1,2)" or "(2,1)," using the c
## function to generate the vector is pretty tedious (e.g., c(1,2)).
## These vectors can be generated in three quick characters by typing
## 1:2 or 2:1, respectively. I will also point your attention to the
## rep() function, for repeating sequences, which can also save time.
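## A brief sketch of rep() in use; times and each are standard
## arguments of rep(), and the values here are arbitrary.
rep(1:3, times=2)             # repeats the whole sequence: 1 2 3 1 2 3
rep(1:3, each=2)              # repeats each element in turn: 1 1 2 2 3 3
rep(c(TRUE,FALSE), times=3)   # works for logical vectors as well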
####################
## Indexing
####################
## R has an incredibly useful way of accessing items from a
## dataset. Each item in a dataset has its own index, or numbered
## location, in the object’s structure. Square brackets are used to
## extract an item or items from a dataset, but it is crucial to
## understand that there are two completely distinct ways in which
## brackets are used to access items. I will consider the two methods
## for accessing a vector of length n in turn below.
## The first option: Logical
## Requirements: Logical vector of length n
## Use it for: Finding a subset of data based on a rule
## Logical indexing works as if you’ve asked your indexing vector the
## question, "Do you want this item?" for each of the items in the
## vector.
x <- 1:5
x[c(T,F,F,F,T)]
## If we combine this logical indexing with the relational and logical
## operators you learned above, we have an exceptionally powerful tool
## to retrieve data that meet any set of criteria.
y<-rnorm(10000)
hist(y[!((y>-2)&(y<0))])
## I will give more insight below when I discuss indexing data
## frames. Stay tuned.
## In any operation in R, vectors will be automatically repeated until
## they reach the necessary length for the operation to make
## sense. For example, note the results of
1:6 + 1:2
## The same repetition holds for logical vectors.
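## For instance (a small sketch with an arbitrary vector), a short
## logical index is recycled along the vector being indexed:
(11:16)[c(TRUE,FALSE,FALSE)]   # keeps the 1st and 4th elements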
## The second option: Numerical
## Requirements: Value or vector of any length with values
## (1 to n) OR (-n to -1)
## Use it for: Single item retrieval or shuffling, sorting, and repeating
## Accessing single items with brackets and a single index should be
## straightforward
x <- 3*(0:5)
x[4]
## One tedious way of creating a new vector of values from a vector’s
## elements would be
c(x[2],x[3],x[4]) #TEDIOUS
## So R makes it much easier by allowing a vector of indices to
## generate a vector. Thereby, the command above becomes
x[2:4]
## There is nothing preventing you from accessing any element any
## number of times.
x[c(2,2,2,5,5,5)]
## Additionally, R allows you to use negative indices, indicating
## which items you want to exclude, as in,
x[c(-1,-6)]
## This is fine and productive as long as you remember never to mix
## negative and positive indices – R will not know what you want it to
## do:
x[c(-1,4)] #BAD
####################
## Sorting
####################
## In Tutorial 1, you were introduced to the sort() function, which is
## handy.
## Now that you have been introduced to indexing, you may have an
## inkling of how much more powerful the sorting functions of R can
## become.
## As an introduction, let’s say you have a 4-element vector,
my.vector <- 5:8
## Using numerical indexing, we can manually re-order this vector by
## calling each of its indices once in our preferred order, for
## example
my.vector[c(2,3,4,1)]
## or, for a quick reversal
my.vector[4:1]
## Now, manually generating the vector of indices is not monumentally
## useful, which is where the function order() comes in. As
## demonstration, imagine we have a vector of student names and a
## corresponding vector of student heights (in meters).
stud.names <- c("Carol", "Walter", "Rachael", "Petunia", "Clark",
                "Justin")
stud.heights <- rnorm(6,1.7,.12)
## What we definitely don’t want to do is to perform sort() on each of
## these vectors independently. This will eliminate the pairing of the
## name to the height. So how can we sort one vector and have the
## other vector align correctly? Try order() on the names,
order(stud.names)
## Note that it returns the indices in the right order, not the values
## themselves.
## From what you learned above, you know it is now an easy matter to
## sort both of our vectors, as follows,
stud.names[order(stud.names)] #same effect as sort()
stud.heights[order(stud.names)]
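## As a sketch of one further option, order() accepts decreasing=TRUE
## for reverse ordering. The vectors below are made-up stand-ins so
## that the example runs on its own.
ex.names <- c("Bo", "Al", "Cy")
ex.scores <- c(10, 30, 20)
reverse.alpha <- order(ex.names, decreasing=TRUE)
ex.names[reverse.alpha]    # names in reverse alphabetical order
ex.scores[reverse.alpha]   # scores stay paired with their names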
## And, obviously, sorting the names by the heights is exactly
## analogous, and it will make for a pretty plot
barplot(stud.heights[order(stud.heights)],
        names.arg=stud.names[order(stud.heights)],
        ylab="Height (m)", main="Student Heights")
## I have conveniently skipped over an important concept, because R
## handles it fairly intuitively, but I want to mention the
## terminology. The variable stud.names and the results of ls(), for
## example, are called vectors of strings or "character vectors." R
## handles them conveniently, so we don’t need to worry too much about
## them, but knowing the terminology will improve your understanding
## of R’s in-line help documents.
######################################################################
## SECTION B. Data Frames, Redux
######################################################################
## Re-introduction to data frames
## Before we cover advanced topics of data frames, I wanted to point out the
## function data.frame() which puts data together to form data
## frames. This is a key alternative to using the prefab data frames
## that you used in last week’s assignment.
## First I want to generate a vector of student class-years to
## correspond to the stud.names before creating a data frame (Freshmen
## as 1, Sophomores as 2, etc.).
stud.years <- c(4,2,2,3,1,3)
## Now making a data frame is easy (each argument will just add more
## columns to the data), the only trick being that we have to assign
## the constructed data frame to a variable, as follows,
student.data <- data.frame(stud.heights,stud.years)
student.data
## Voila! Your own data frame. But, wait, where are the student names?
## And can we have better column headings than our redundant variable
## names?
## The answers lie in two new functions that we will use with
## assignment notation, names() and row.names(). Let’s take a look:
names(student.data)
row.names(student.data)
## What we see are vectors of strings corresponding to the columns and
## rows, respectively. We can change these by assigning replacement
## strings to the indexed values or by substituting our own vector of
## strings.
names(student.data)[1] <- "heights"
names(student.data)[2] <- "class.years"
row.names(student.data) <- stud.names
## The result is downright beautiful:
student.data
## The assignments above are the first of many examples in R that seem
## to defy logic: it seems as though we’re assigning something to a
## function, which shouldn’t make sense because a function isn’t a
## variable. In fact, you can think of the functions names() and
## row.names() as "access functions" – they do not perform an action,
## but merely grant access to a property of the argument variable, and
## this is why we can make assignments of the sort seen above.
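## A small sketch of the same idea on a made-up data frame (ex.frame
## and its column names are invented here): a whole vector of names
## can be replaced in one assignment.
ex.frame <- data.frame(a=1:3, b=c(2.5, 3.5, 4.5))
names(ex.frame) <- c("count", "mass")
row.names(ex.frame) <- c("first", "second", "third")
ex.frame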
## Now if you want to touch up your data at all, the edit() command
## may come in handy:
edited.student.data <- edit(student.data)
## Attempting to edit data in this way does not work in
## RStudio. You will have to edit the data frame directly.
## Also, take note that edit() will not automatically update the data
## frame itself (here student.data). That can only happen through an
## assignment.
## Indexing data frames
## As with vectors, brackets and logical or numerical vectors are still
## the way to access data frames, but with a slight complication,
## because data frames are multidimensional. The solution (which also
## holds for matrices, etc.) is to separate the two dimensions with a
## comma. R treats the first entry as the row number and the second
## entry as the column number; thus, to access the second column of
## the fourth row, type
student.data[4,2]
## Or the second column of the last three rows,
student.data[4:6,2]
## Not too tricky? There are two further complications.
## To access an entire row or entire column, leave the index blank, as
## in,
student.data[,1] #FIRST COLUMN
student.data[3,] #THIRD ROW
student.data[,] #ENTIRE FRAME, equivalent to "student.data"
## The only other complication is the ability to enter the names() or
## row.names() as indices:
student.data["Justin",]
## Putting all of this together, we can quickly generate subsets of
## our data:
tall.students <- student.data[student.data$heights >
                              mean(student.data$heights),]
## Or sort our data by various aspects:
student.data[order(student.data$class.years),]
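## A sketch of a subset built from two conditions, using a made-up
## data frame (ex.plants and its columns are invented for this
## example). Each condition yields a logical vector with one entry
## per row, and & keeps only the rows where both are TRUE.
ex.plants <- data.frame(height=c(12, 30, 25, 8), leaves=c(4, 9, 3, 2))
row.names(ex.plants) <- c("p1", "p2", "p3", "p4")
big.leafy <- ex.plants[ex.plants$height > 10 & ex.plants$leaves > 3, ]
big.leafy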
## Introduction to factors
## When performing statistical analyses, we often want R to look at a
## set of data and compare groups within the
## data to one another. For example, you have the data frame
## containing data on students in a course. There are two columns of
## data, heights and class.years. How can you look at the means of
## heights by class.years?
## Or, another example, you have sampled a number of rabbits and have
## a column for weights before a diet treatment and a column for
## weights after a diet treatment and a third column stating the diet
## treatment (e.g., "none," "grain diet," and "grapefruit diet"). How
## can you evaluate the change in weight as affected by diet?
## The answer to these questions is to use factors.
## Many of the datasets that come with R already have their data
## interpreted as factors. Let’s take a look at a dataset with
## factors:
data(moths, package="DAAG")
help(moths, package="DAAG")
moths
## (Note that you may have to install the DAAG package in order to
## load these data.) The help file tells us that our last column,
## habitat, is a factor. What does this mean?
## See what happens when we pull up this column by itself:
moths$habitat
## It looks pretty standard, at first, but then we notice that it is
## more than just a list of habitat names – it has another component,
## levels.
## Factors have levels. Levels are editable, independent of the data
## itself. To see the levels alone, you can type
levels(moths$habitat)
## When called that way, it has the identity of a vector of strings.
## The levels() function behaves just like the names() and row.names()
## functions (i.e., weird), and you can make assignments or
## reassignments to the levels
levels(moths$habitat)[1] <- "NEBank"
## Factors come in exceptionally handy when performing statistical
## tests, but the various plot functions can give you an idea of uses
## of a factored variable, such as,
boxplot(moths$meters ~ moths$habitat)
## The tilde, ~, used in a number of contexts in R, can generally be
## read as "by," which gives a general explanation of its use here –
## visualizing meters by habitat.
## Making a factor
## Now you know how to employ a factored variable, and
## the next step is to know how to make a factor out of a
## variable. The general syntax is:
x <- factor(c("A","B","A","A","A","B"))
## For vectors of strings like that one, the results are usually fine
## as is.
## But let’s go back to our student.data data frame. We listed
## class.years as a number 1 through 4, but these are discrete
## categories with well-defined names. A more elegant solution is to
## factor the column of the data frame, much like is seen with moths.
student.data$class.years <- factor(student.data$class.years)
levels(student.data$class.years)
## Not ideal, but we can use reassignment to change the names of the
## years.
levels(student.data$class.years) <- c("Freshman", "Sophomore","Junior", "Senior")
## With satisfying (preliminary) results available with:
student.data
boxplot(student.data$heights ~ student.data$class.years)
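## One way to get at the question posed earlier (mean height by class
## year) is tapply(), which applies a function to a vector within each
## level of a factor. This is a sketch on made-up values; ex.height
## and ex.year are invented names.
ex.height <- c(1.60, 1.70, 1.80, 1.90)
ex.year <- factor(c("Junior", "Senior", "Junior", "Senior"))
year.means <- tapply(ex.height, ex.year, mean)
year.means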
## In the tutorials, we have been using data contained within R’s
## packages; however, when working on your own research you will most
## likely want to read in a dataset of your own. The read.table()
## function, and a number of related functions designed for reading in
## data in a variety of formats, are essential for importing your own
## data. I suggest trying this out at some point, and I wanted to
## mention a convenient GUI tool for retrieving the data from your
## drive – incorporating file.choose() into the command, as follows.
roo <- read.table(file.choose(),header=TRUE)
## This will let you use your system’s familiar file-selection window
## to locate the data on your drive.
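## A self-contained sketch of the round trip, for when no data file is
## at hand: write a small made-up table to a temporary file, then read
## it back with read.table(). The names ex.out, ex.file, and ex.in are
## invented for this example.
ex.out <- data.frame(id=1:3, score=c(2.5, 3.0, 4.5))
ex.file <- tempfile(fileext=".txt")
write.table(ex.out, ex.file, row.names=FALSE)
ex.in <- read.table(ex.file, header=TRUE)
ex.in
unlink(ex.file)   # remove the temporary file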
####################
## Applying functions to data frames
####################
## Many functions you might like to apply to your data frames will
## produce unpredictable results.
## A few work nicely:
nyc.air <- airquality[,c("Wind","Temp")]
nyc.air
summary(nyc.air)
colMeans(nyc.air)
## But others that you might try do not work as you want:
mean(nyc.air) # returns NA with a warning in current versions of R
              # (older versions gave the column means)
sum(nyc.air) # sums wind and temperature together
var(nyc.air) # gives covariance in a matrix (which we haven’t studied
# yet)
## The solution to these troubles is to use the function sapply(),
## which applies the function named in the second argument to each
## column of the first argument, in a more predictable fashion than
## seen above.
sapply(nyc.air, sum)
sapply(nyc.air, var)
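## Any further arguments given to sapply() are passed along to the
## function. As a sketch: the Ozone and Solar.R columns of airquality
## contain missing values, so mean() needs na.rm=TRUE to return a
## number. The name ozone.sun is invented for this example.
ozone.sun <- airquality[,c("Ozone","Solar.R")]
sapply(ozone.sun, mean)               # NA for both columns
sapply(ozone.sun, mean, na.rm=TRUE)   # means of the non-missing values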
######################################################################
## SECTION C. Composing your own functions
######################################################################
## A more advanced (and very important) topic
## So far in R we have used the functions that come with R and its various
## packages. You have come up with methods for adjusting your data
## for visualization on your own, but you did this in many separate steps,
## each of which refer to the specific items you are manipulating. Since
## often you want to perform the same series of actions on different objects,
## R makes it relatively easy to compose your own generic functions and
## store them in R’s memory.
## Before you start writing a function you need to have your mind set
## on three things:
## * What you want to give the function as input
## * What you want the function to do
## * What you want the function to give as output
####################
## A trivial example
####################
## Imagine you need to repeatedly transform sets of
## data, but your transformation is "non-standard." For this example,
## I’m imagining that you want the natural logarithm of the data, plus
## one. We know how to perform these operations on a number we have
## stored in a variable, no problem,
x <- 1:10
log(x)+1
## But what we would really like is a named function which will do
## this in one step, log.plus.one().
## What we will do is make an assignment to log.plus.one, but rather
## than assigning a value (or vector, etc.), we assign a function
## which we define on the spot. We use the command function, which
## looks like a function but is not a function. What is function? It’s
## a control element of the R language – it isn’t executed like a
## function, but rather it informs R to treat the code around it in a
## special way.
## The command function has an interesting syntax. Its arguments are
## the names of variables which will serve as the arguments for your
## function (the first of three bullets, above). Then, after this
## parenthetical bit, comes the meat of the function – what you want
## it to do and what you want it to give back to you (the last two
## bullets, above). In our log.plus.one() case, what we want it to do
## and what we want it to give back happen to be the same thing,
## therefore we can define it very simply, as follows,
log.plus.one <- function(y) log(y)+1
## Cool! Let’s test it out:
log.plus.one(x)
## It behaves just like we would want it to.
####################
## A separate little world
####################
## Wait a second. I used y in my function definition but called the
## function with my variable x as the argument. What happened to y?
y
## The variable is untouched by the function.
## In order to keep functions fully generic, each time a function is
## called, R gives it a separate variable space of its own. The
## function’s arguments, and any variables assigned within the
## function, live only in that space and never overwrite variables in
## your R workspace. This means that their names can be anything you
## find convenient. (A function can still read a workspace variable
## that it does not define itself, so it is safest to pass in
## everything the function needs as an argument.)
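## A sketch of this separation (ex.total and add.ten are made-up
## names): the assignment inside the function does not change the
## workspace variable of the same name.
ex.total <- 5
add.ten <- function(v) {
    ex.total <- v + 10   # a local ex.total, visible only inside the function
    ex.total
}
add.ten(1)
ex.total                 # still 5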
####################
## Longer functions
####################
## Either because the function is too complex to be
## executed on a single line or because you want to make the
## function’s methods clearer, you will often generate functions
## longer than one line. For this purpose, R introduces another type
## of bracket, curly brackets, { }. These are control brackets, and
## indicate the contents should be treated as a unit.
## As a final example,
(function(x,y){z <- x^2 + y^2
x+y+z })(0:7, 1)
## Note that the function is written on two lines, but this isn’t an
## issue because of the brackets. Note also that this function is
## anonymous. It is never assigned, but used in place.
## A common tendency when first learning to program is to write code
## in a condensed form (such as the anonymous inline function defined
## above) so that it is difficult to follow what is going on when you
## return to the code later on (or when your instructor is helping you
## find a bug that is keeping your code from working correctly). While
## writing code in this way takes a certain amount of cleverness and
## demonstrates that you have understood the concepts, it is better
## practice to write out your code so that it is easy to follow. This
## includes using plenty of whitespace, to make your code easy to
## read, and thoroughly commenting your commands as you go.
## The example above is therefore better written as follows:
## SUM.VALS.PLUS.SUM.SQS() – function that takes two numerical values
## as input and returns the sum of the values plus the sum of their
## squares:
sum.vals.plus.sum.sqs <- function(x,y)
{
    z <- x^2 + y^2     # define z as the sum of the values’ squares
    return(x + y + z)  # add the values to the sum of their squares
                       # and return the result as output
}
## Perform the above function with x equal to the numbers from 0 to
## 7 and y equal to 1:
sum.vals.plus.sum.sqs(0:7,1)
######################################################################
######################################################################
## This concludes Tutorial 2. Because there are some advanced topics
## here that require practice to get your head around, you should
## make sure to work through the benchmark questions before you
## move on to Tutorial 3.
##
## Question 1:
##
## R sometimes uses confusingly similar names for distinct concepts.
## Define for yourself: names, factors, levels. When would you use each?
##
## Question 2:
##
## You need a subset of the mtcars dataset that has only every other
## row of data included.
## a. Do this with numerical indexing.
## b. Do this with logical indexing.
##
## Question 3:
##
## Write a function, jumble(), that takes a vector as an argument and
## returns a vector with the original elements in random order.
##
## Question 4:
##
## Write an anonymous inline function, applying it to a data frame with
## sapply().
##