---
title: "R Notes"
author: "Betsy Rosalen"
output:
  html_document:
    theme: cerulean
    # code_folding: show
    df_print: paged
    toc: true
    css: ./reports.css
---
```{r include=FALSE}
library(tidyverse)
library(DATA606)
```
# Data Types in R (vs. Python)
## Primitives
### All R primitives are technically vectors
R | Python | Notes
----- | ----- | -----
`character` | `str` |
`complex` | `complex` | includes imaginary numbers
`numeric` | `float` |
`integer` | `int` |
`logical` | `bool` |
Function | Description
----- | -----
`is.datatype` | will return TRUE or FALSE
`as.datatype` | will convert from the original datatype to the one specified
`class(x)` | will return the datatype of x
`is.na` | tests for NA (missing) values
`is.null` | tests for NULL values
Data can be coerced up this table, from `logical` at the bottom toward `character` at the top, generally without loss of precision, but not the other way around.
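A short sketch of that coercion order, assuming the hierarchy runs `logical`, `integer`, `numeric`, `complex`, `character`. The variable name `coerced_down` is illustrative only.

```{r}
# c() coerces mixed inputs to the most general type present
class(c(TRUE, 2L))          # "integer"
class(c(2L, 3.5))           # "numeric"
class(c(3.5, "a"))          # "character"

# Going up the hierarchy keeps the information
as.numeric(TRUE)            # 1
as.character(3.5)           # "3.5"

# Going back down can lose it
as.integer(3.9)             # 3 (the fractional part is dropped)
coerced_down <- suppressWarnings(as.numeric("abc"))
is.na(coerced_down)         # TRUE: text that is not a number becomes NA
```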
## Non-Primitives
R | Python | Notes
------------- | -------------------------- | --------------------------
`factor` | not built in (closest is pandas `Categorical`) | stores categorical data as a fixed set of levels, which can be ordered
`Date` | `datetime.date` |
`vector` | list (all same datatype) | all R primitives are technically vectors and can have length. Values in a vector must be all the same datatype
`list` | list |
`matrix` | not built in (NumPy 2-D array) | all data must be of the same type
`data.frame` | not built in (pandas `DataFrame`) | Each column can be a different datatype
Base R has no `is.date` function; use `inherits(x, "Date")` to test whether `x` is a date.
### Note about factors
The term factor refers to a statistical data type used to store categorical variables. The difference between a categorical variable and a continuous variable is that a categorical variable can belong to a limited number of categories. A continuous variable, on the other hand, can correspond to an infinite number of values.
It is important that R knows whether it is dealing with a continuous or a categorical variable, as the statistical models you will develop in the future treat both types differently.
# Entering Data in R
## Vectors
R Command | Description
----------------------- | --------------------------------
`x <- 0:10` | Assigns numbers 0 through 10 to x in a vector
`y <- c(1,2,5,3,7,8,4,9,0)` | Assigns the vector to y
`seq(from , to, by)` | generate a sequence<br>indices <- seq(1, 10, 2) <br># indices is c(1, 3, 5, 7, 9)
`rep(x, ntimes)` | repeat x n times<br>y <- rep(1:3, 2) <br># y is c(1, 2, 3, 1, 2, 3)
### Other Useful Functions for creating and working with Vectors
Function | Description
------------------ | ----------------------------------------
`names(v) <- c('one', 'two', 'three')` | Assigns names to the values in the vector
`names(vector)` | Returns the names of all the values in the vector
`names(v)[3]` | returns the name of the value in the third index of the vector
`v['one']` | returns the name and the value in the index named one
`v[c("Mon", "Tues", "Wed")]` | returns the name and the value in the indices named "Mon", "Tues", and "Wed"
`length(vector)` | returns the length of the vector
`cut(x, n)` | divides a continuous variable into a factor with n levels, e.g. `y <- cut(x, 5)`
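A minimal sketch of the naming and indexing functions in the table above. The vector `day_counts` and its values are made up for illustration.

```{r}
# Hypothetical daily counts, used only to illustrate named vectors
day_counts <- c(12, 7, 9)
names(day_counts) <- c("Mon", "Tues", "Wed")
day_counts["Tues"]              # name and value
day_counts[c("Mon", "Wed")]     # several names at once
unname(day_counts["Tues"])      # value only
names(day_counts)[3]            # "Wed"
length(day_counts)              # 3
```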
### Creating a selection vector
```{r}
vector <- c(1, 2, -4, 5, -6)
selection_vector <- vector > 0
selection <- vector[selection_vector]
selection
```
## Factors
To create a factor, first create a vector with all your values, then use the `factor()` function to convert it to a factor. To set the levels of an ordinal categorical value while you are creating a factor, use
```
factor(vector, ordered = TRUE, levels = c("Low", "Medium", "High"))
```
To set the levels after the factor is already created, use
```
levels(factor) <- c("name1", "name2",...)
```
You can also use this to change the names of the levels. Watch out: the order in which you assign the levels matters. Alternatively you can specify the associations like this:
```
# a named list maps each new level name to the existing level it replaces
levels(factor) <- list(Female = "F", Male = "M")
```
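The following sketch puts the factor steps together. The data and the names `temp_fac` and `sex_fac` are hypothetical. It assumes the named-list form of `levels<-`, where each list name is the new level and each value is the existing level it replaces.

```{r}
# Hypothetical survey responses, used only for illustration
temp_vec <- c("High", "Low", "Medium", "Low", "High")
temp_fac <- factor(temp_vec, ordered = TRUE, levels = c("Low", "Medium", "High"))
temp_fac
temp_fac[1] > temp_fac[2]    # TRUE: comparisons work because the factor is ordered

sex_fac <- factor(c("M", "F", "F", "M"))
levels(sex_fac)              # alphabetical by default: "F" "M"
levels(sex_fac) <- list(Female = "F", Male = "M")
sex_fac
```

Renaming by position with `levels(sex_fac) <- c("Female", "Male")` gives the same result here, but only because `"F"` sorts before `"M"`.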
## Lists
Lists can contain anything! (Just like Python)
```{r}
# Vector with numerics from 1 up to 10
my_vector <- 1:10
# Matrix with numerics from 1 up to 9
my_matrix <- matrix(1:9, ncol = 3)
# First 10 elements of the built-in data frame mtcars
my_df <- mtcars[1:10,]
# names are optional but useful!
my_list <- list(vec = my_vector, mat = my_matrix, df = my_df)
my_list
```
Extracting a single element from a list needs double brackets; single brackets return a sub-list.
```{r}
my_list[[2]]
```
Other indexing syntax...
```{r}
my_list[["vec"]]
my_list$df
my_list[['df']][2:3,]
# chain select by names
my_list[[c("df", "mpg")]]
```
If you have a list `a` and want to add a list `b` to it as a single (nested) element, you can use `c(a, list(b))`
```{r}
a <- list(1,2,3)
b <- list(4,5,6)
c <- c(a, list(b))
c
```
If you have a list `a` and want to add each element of list `b` to it, you can use `c(a, b)`
```{r}
d <- c(a, b)
d
```
To select an item in a list inside another list use chained selection...
```{r}
c[[c(4,2)]]
```
```{r}
x <- list(a = list(d=1,e=10,f=100), b = list(d=2,e=20,f=200), c = list(d=3,e=30,f=300))
x[["a"]]
`[[`(x, "a")
lapply(x, `[[`, "f")
```
## Matrices and Data Frames
```{r}
# Dataframes can store vectors of different types
x <- 1:3
y <- 4:6
z <- c('seven', 'eight', 'nine')
df <- data.frame(x, y, z, stringsAsFactors = FALSE)
df
names(df) <- c('one', 'two', 'three')
df
# Matrices must be all of the same type
v <- 7:9
c <- c("one", "two", "three")
mat <- matrix(c(x, y, v), byrow = TRUE, nrow = 3)
mat
mat2 <- matrix(c(c, z), byrow = TRUE, nrow = 2)
mat2
names(mat) <- c('one', 'two', 'three')
mat
```
Function | Description
------------------ | --------------------------------------------
`nrow(dataframe)` | returns the number of rows
`ncol(dataframe)` | returns the number of columns
`str(dataframe)` | returns the structure of the dataframe
`dim(dataframe)` | Returns the dimensions of the dataframe
`head(df, 3)` | returns the first 3 rows of the dataframe
`tail(df, 5)` | returns the last 5 rows of the dataframe
`names(dataframe)` | Returns the names of the columns or variables in the dataframe
`names(df)[3]` | returns the third column name
`names(df) <- c('one', 'two', 'three')` | Assigns names to the columns in the dataframe or **values** in a matrix
`rownames(matrix_df) <- row_names_vector` | Assigns names to the rows in the matrix/dataframe
`colnames(matrix_df) <- col_names_vector` | Assigns names to the columns in the matrix/dataframe
`rownames(dataframe)` | Returns the names of the rows in the dataframe
`colnames(dataframe)` | Returns the names of the columns in the dataframe
`rownames(df) <- NULL` | resets to generic index names
`colnames(df) <- NULL` | resets to generic names
`rowSums(df)` | Just what it sounds like
`colSums(df)` | Just what it sounds like
`rbind(df, df2)` | combines two dataframes or vectors adding the second one as additional rows to the first
`cbind(df, df2)` | combines two dataframes or vectors adding the second one as additional columns to the first
`dataframe$variable` | Returns all of the values in the specified variable as a ***vector***
`df$totals <- df$var1 + df$var2` | Creates a new column and puts the total of var1 and var2 in that column
`df[which.max(df$var),]` | Finds the row with max in specified variable column
### Note about multiplying matrices
You can multiply each element in a matrix by the corresponding element in another matrix using regular operations, i.e. `matrix1 * matrix2`
This is not the standard matrix multiplication for which you should use `%*%` in R.
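A small sketch of the difference between the two operators. The matrices `m1` and `m2` are made up for illustration.

```{r}
m1 <- matrix(1:4, nrow = 2)   # columns are (1, 2) and (3, 4)
m2 <- matrix(5:8, nrow = 2)   # columns are (5, 6) and (7, 8)
elementwise <- m1 * m2        # multiplies matching positions
matrix_prod <- m1 %*% m2      # rows of m1 times columns of m2
elementwise
matrix_prod
```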
## data.table
```{r}
library(data.table, quietly = TRUE)
x <- 1:3
y <- 4:6
z <- c('seven', 'eight', 'nine')
DT <- data.table(x, y, z) # strings are automatically characters not factors
DT
```
data.table 1.10.4.3
The fastest way to learn (by data.table authors): <https://www.datacamp.com/courses/data-analysis-the-data-table-way>
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: <http://r-datatable.com>
Accessing data is a little different than with a data.frame.
Function | Description
----- | -----
`DT[1:5, ]` | rows 1 - 5
`DT[A>=7, ]` | All rows where column A >= 7
`DT[ , B]` | only column B
`DT[ , list(B, D)]` | only columns B and D
# Variables
## Assignment
`x <- 5` or `5 -> x`
`a <- b <- 36`
`assign('y', 42)`
Variable names can use any combination of alphanumeric characters, periods and underscores, but they cannot *start* with a number or underscore.
Note - single and double quotes can be used interchangeably like in Python.
## Removing Variables
`remove(var)` or `rm(var)`
# Tables
## Contingency Tables
```{r, eval=FALSE}
devtools::install_github("seankross/lego")
```
```{r}
library(lego)
data(legosets)
table(legosets$Availability, useNA='ifany')
```
```{r}
table(legosets$Availability, legosets$Packaging, useNA='ifany')
```
## Proportional Tables {.flexbox .vcenter}
```{r}
prop.table(table(legosets$Availability))
```
# Base Graphics
## Bar Plots {.flexbox .vcenter}
Good for Categorical Variables
Regular Plot
```{r}
barplot(table(legosets$Availability), las=3)
```
Proportional Plot
```{r}
barplot(prop.table(table(legosets$Availability)), las=3)
```
## Line Graph - Coin Tosses {.flexbox .vcenter}
```{r, fig.width=8, fig.height=3.5}
# Plot cumulative outcome of 1000 coin tosses
coins <- sample(c(-1,1), 1000, replace=TRUE)
plot(1:length(coins), cumsum(coins), type='l')
abline(h=0)
```
## Line Graph - Coin Tosses (Full Range) {.flexbox .vcenter}
```{r, fig.width=8, fig.height=3.5}
# same exact plot but change the y axis to show the total range of possibilities
plot(1:length(coins), cumsum(coins), type='l', ylim=c(-1000, 1000))
abline(h=0)
```
## Coin Tosses Revisited {.flexbox .vcenter}
```{r}
# Plot cumulative outcome of 100 coin tosses
coins <- sample(c(-1,1), 100, replace=TRUE)
plot(1:length(coins), cumsum(coins), type='l')
abline(h=0)
# Value at the end
cumsum(coins)[length(coins)]
```
## Many Random Samples
```{r}
# Loop to do the same as above 1000 times and record the ending value of each run of 100 coin tosses.
samples <- rep(NA, 1000)
for(i in seq_along(samples)) {
  coins <- sample(c(-1,1), 100, replace=TRUE)
  samples[i] <- cumsum(coins)[length(coins)]
}
head(samples, 30)
mean(samples)
```
## Mosaic Plot
For two or three Categorical Variables
```{r, message=FALSE}
library(vcd)
mosaic(HairEyeColor, shade=TRUE, legend=TRUE)
```
## Dot Plot {.flexbox .vcenter}
For quantitative variables
```{r, fig.height=2.5}
stripchart(legosets$Pieces)
```
For quantitative variable grouped by a categorical variable
```{r, fig.height=4}
par.orig <- par(mar=c(1,10,1,1))
stripchart(legosets$Pieces ~ legosets$Availability, las=1)
par(par.orig)
```
## Histograms {.flexbox .vcenter}
```{r}
hist(legosets$Pieces)
```
## Transformations
With highly skewed distributions, it is often helpful to transform the data. The log transformation is a common approach, especially when dealing with salary or similar data.
```{r}
hist(log(legosets$Pieces))
```
## Evaluating Normal Approximation
The histogram looks roughly normal, but we can overlay a normal curve with the same mean and standard deviation as the data to help evaluate the fit.
```{r, echo=FALSE, results='hide'}
heights <- c(180.34, 170.18, 175.26, 177.8, 172.72, 160.02, 172.72, 182.88, 177.8, 177.8, 167.64, 180.34, 180.34, 172.72, 165.1, 154.94, 180.34, 172.72, 165.1, 167.64, 182.88, 175.26, 182.88, 177.8, 175.26, 185.42, 175.26, 167.64, 187.96, 175.26, 180.34, 175.26, 198.12, 177.8, 185.42, 175.26, 180.34, 187.96, 182.88, 187.96, 177.8, 182.88, 187.96, 170.18, 182.88, 182.88, 175.26, 170.18, 182.88, 180.34, 180.34, 170.18, 180.34, 187.96, 193.04, 175.26, 193.04, 182.88, 177.8, 167.64, 170.18, 160.02, 172.72, 193.04, 187.96, 190.5, 172.72, 175.26, 193.04, 180.34, 162.56, 187.96, 182.88, 180.34, 177.8, 172.72, 185.42, 180.34, 180.34, 182.88, 185.42, 180.34, 195.58, 185.42, 170.18, 170.18, 172.72, 180.34, 190.5, 172.72, 182.88, 170.18, 177.8, 175.26, 162.56, 162.56, 175.26, 167.64, 170.18, 177.8)/2.54
```
```{r, fig.width=8, fig.height=4}
h <- hist(heights, xlim=c(60, 80))
x <- seq(min(heights)-5, max(heights)+5, 0.01)
y <- dnorm(x, mean(heights), sd(heights))
y <- y * diff(h$mids[1:2]) * length(heights)
lines(x, y, lwd=1.5, col='blue')
```
## Normal Q-Q Plot
```{r, fig.width=5, fig.height=5}
qqnorm(heights, cex=0.5, main='', axes=F, ylab='Male heights (in)', pch=19)
axis(1)
axis(2)
abline(mean(heights), sd(heights), col="blue", lwd=1.5)
```
## Normal Q-Q Plot with simulations
```{r, fig.width=6, fig.height=5}
qqnorm(samples)
DATA606::qqnormsim(samples)
```
## Normal Plot
```{r, fig.width=10, fig.height=5}
normal_plot(mean = 0, sd = 1, cv = c(-1, 1))
```
## Density Plots
```{r}
plot(density(legosets$Pieces, na.rm=TRUE), main='Lego Pieces per Set')
```
## Density Plot (log transformed)
```{r}
plot(density(log(legosets$Pieces), na.rm=TRUE), main='Lego Pieces per Set (log transformed)')
```
## Box Plots
For quantitative variables
```{r}
scores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)
boxplot(scores, horizontal = TRUE)
```
```{r, fig.width=3}
boxplot(legosets$Pieces)
```
```{r, fig.width=3}
boxplot(log(legosets$Pieces))
```
## Scatter Plots
```{r, fig.height=5}
plot(legosets$Pieces, legosets$USD_MSRP)
```
## Examining Possible Outliers (expensive sets)
```{r}
legosets[which(legosets$USD_MSRP >= 400),]
```
## Examining Possible Outliers (big sets)
```{r}
legosets[which(legosets$Pieces >= 4000),]
```
```{r, fig.height=5}
plot(legosets$Pieces, legosets$USD_MSRP)
bigAndExpensive <- legosets[which(legosets$Pieces >= 4000 | legosets$USD_MSRP >= 400),]
text(bigAndExpensive$Pieces, bigAndExpensive$USD_MSRP, labels=bigAndExpensive$Name)
```
# Writing functions and conditional statements
## If Else
```
# If statement alone
if (condition) {
  # code to run when condition is TRUE
}
# If Else statement
if (condition) {
  # code to run when condition is TRUE
} else {
  # code to run when condition is FALSE
}
# If, Else If, Else statement
if (condition1) {
  # code to run when condition1 is TRUE
} else if (condition2) {
  # code to run when condition1 is FALSE but condition2 is TRUE
} else {
  # code to run when both conditions are FALSE
}
```
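A runnable version of the three-branch pattern, as a sketch. The variable `temperature` and its cut-offs are invented for illustration.

```{r}
# Hypothetical value, chosen only to exercise the middle branch
temperature <- 18
if (temperature > 25) {
  weather <- "hot"
} else if (temperature > 10) {
  weather <- "mild"
} else {
  weather <- "cold"
}
weather

# ifelse() is the vectorised counterpart: one result per element
ifelse(c(30, 18, 2) > 10, "above 10", "10 or below")
```

Note that `if` expects a single TRUE or FALSE, which is why `ifelse()` is the usual choice when testing a whole vector.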
## While Loop
```
while (condition) {
expr
increment
}
```
Can nest if statements inside
```
while (condition) {
  if (inner_condition) {
    expr
  }
  increment
}
```
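A concrete instance of the while template, as a sketch. The names `doubling_value` and `doubling_steps` are illustrative only.

```{r}
# Count how many doublings it takes for 1 to exceed 100
doubling_value <- 1
doubling_steps <- 0
while (doubling_value <= 100) {
  doubling_value <- doubling_value * 2    # expr
  doubling_steps <- doubling_steps + 1    # increment
}
doubling_steps    # 7
doubling_value    # 128
```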
## For Loop
```
for (var in seq) {
  expr
}
```
```{r}
vec <- c(2, 3, 5, 7, 11, 13)
# Option 1: loop over the values
for (el in vec) {
  print(el)
}
# Option 2: loop over the indices
for (i in seq_along(vec)) {
  print(vec[i])
}
# To access and change elements in place you need the index, so use the Option 2 approach.
vec2 <- as.data.frame(vec)
vec2$date <- as.Date(NA)
for (i in seq_len(nrow(vec2))) {
  vec2$date[i] <- Sys.Date()
}
vec2
```
```
for (var in seq) {
  if (condition) {
    next # skips to the next iteration if condition is met
  }
  expr
}
```
## Writing Custom Functions
```
my_func <- function(arg1, arg2=DEFAULT){
code
}
```
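A minimal sketch of the template above with a default argument. The function `raise_to` is a hypothetical helper, not part of any package.

```{r}
# Hypothetical helper: raise x to a power, squaring by default
raise_to <- function(x, power = 2) {
  x ^ power    # the value of the last expression is returned
}
raise_to(3)              # 9, uses the default power
raise_to(3, power = 3)   # 27
raise_to(1:4)            # vectorised: 1 4 9 16
```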
## lapply()
Use to iterate without a for loop
```
# Can take a list or a vector as input
lapply(iterator, function)
# Always returns a list.
# If you don't want a list, do this...
unlist(lapply(iterator, function))
# Returns a vector
# To use lapply with a function that takes more than one argument
lapply(iterator, function, arg)
```
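The pattern above, sketched with real calls. The list `word_list` is made up for illustration.

```{r}
word_list <- list("apple", "kiwi", "banana")
lapply(word_list, nchar)                    # a list of three integers
unlist(lapply(word_list, nchar))            # 5 4 6
# Extra arguments after the function are passed on to it
unlist(lapply(word_list, substr, 1, 3))     # "app" "kiw" "ban"
```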
## sapply()
```
# only use if all items are of the same type.
# returns a named vector (unlists automatically)
sapply(vector, function, USE.NAMES=FALSE)
# USE.NAMES arg to get an unnamed vector
```
If each call returns a vector of the same length (greater than one), `sapply()` returns a matrix with one column per item.
If the calls return vectors of different lengths, the result cannot be simplified and `sapply()` returns a list.
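A sketch of the three return shapes of `sapply()`. The inputs are made up for illustration.

```{r}
fruit_names <- c("apple", "kiwi", "banana")
sapply(fruit_names, nchar)                      # named vector
sapply(fruit_names, nchar, USE.NAMES = FALSE)   # unnamed: 5 4 6
# Each call returns two values, so the result is a 2 x 3 matrix
square_table <- sapply(c(1, 2, 3), function(n) c(n, n^2))
square_table
# Results of different lengths cannot be simplified, so a list comes back
ragged <- sapply(1:3, seq_len)
ragged
```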
# Built-in Functions
This section taken from <https://www.statmethods.net/management/functions.html> with lots of additions by me.
## Getting Started and Finding Help
**Tip:** If you use the up and down arrow keys, you can scroll through your previous commands, your so-called command history. You can also access it by clicking on the history tab in the upper right panel. This will save you a lot of typing in the future.
R Command | Description
-------------------- | ----------------------------------------
`ctrl + L` | to clear console
`ls()` | list the objects in memory to the console
`library()` | Lists the packages in your library
`search()` | Shows packages that are currently active
`install.packages("package_name")` | installs package
`library(package_name)` or <br>`require(package_name)` | loads package into memory
`?function_name` | Displays the documentation for the function in the viewer window in RStudio
`apropos('func')` | to search for a function by only part of the name
`args(func)` | To get information about the function arguments
`search()` | To see a list of loaded packages (when you load a package you are adding it to your search list)
`getwd()` | get the working directory
`setwd("file/path")` | set the working directory (use forward slashes or double backslashes in the path)
`vignette(package="package_name")` | To see a list of 'vignettes' or sample code for a package
`vignette("vin_name", package="pack_name")` | to view a specific vignette
`data(package="package_name")$results` | to see the datasets that come with a package
`data(dataset_name)` | to load data - note it does not show up in the "Data" section of your environment in RStudio until you use the data in another function like `head(data)` or `dim(data)`
#### If you can't install packages reset your default CRAN Mirror
`options(repos = c(CRAN = "http://cran.rstudio.com"))`
**Note:** Putting parentheses around an assignment prints its value, so `(x <- 5)` is equivalent to running `x <- 5` and then `print(x)`
## Numeric Functions
Function | Description
--------------- | ----------------------------------------
`+`, `-`, `*`, `/` | addition, subtraction, multiplication, division
`x %% y` | modulo or remainder of division of x by y
`x %/% y` | integer division - number of times y goes into x without remainder
`x ^ y` (or `x ** y`) | exponentiation - x raised to the power y
`abs(x)` | absolute value
`sqrt(x)` | square root
`ceiling(x)` | ceiling(3.475) is 4
`floor(x)` | floor(3.475) is 3
`trunc(x)` | trunc(5.99) is 5
`round(x, digits=n)` | round(3.475, digits=2) is 3.48
`signif(x, digits=n)` | signif(3.475, digits=2) is 3.5
`cos(x)`, `sin(x)`, `tan(x)` | also `acos(x)`, `cosh(x)`, `acosh(x)`, etc.
`log(x)` | natural logarithm
`log10(x)` | common logarithm
`exp(x)` | e^x
## Logical Operators
Operator | Description
---------- | ---------------------------------------------
`<` | less than
`<=` | less than or equal to
`>` | greater than
`>=` | greater than or equal to
`==` | exactly equal to
`!=` | not equal to
`!x` | Not x
`x | y` | x OR y
`x & y` | x AND y
`x %in% c(a, b, c)` | TRUE if x is in the vector c(a, b, c)
`isTRUE(x)` | test if x is TRUE
`any(v1 < v2)` | checks if any item in a vector is less than the corresponding item in a second vector
`all(v1 < v2)` | checks if all items in a vector are less than the corresponding items in a second vector
`identical(x, y)` | checks if the two items are identical
## Character Functions
Function | Description
--------------- | ----------------------------------------
`nchar(x)` | returns the number of characters in x (works on character and numeric datatypes even within vectors, will not work on factors)
`toupper(x)` | Uppercase
`tolower(x)` | Lowercase
`substr(x, start=n1, stop=n2)` | Extract or replace substrings in a character vector.<br>x <- "abcdef" <br>substr(x, 2, 4) is "bcd" <br>substr(x, 2, 4) <- "22222" is "a222ef"
`grep(pattern, x , ignore.case=FALSE, fixed=FALSE)` | Search for pattern in x. If fixed=FALSE then pattern is a regular expression. If fixed=TRUE then pattern is a text string. Returns matching indices.<br>grep("A", c("b","A","c"), fixed=TRUE) returns 2
`sub(pattern, replacement, x, ignore.case =FALSE, fixed=FALSE)` | Find pattern in x and replace with replacement text. If fixed=FALSE then pattern is a regular expression.<br>If fixed = T then pattern is a text string. <br>sub("\\s",".","Hello There") returns "Hello.There"
`gsub(pattern, replacement, x)` | Same as sub but replaces all occurrences, not just the first, in each element of x
`strsplit(x, split)` | Split the elements of character vector x at split. Returns a list with one element per string in x.
`strsplit("abc", "")` | returns a list whose single element is the 3 element vector "a","b","c"<br>
`paste(..., sep=" ")` | Concatenate strings, using the sep string to separate them (the default is a single space).<br>paste("x",1:3,sep="") returns c("x1","x2","x3")<br>paste("x",1:3,sep="M") returns c("xM1","xM2","xM3")<br>paste("Today is", date())<br>paste(Year, Month, DayofMonth, sep="-")
## Basic Statistical Functions
Basic statistical functions are provided in the following table. Most have the option `na.rm` to strip missing values before calculations. Otherwise the presence of missing values will lead to a missing result. Object can be a numeric vector or data frame.
Function | Description
-------------------- | ----------------------------------------
`min(x)` | minimum
`max(x)` | maximum
`sum(x)` | sum
`cumsum(x)` | running total (cumulative sum)
`diff(x)` | difference
`range(x)` | range
`mean(x, trim=0, na.rm=FALSE)` | mean of object x<br># trimmed mean, removing any missing values and <br># 5 percent of highest and lowest scores <br>mx <- mean(x,trim=.05,na.rm=TRUE)
`median(x)` | median
`var(x)` | variance
`sd(x)` | standard deviation of object(x). <br>also look at var(x) for variance and mad(x) for median absolute deviation.
`summary(x)` | Returns Min, 1st Qu., Median, Mean, 3rd Qu., Max and the number of missing values, but no SD. Can be used with factors (it counts each level), but for character vectors it only reports length, class and mode
`quantile(x, probs)` | quantiles where x is the numeric vector whose quantiles are desired <br>and probs is a numeric vector with probabilities in [0,1].<br># 30th and 84th percentiles of x<br>y <- quantile(x, c(.3,.84))
`fivenum(x)` | min, 1st, 2nd, 3rd Quartiles, and Max
`IQR(x)` | Spread between 25th and 75th percentile
`rank(x)` | takes a group of values and calculates the rank of each value within the group
`diff(range(x))` | total range of vector x
`diff(x, lag=1)` | lagged differences, with lag indicating which lag to use
`scale(x, center=TRUE, scale=TRUE)` | column center or standardize a matrix.
**NOTE:** adding `na.rm=TRUE` will ignore missing values in most functions above
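A quick sketch of what `na.rm` changes. The vector `na_demo` is made up for illustration.

```{r}
na_demo <- c(4, 8, NA, 15)
mean(na_demo)                  # NA: one missing value makes the result missing
mean(na_demo, na.rm = TRUE)    # 9
sum(na_demo, na.rm = TRUE)     # 27
range(na_demo, na.rm = TRUE)   # 4 15
```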
### The `psych` Package
```{r, message=FALSE, warning=FALSE}
library(psych)
describe(legosets$Pieces, skew=FALSE)
describeBy(legosets$Pieces, group = legosets$Availability, skew=FALSE, mat=TRUE)
```
## Statistical Probability Functions
The following table describes functions related to probability distributions. For random number generators below, you can use set.seed(1234) or some other integer to create reproducible pseudo-random numbers.
Function | Description
-------------------- | ----------------------------------------
`dt(x, df)` | density of the Student t distribution with df degrees of freedom at x
`pt(x, df)` | # Example 1: Find the area to the left of a t-statistic with value of -0.785 and 14 degrees of freedom.<br>pt(-0.785, 14)<br><br># Example 2: Find the area to the right of a t-statistic with value of -0.785 and 14 degrees of freedom.<br>#the following approaches produce equivalent results<br># 1 minus area to the left<br>1 - pt(-0.785, 14)<br><br># area to the right<br>pt(-0.785, 14, lower.tail = FALSE)<br><br>pt(t-score) = probability (p-value)
`qt(p, df)` | #find the t-score of the 99th percentile of the Student t distribution with df = 20<br>qt(.99, df = 20)<br>#find the t-score of the 95th percentile of the Student t distribution with df = 20<br>qt(.95, df = 20)<br>qt(p-value) = t-test_statistic (t-score)
`rt(n, df)` | n random draws from the Student t distribution with df degrees of freedom
`dnorm(x)` | normal density function (by default m=0 sd=1)<br># plot standard normal curve<br>x <- pretty(c(-3,3), 30)<br>y <- dnorm(x)<br>plot(x, y, type='l', xlab="Normal Deviate", ylab="Density", yaxs="i")
`pnorm(q)` | cumulative normal probability for q <br>(area under the normal curve to the left of q)<br>`pnorm(1.96)` is 0.975<br>pnorm(z-score) = probability (p-value)
`qnorm(p)` | normal quantile. <br>value at the p percentile of normal distribution <br>`qnorm(.9)` is 1.28 # 90th percentile<br>qnorm(p-value) = z-score
`rnorm(n, m=0,sd=1)` | n random normal deviates with mean m and standard deviation sd. <br># 50 random normal variates with mean=50, sd=10<br>x <- rnorm(50, m=50, sd=10)
`dbinom(1, 4, 0.35)` | The **probability** of getting exactly one success in 4 trials with 0.35 probability of success
`choose(4,1)` | The **number of ways** to get 1 success in 4 trials - computes the combination $_4C_1$
`dbinom(x, size, prob)`<br>`pbinom(q, size, prob)`<br>`qbinom(p, size, prob)`<br>`rbinom(n, size, prob)` | binomial distribution where size is the number of trials and prob is the probability of a heads<br># prob of 0 to 5 heads of fair coin out of 10 flips<br>dbinom(0:5, 10, .5)<br># prob of 5 or fewer heads of fair coin out of 10 flips<br>pbinom(5, 10, .5)
`dpois(x, lambda)`<br>`ppois(q, lambda)`<br>`qpois(p, lambda)`<br>`rpois(n, lambda)` | Poisson distribution with mean = variance = lambda<br># probability of 0, 1, or 2 events with lambda=4<br>dpois(0:2, 4)<br># probability of at least 3 events with lambda=4 <br>1 - ppois(2,4)
`dunif(x, min=0, max=1)`<br>`punif(q, min=0, max=1)`<br>`qunif(p, min=0, max=1)`<br>`runif(n, min=0, max=1)` | uniform distribution, follows the same pattern as the normal distribution above. <br># 10 uniform random variates<br>x <- runif(10)
Note that while the examples on this page apply functions to individual variables, many can be applied to vectors and matrices as well.
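A sketch that checks a few of the table entries against each other. The names `binom_direct` and `binom_by_hand` are illustrative only.

```{r}
# Exactly 1 success in 4 trials with success probability 0.35
binom_direct  <- dbinom(1, 4, 0.35)
binom_by_hand <- choose(4, 1) * 0.35^1 * (1 - 0.35)^3
all.equal(binom_direct, binom_by_hand)    # TRUE

# pbinom() is the running total of dbinom()
all.equal(pbinom(5, 10, 0.5), sum(dbinom(0:5, 10, 0.5)))    # TRUE

# P(at least 3 events) for a Poisson distribution with lambda = 4
1 - ppois(2, 4)
ppois(2, 4, lower.tail = FALSE)    # same value
```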
### Combinations vs. Permutations
A combination does not take into account the order, whereas a permutation does. Using the example from mathsisfun.com:
- A fruit salad is a combination of apples, bananas and grapes, since it's the same fruit salad regardless of the order of fruits
- To open a safe you need the right order of numbers, thus the code is a permutation
[Great explanation of combinations and permutations including how to calculate in R](https://davetang.org/muse/2013/09/09/combinations-and-permutations-in-r/)
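A small base-R sketch of the two counts, assuming we pick 3 items out of 5. The fruit names are made up for illustration.

```{r}
# Choosing 3 fruits from 5 when order does not matter (combinations)
choose(5, 3)                        # 10
# Arranging 3 of 5 digits when order matters (permutations): 5! / (5 - 3)!
factorial(5) / factorial(5 - 3)     # 60
# Each combination of 3 items can be ordered in 3! ways
choose(5, 3) * factorial(3)         # 60
# combn() lists the combinations themselves, one per column
fruit_combos <- combn(c("apple", "banana", "grape", "kiwi", "pear"), 3)
ncol(fruit_combos)                  # 10
```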
### The R probability functions
Really good explanation at [seankross.com/notes/dpqr/](http://seankross.com/notes/dpqr/)
<https://www.unc.edu/courses/2008fall/ecol/563/001/images/lectures/lecture3/lecture3.htm#probfunc>
![Fig. 3 The four probability functions for the normal distribution](Images/RNormFunctions.jpg)
There are four basic probability functions for each probability distribution in R. R's probability functions begin with one of four prefixes: d, p, q, or r, followed by a root name that identifies the probability distribution. For the normal distribution the root name is "norm". The meaning of these prefixes is as follows.
- **d** is for "density" and the corresponding function returns the value from the probability density function (continuous) or probability mass function (discrete).
- **p** is for "probability" and the corresponding function returns a value from the cumulative distribution function.
- **q** is for "quantile" and the corresponding function returns a value from the inverse cumulative distribution function.
- **r** is for "random" and the corresponding function returns a value drawn randomly from the given distribution.
To better understand what these functions do we'll focus on the four probability functions for the normal distribution: dnorm, pnorm, qnorm, and rnorm. Fig. 3 illustrates the defining relationships among these four functions.
- **dnorm** is the normal probability density function. Without any further arguments it returns the density of the standard normal distribution. If you plot dnorm(x) over a range of x-values you obtain the usual bell-shaped curve of the normal distribution. In Fig. 3, the value of dnorm(2) is indicated by the height of the vertical red line segment. It's just the y-coordinate of the normal curve when x = 2. ***Keep in mind that density values are not probabilities.*** To obtain probabilities one needs to integrate the density function over an interval. Alternatively if we consider a very small interval, say one of width $\Delta x$, and if f(x) is a probability density function, then it is the case that $P(x<X<x+\Delta x) \approx f(x)\,\Delta x$.
- **pnorm** is the cumulative distribution function for the normal distribution. By definition pnorm(x) = P(X < x) and is the area under the normal density curve to the left of x. Fig. 3 shows pnorm(2), the area under the normal density curve to the left of x = 2. As is indicated on the figure, this area is 0.977. So the probability that a standard normal random variate takes on a value less than or equal to 2 is 0.977
- **qnorm** is the quantile function of the standard normal distribution. If qnorm(x) = k then k is the value such that P(X < k) = x . qnorm is the inverse function for pnorm. From Fig. 3 we have, qnorm(0.977) = qnorm(pnorm(2)) = 2.
- **rnorm** generates random values from a standard normal distribution. The required argument is a number specifying the number of normal variates to produce. Fig. 3 illustrates rnorm(20), the locations of 20 random realizations from the standard normal distribution, jittered slightly to prevent overlap.
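The relationships described above can be checked numerically. The following is a minimal sketch using only base R; the variable names (`d`, `p`, `q`, `dx`, and so on) are chosen here for illustration and do not come from the figure.

```r
# Density at x = 2: the height of the standard normal curve, not a probability
d <- dnorm(2)

# Cumulative probability P(X <= 2)
p <- pnorm(2)

# qnorm inverts pnorm, so this recovers 2
q <- qnorm(p)

# Small-interval approximation: P(x < X < x + dx) is roughly f(x) * dx
dx <- 0.001
exact  <- pnorm(2 + dx) - pnorm(2)
approx <- dnorm(2) * dx

# Integrating the density up to 2 gives the same area that pnorm(2) reports
area <- integrate(dnorm, lower = -Inf, upper = 2)$value

# Twenty random draws from the standard normal distribution
draws <- rnorm(20)

round(c(density = d, cumulative = p, quantile = q), 3)
```

Rounded to three decimals, the density is about 0.054 and the cumulative probability about 0.977, which shows why the two should not be confused.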
## Creating Tables and Graphs
Function | Description
-------------------- | ----------------------------------------
`table(df$var1, useNA='ifany')` | Creates a table of counts of each value of the variable
`table(df$var1, df$var2, useNA='ifany')` | Creates a table of counts for each combination of values of the two variables (a cross-tabulation)
`prop.table(table(df$var))` | Creates a table of the proportion of each value for the variable
`barplot(table(df$var), las=3)` | Creates a bar plot of the counts of each value of the variable
`plot(x = df$var1, y = df$var2)` | Creates a scatterplot of the two variables<br> Technically you don't need the x= and y= as long as you put them first and in that order because by default the first 2 arguments are for the x and y variables
`plot(df$var1, df$var2, type = "l")` | Creates a line graph of the two variables
`hist(x)` | Creates a histogram of the single variable x
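As a rough illustration of the tabulation functions in the table, the sketch below builds a small made-up data frame. The names `df`, `var1`, and `var2` are placeholders matching the table, and the values are invented for the example.

```r
# Hypothetical data frame for illustration only
df <- data.frame(
  var1 = c("a", "b", "a", NA, "b", "a"),
  var2 = c("x", "x", "y", "y", "x", "x")
)

# Counts of each value, with NA reported as its own category
counts <- table(df$var1, useNA = "ifany")

# Cross-tabulation of the two variables
cross <- table(df$var1, df$var2, useNA = "ifany")

# Proportions of each value; NA is dropped here because useNA is not set
props <- prop.table(table(df$var1))

counts
cross
props
```

Note that `prop.table()` divides by the total of whatever table it receives, so the proportions change depending on whether missing values were counted in that table.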
# Reading data into R
## From CSV
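`read.csv("file/location", header = FALSE)`
On Windows, write the path with doubled backslashes (`\\`) or forward slashes (`/`), because a single backslash starts an escape sequence in an R string.
`header = FALSE` means the original file has no header row. The default for `read.csv` is `header = TRUE`.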
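To see what the `header` argument changes, here is a small self-contained sketch that writes a header-less file to a temporary location and reads it back both ways. The file contents are invented for illustration.

```r
# Write a small header-less file to a temporary location (illustrative data only)
path <- tempfile(fileext = ".csv")
writeLines(c("1,alpha", "2,beta", "3,gamma"), path)

# header = FALSE: the first line is data, so R names the columns V1, V2, ...
no_header <- read.csv(path, header = FALSE)

# The read.csv default is header = TRUE, which consumes the first line as column names
with_header <- read.csv(path)

names(no_header)
nrow(no_header)
nrow(with_header)

# Clean up the temporary file
unlink(path)
```

Reading a header-less file with the default setting silently loses one row of data, so it is worth checking `nrow()` after import.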
## From SPSS
Use the `foreign` package.
```
install.packages("foreign")
library(foreign)
df <- read.spss("file/location", to.data.frame=T, use.value.labels=T)
```
## From GitHub
Here is some sample code for reading a dataset into R from a file that has been posted in a GitHub repository:
```
library(RCurl)
x <- getURL("https://raw.github.com/aronlindberg/latent_growth_classes/master/LGC_data.csv")
y <- read.csv(text = x)
```
source: <http://stackoverflow.com/questions/14441729/read-a-csv-from-github-into-r>
**Make sure you copy the RAW data URL location.**
# Generating Random Numbers
For uniformly distributed (flat) random numbers, use runif(). By default, its range is from 0 to 1.
```
# Generate a random number from 0 to 1
runif(1)
#> [1] 0.09006613
# Get a vector of 4 random numbers from 0 to 1
runif(4)
#> [1] 0.6972299 0.9505426 0.8297167 0.9779939
# Get a vector of 3 numbers from 0 to 100
runif(3, min=0, max=100)
#> [1] 83.702278 3.062253 5.388360
# Get 3 integers from 0 to 100
# Use max=101 because it will never actually equal 101
floor(runif(3, min=0, max=101))
#> [1] 11 67 1
# This does almost the same thing, but draws from 1 to 100 (use 0:100 to include 0)
sample(1:100, 3, replace=TRUE)
#> [1] 8 63 64
# To generate integers WITHOUT replacement:
sample(1:100, 3, replace=FALSE)
#> [1] 76 25 52
```
To generate numbers from a normal distribution, use rnorm(). By default the mean is 0 and the standard deviation is 1.
```
rnorm(4)
#> [1] -2.3308287 -0.9073857 -0.7638332 -0.2193786
# Use a different mean and standard deviation
rnorm(4, mean=50, sd=10)
#> [1] 59.20927 40.12440 44.58840 41.97056
# To check that the distribution looks right, make a histogram of the numbers
x <- rnorm(400, mean=50, sd=10)
hist(x)
```
If you want to generate a sequence of random numbers, and then generate that same sequence again later, use set.seed(), and pass in a number as the seed.
```
set.seed(423)
runif(3)
#> [1] 0.1089715 0.5973455 0.9726307
set.seed(423)
runif(3)
#> [1] 0.1089715 0.5973455 0.9726307
```
# Random Samples
Use the sample( ) function to take a random sample of size n from a dataset.
```
# take a random sample of size 50 from a dataset mydata
# sample without replacement
mysample <- mydata[sample(1:nrow(mydata), 50,
replace=FALSE),]
```
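For a version that runs without a dataset of your own, the sketch below applies the same idea to the built-in `mtcars` data frame. The seed value and sample size are arbitrary choices made for this example.

```r
# Fix the seed so the same rows are drawn each time (arbitrary seed value)
set.seed(1)

# Draw 10 distinct row indices from the built-in mtcars data frame
idx <- sample(nrow(mtcars), 10, replace = FALSE)

# Keep only the sampled rows, with all columns
mysample <- mtcars[idx, ]

dim(mysample)
```

Sampling row indices first and subsetting afterwards keeps all columns of each sampled observation together.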
# Measuring elapsed time
The system.time() function will measure how long it takes to run a particular block of code in R.
```
system.time({
# Do something that takes time
x <- 1:100000
for (i in seq_along(x)) x[i] <- x[i]+1
})
#> user system elapsed
#> 0.144 0.002 0.153
```
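The `elapsed` value means it took 0.153 seconds of wall-clock time to run the block of code. The `user` and `system` values are the CPU time spent in the R process and in the operating system on its behalf.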
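The timing result can also be stored and inspected, which makes it easy to compare two approaches. The following sketch times the loop against a vectorized equivalent; the variable names are illustrative, and the actual timings will vary by machine.

```r
x <- 1:100000

# Element-by-element loop, timed as one block
loop_time <- system.time({
  y <- x
  for (i in seq_along(y)) y[i] <- y[i] + 1
})

# Vectorized equivalent of the same computation
vec_time <- system.time(z <- x + 1)

# The result behaves like a named numeric vector; "elapsed" is wall-clock seconds
loop_time[["elapsed"]]
vec_time[["elapsed"]]

# Both approaches produce the same values
all.equal(y, z)
```

For very fast expressions a single `system.time()` call is noisy, so repeating the measurement several times gives a more reliable comparison.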
# Subsetting Data
R has powerful indexing features for accessing object elements. These features can be used to select and exclude variables and observations. The following code snippets demonstrate ways to keep or delete variables and observations and to take random samples from a dataset.
## Selecting (Keeping) Variables
```
# select variables v1, v2, v3
myvars <- c("v1", "v2", "v3")
newdata <- mydata[myvars]
# another method same as above
myvars <- paste("v", 1:3, sep="")
newdata <- mydata[myvars]
# select 1st and 5th thru 10th variables
newdata <- mydata[c(1,5:10)]
```
To practice this interactively, try the [selection of data frame elements exercises](https://campus.datacamp.com/courses/free-introduction-to-r/chapter-5-data-frames?ex=6) in the Data frames chapter of this [introduction to R course](https://www.datacamp.com/courses/free-introduction-to-r).
## Excluding (DROPPING) Variables
```
# exclude variables v1, v2, v3
myvars <- names(mydata) %in% c("v1", "v2", "v3")
newdata <- mydata[!myvars]
# exclude 3rd and 5th variable
newdata <- mydata[c(-3,-5)]
# delete variables v3 and v5
mydata$v3 <- mydata$v5 <- NULL
```
## Selecting Observations in a data.frame
```
# first 5 observations
newdata <- mydata[1:5, ]
# first 5 variables/columns
newdata <- mydata[ ,1:5]
# row 2 column 6
newdata <- mydata[2,6]
# row 2 and 4 column 6
newdata <- mydata[c(2,4),6]
# based on variable values
# which() plays a role similar to the WHERE clause in SQL
newdata <- mydata[ which(mydata$gender=='F'
& mydata$age > 65), ]
# or
attach(mydata)
newdata <- mydata[ which(gender=='F' & age > 65),]
detach(mydata)
# with allows us to specify the columns of a data.frame without having to specify the data.frame name each time...
baseball$OBP <- with(baseball, (h + bb + hbp) / (ab + bb + hbp + sf))
```
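One practical reason to wrap the condition in `which()` is its handling of missing values. The sketch below uses a small invented data frame (the names `mydata`, `gender`, and `age` follow the snippets above, but the values are made up) to show the difference from a bare logical index.

```r
# Hypothetical data with one missing age
mydata <- data.frame(
  gender = c("F", "M", "F", "F"),
  age    = c(70, 80, NA, 66)
)

# which() returns only the positions where the condition is TRUE,
# so the row with a missing age is dropped
with_which <- mydata[which(mydata$gender == "F" & mydata$age > 65), ]

# A bare logical index keeps a row of NAs wherever the condition is NA
bare_logical <- mydata[mydata$gender == "F" & mydata$age > 65, ]

nrow(with_which)
nrow(bare_logical)
```

Here `with_which` has two rows while `bare_logical` has three, the extra one consisting entirely of `NA`.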
## Selection using the Subset Function
The subset( ) function is the easiest way to select variables and observations. In the following example, we select all rows that have a value of age greater than or equal to 20 or age less than 10. We keep the ID and Weight columns.
```
# using subset function
newdata <- subset(mydata, age >= 20 | age < 10,
select=c(ID, Weight))
```
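To try this call on actual data, here is a minimal sketch with an invented data frame. The column names match the example, the values are made up, and the bracket-indexing line is included only to show an equivalent form.

```r
# Hypothetical data for illustration only
mydata <- data.frame(
  ID     = 1:5,
  age    = c(5, 15, 25, 9, 20),
  Weight = c(20, 50, 70, 30, 65)
)

# Rows with age >= 20 or age < 10, keeping only ID and Weight
newdata <- subset(mydata, age >= 20 | age < 10,
                  select = c(ID, Weight))

# The same selection written with bracket indexing
same <- mydata[mydata$age >= 20 | mydata$age < 10, c("ID", "Weight")]

newdata
```

Inside `subset()` the column names are written without quotes and without the `mydata$` prefix, which is what makes it convenient for interactive use.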
In the next example, we select all men over the age of 25 and we keep variables weight through income (weight, income and all columns between them).
```
# using subset function (part 2)
newdata <- subset(mydata, sex=="m" & age > 25,
select=weight:income)