```{r include = FALSE}
RUN_MODEL <- ifelse(
length(system("echo $GITHUB_TOKEN", intern=TRUE)) <= 1,
TRUE,
FALSE
)
```
# (APPENDIX) Appendix {-}
# Solutions {#solutions}
## Getting Started
### Dimensions of a circle {-}
- Given the radius of a circle, write a few lines of code that calculate its area and its circumference. Run your code with different values assigned to the radius.
```{r}
radius <- 1
area <- pi * radius^2
circum <- 2 * pi * radius
```
- Print the solution as text.
```{r}
print(paste("Radius:", radius, " Area:", area, " Circumference:", circum))
```
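To run the same calculation for several radii without copy-pasting, the two formulas can be wrapped in a small helper function (a minimal sketch; the function name `circle_metrics` is illustrative and not part of the exercise):
```{r}
# helper returning area and circumference for a given radius
circle_metrics <- function(radius) {
  c(area = pi * radius^2, circumference = 2 * pi * radius)
}
# apply it to several radii at once
sapply(c(0.5, 1, 2), circle_metrics)
```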
### Sequence of numbers {-}
Generate a sequence of numbers from 0 to $\pi$ as a vector with length 5.
```{r}
seq(0, pi, length.out = 5)
```
### Gauss sum {-}
Rumor has it that young Carl Friedrich Gauss was asked in primary school to calculate the sum of all natural numbers between 1 and 100. He did it in his head in no time. We are very likely not as intelligent as young Gauss, but we have R. What's the solution?
```{r}
sum(1:100)
```
Gauss calculated the sum with a trick. The sum of 100 and 1 is 101. The sum of 99 and 2 is 101. You do this 50 times, and you get $50 \times 101$. Demonstrate Gauss' trick with vectors in R.
```{r}
vec_a <- 1:50
vec_b <- 100:51
vec_c <- vec_a + vec_b
# each element is 101
vec_c
# the vector has length 50, so the sum is 50 * 101
sum(vec_c)
```
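The same pairing argument yields the general formula $n(n+1)/2$ for the sum of the first $n$ natural numbers, which we can verify for $n = 100$:
```{r}
100 * (100 + 1) / 2
```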
### Magic trick algorithm {-}
Define a variable named `x` that contains an integer value and perform the following operations in sequence:
- Redefine `x` by adding 1.
- Double the resulting number, over-writing `x`.
- Add 4 to `x` and save the result as `x`.
- Redefine `x` as half of the previous value of `x`.
- Subtract the originally chosen arbitrary number from `x`.
Print `x`. Restart the algorithm defined above by choosing a new arbitrary natural number.
```{r}
x <- -999   # arbitrary integer
x_save <- x # save the original value for the last step
x <- x + 1
x <- x * 2
x <- x + 4
x <- x / 2
x <- x - x_save
print(x)
```
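To restart the algorithm with a new arbitrary number, it is convenient to wrap the steps in a function (a minimal sketch; the name `magic_trick` is illustrative). Whatever integer you start with, the result is always 3, since $((x + 1) \cdot 2 + 4)/2 - x = 3$:
```{r}
magic_trick <- function(x) {
  x_save <- x
  x <- x + 1
  x <- x * 2
  x <- x + 4
  x <- x / 2
  x - x_save
}
magic_trick(7)
magic_trick(-999)
```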
### Vectors {-}
Print the object `datasets::rivers` and consult the manual of this object.
- What is the class of the object?
- What is the length of the object?
- Calculate the mean, median, minimum, maximum, and the 33%-quantile across all values.
```{r}
class(datasets::rivers)
length(datasets::rivers)
mean(datasets::rivers)
median(datasets::rivers)
min(datasets::rivers)
max(datasets::rivers)
quantile(datasets::rivers, probs = 0.33)
```
### Data frames {-}
Print the object `datasets::quakes` and consult the manual of this object.
- Determine the dimensions of the data frame using the respective function in R.
- Extract the vector of values in the data frame that contain information about the Richter Magnitude.
- Determine the largest value in the vector of event magnitudes.
- Determine the geographic position of the epicenter of the largest event.
```{r}
dim(datasets::quakes)
vec <- datasets::quakes$mag
max(vec)
idx <- which.max(vec) # index of largest value
# geographic positions defined by longitude and latitude (columns long and lat)
datasets::quakes$long[idx]
datasets::quakes$lat[idx]
```
### Workspace {-}
*No solutions provided.*
## Programming primers
### Gauss variations {-}
```{r}
# for-loop to compute the sum from 1 to 100
sum <- 0
for (i in 1:100){
  sum <- sum + i # for-loop iterating from 1 to 100
}
print(sum)
```
```{r}
# while-loop to compute the sum from 1 to 100
loop_status <- TRUE
counter <- 0
sum <- 0
while (loop_status) { # the while-loop is repeated as long as loop_status is TRUE
  counter <- counter + 1
  sum <- sum + counter
  if (counter == 100) loop_status <- FALSE
}
print(sum)
```
```{r}
# Initiate sum variable
sum <- 0
# Go through the loop from 1 to 100
for (i in seq(100)) {
  # Check if the current number is a multiple of both three and seven
  # The modulo operator '%%' returns the remainder of a division
  if (i %% 3 == 0 && i %% 7 == 0) {
    sum <- sum + i
  }
}
print(paste0("The sum of multiples of 3 and 7 within 1-100 is: ", sum))
```
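For comparison, the same result can be obtained without a loop using vectorised operations; numbers divisible by both 3 and 7 are exactly the multiples of 21:
```{r}
vec <- 1:100
sum(vec[vec %% 3 == 0 & vec %% 7 == 0])
# equivalently, summing the multiples of 21 within 1-100
sum(seq(21, 100, by = 21))
```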
### Nested loops {-}
```{r}
mymat <- matrix(c(6, 7, 3, NA, 15, 6, 7,
                  NA, 9, 12, 6, 11, NA, 3,
                  9, 4, 7, 3, 21, NA, 6,
                  rep(NA, 7)),
                nrow = 4, byrow = TRUE)
myvec <- c(8, 4, 12, 9, 15, 6)
```
```{r message=FALSE}
# Loop over the rows in `mymat`.
for (i in 1:nrow(mymat)){
  # Loop over the columns in `mymat`.
  for (j in 1:ncol(mymat)){
    # Check if the current value is missing; if so, overwrite it with the maximum of 'myvec'
    if (is.na(mymat[i,j])){
      mymat[i,j] <- max(myvec)
    }
  }
  myvec <- myvec[-which.max(myvec)] # update the vector, removing the maximum value
}
mymat
```
### Interpolation {-}
```{r}
# Set up vector as required in the exercise
vec <- rep(NA, 100) # initialize vector of length 100 with NA
vec[1:25] <- 6 # populate first 25 elements of 'vec' with 6.
vec[66:100] <- -20 # populate elements 66:100 with -20.
# Determine index of last non-missing value before gap
last_non_na <- 1
while (!is.na(vec[last_non_na+1])) last_non_na <- last_non_na + 1
# determine index of first non-missing value after gap
first_non_na <- last_non_na + 1
while (is.na(vec[first_non_na])) first_non_na <- first_non_na + 1
# Get the increment that is needed for interpolation
last_value <- vec[last_non_na] # Last non-NA value
first_value <- vec[first_non_na] # First non-NA value
delta <- (last_value - first_value) / (last_non_na - first_non_na) # Change in y over change in x
# fill missing values incrementally
for (i in 2:length(vec)){
  if (is.na(vec[i])) vec[i] <- vec[i-1] + delta
}
plot(vec)
# or, more concisely, using the approx() function:
vec <- rep(NA, 100) # initialize vector of length 100 with NA
vec[1:25] <- 6      # populate the first 25 elements of 'vec' with 6
vec[66:100] <- -20  # populate elements 66:100 with -20
# approx() returns a list with elements x and y (the interpolated values)
vec <- approx(1:100, vec, xout = 1:100)
plot(vec)
```
## Data wrangling
### Star wars {-}
{dplyr} comes with a toy dataset `dplyr::starwars` (just type it into the console to see its content). Have a look at the dataset with `View()`. Play around with the dataset to get familiar with the {tidyverse} coding style. Use (possibly among others) the functions `dplyr::filter()`, `dplyr::arrange()`, `dplyr::pull()`, `dplyr::select()`, `dplyr::desc()` and `dplyr::slice()` to answer the following questions:
- How many pale characters come from the planets Ryloth or Naboo?
```{r message=FALSE}
dplyr::starwars |>
dplyr::filter(
skin_color == "pale" &
(homeworld == "Naboo" | homeworld == "Ryloth")
) |>
nrow()
```
- Who is the oldest among the tallest thirty characters?
```{r message=FALSE}
dplyr::starwars |>
arrange(desc(height)) |>
slice(1:30) |>
arrange(birth_year) |>
slice(1) |>
pull(name)
```
- What is the name of the shortest character and their starship in "Return of the Jedi"?
```{r message=FALSE}
dplyr::starwars |>
unnest(films) |>
filter(films == "Return of the Jedi") |>
unnest(starships) |>
arrange(height) |>
slice(1) |>
select(name, starships)
```
### Aggregating {-}
You have learned about aggregating in the {tidyverse}. Let's put it in practice.
- Reuse the code in the tutorial to read, reduce, and aggregate the `half_hourly_fluxes` dataset to the daily scale, calculating the following metrics across half-hourly `VPD_F` values within each day: mean, 25% quantile, and 75% quantile.
```{r message=FALSE, warning=FALSE}
# read half hourly fluxes
half_hourly_fluxes <- readr::read_csv(
"data/FLX_CH-Lae_FLUXNET2015_FULLSET_HH_2004-2006.csv"
)
# Select only variables that we are interested in
half_hourly_fluxes <- half_hourly_fluxes |>
dplyr::select(
starts_with("TIMESTAMP"),
ends_with("_F"),
GPP_NT_VUT_REF,
NEE_VUT_REF_QC,
-starts_with("SWC_F_MDS_"),
-contains("JSB")
)
# Clean the datetime objects
# and aggregate to daily scale
daily_fluxes <- half_hourly_fluxes |>
dplyr::mutate(
date_time = lubridate::ymd_hm(TIMESTAMP_START),
date = lubridate::date(date_time)) |>
dplyr::group_by(date) |>
dplyr::summarise(
mean = mean(VPD_F),
q25 = quantile(VPD_F, probs = 0.25),
q75 = quantile(VPD_F, probs = 0.75)
)
```
- Retain only the daily data for which the daily mean VPD is among the upper or the lower 10% quantiles.
```{r}
# In two steps. First, get thresholds of the quantiles
thresholds <- quantile(daily_fluxes$mean, probs = c(0.1, 0.9))
# Then, filter data to be above/below the upper/lower quantiles and combine
daily_fluxes_sub <- daily_fluxes |>
# in lower 10% quantile
filter(mean < thresholds[1]) |>
mutate(qq = "lower") |> # add label
# combine
bind_rows(
daily_fluxes |>
# in upper 10% quantile (above the 90% threshold)
filter(mean > thresholds[2]) |>
mutate(qq = "upper")
)
```
- Calculate the mean of the 25% and the mean of the 75% quantiles of half-hourly VPD within the upper and lower 10% quantiles of mean daily VPD.
```{r}
daily_fluxes_sub |>
group_by(qq) |>
summarise(
q25_mean = mean(q25),
q75_mean = mean(q75)
)
```
### Patterns in data quality {-}
The uncleaned dataset `FLX_CH-Lae_FLUXNET2015_FULLSET_HH_2004-2006.csv` holds half-hourly data that is sometimes of poor quality. Investigate whether NEE data quality is randomly spread across hours in a day by calculating the proportion of (i) actually measured data, (ii) good quality gap-filled data, (iii) medium quality data, and (iv) poor quality data within each hour-of-day (24 hours per day).
```{r message=FALSE, warning=FALSE}
# using half_hourly_fluxes read above
quality_by_hour <- half_hourly_fluxes |>
  mutate(TIMESTAMP_START = lubridate::ymd_hm(TIMESTAMP_START)) |>
  mutate(hour_of_day = lubridate::hour(TIMESTAMP_START)) |>
  group_by(hour_of_day) |>
  summarise(n_measured = sum(NEE_VUT_REF_QC == 0),
            n_good = sum(NEE_VUT_REF_QC == 1),
            n_medium = sum(NEE_VUT_REF_QC == 2),
            n_poor = sum(NEE_VUT_REF_QC == 3),
            n_total = n()
  ) |>
  mutate(f_measured = n_measured / n_total,
         f_good = n_good / n_total,
         f_medium = n_medium / n_total,
         f_poor = n_poor / n_total
  )
```
Interpret your findings: Are the proportions evenly spread across hours in a day?
```{r}
# this is not asked for but interesting. More on data visualisation in Chapter 5
# you can also just look at values of df$f_measured over the course of a day (hod)
quality_by_hour |>
pivot_longer(c(f_measured, f_good, f_medium, f_poor),
names_to = "quality",
values_to = "fraction") |>
ggplot(aes(x = hour_of_day,
y = fraction * 100, # *100 to get percentages
color = quality)) +
geom_line(linewidth = 1.5) + # make lines bit bigger
theme_classic() + # Pick a nice theme
scale_color_brewer( # Pick a nice color palette
"Quality", # Give legend a title
labels = c("Good gap filled data", "Measured data", "Medium gap filled data", "Poor gap filled data"), # Give legend levels a label
palette = 3, # Pick color palette
direction = -1 # Inverse order of color palette
) +
labs(
title = "Temporal pattern of GPP quality",
y = "Fraction of total GPP entries [%]",
x = "Hour of Day"
)
```
Perform an aggregation of the half-hourly GPP data (variable `GPP_NT_VUT_REF`) to daily means in two ways: once using the unmodified data read from the file `FLX_CH-Lae_FLUXNET2015_FULLSET_HH_2004-2006.csv`, and once using cleaned data where only actually measured (not gap-filled) half-hourly values are kept and aggregated. This yields two data frames with daily GPP data. Calculate the overall mean GPP for the two data frames (across all days in each data frame). Are the two overall mean GPP values equal? If not, why?
```{r message=FALSE, warning=FALSE}
daily_fluxes_all <- half_hourly_fluxes |>
dplyr::mutate(
date_time = lubridate::ymd_hm(TIMESTAMP_START),
date = lubridate::date(date_time)
) |>
dplyr::group_by(date) |>
dplyr::summarise(
GPP_NT_VUT_REF = mean(GPP_NT_VUT_REF, na.rm = TRUE)
)
daily_fluxes_cleaned <- half_hourly_fluxes |>
dplyr::mutate(
date_time = lubridate::ymd_hm(TIMESTAMP_START),
date = lubridate::date(date_time)
) |>
dplyr::mutate(
GPP_NT_VUT_REF = ifelse(NEE_VUT_REF_QC == 0, GPP_NT_VUT_REF, NA)
) |>
dplyr::group_by(date) |>
dplyr::summarise(
GPP_NT_VUT_REF = mean(GPP_NT_VUT_REF, na.rm = TRUE)
)
# overall means
daily_fluxes_all |>
summarise(
GPP_NT_VUT_REF = mean(GPP_NT_VUT_REF, na.rm = TRUE)
)
daily_fluxes_cleaned |>
summarise(
GPP_NT_VUT_REF = mean(GPP_NT_VUT_REF, na.rm = TRUE)
)
```
<!-- ### Tidying an Excel-file {-} -->
<!-- *No solutions provided because part of the final report* -->
## Data Visualisation
### Spurious data {-}
In Section \@ref(baddata), we discovered that certain values of `GPP_NT_VUT_REF` in the half-hourly data `half_hourly_fluxes` (to be read from the file `data/FLX_CH-Lae_FLUXNET2015_FULLSET_HH_2004-2006.csv`) are repeated with a spuriously high frequency. Determine all values of `GPP_NT_VUT_REF` that appear more than once in `half_hourly_fluxes` and label them as "spurious". Visualise the time series of the first two years of half-hourly GPP, mapping whether or not a value is spurious to the *color* aesthetic.
```{r}
# Read and wrangle data
half_hourly_fluxes <- readr::read_csv("data/FLX_CH-Lae_FLUXNET2015_FULLSET_HH_2004-2006.csv") |>
# set all -9999 to NA
dplyr::mutate(dplyr::across(dplyr::where(is.numeric),
~dplyr::na_if(., -9999))) |>
# interpret all variables starting with TIMESTAMP as a date-time object
dplyr::mutate_at(vars(starts_with("TIMESTAMP_")), lubridate::ymd_hm)
# determine spurious GPP_NT_VUT_REF values as those that are duplicated
# this creates a logical vector specifying whether the respective row has a
# duplicate
vec_spurious <- half_hourly_fluxes |>
# by keeping only one column, duplicated() determines duplications in
# that variable only
select(GPP_NT_VUT_REF) |>
duplicated()
# label spurious half-hourly data
half_hourly_fluxes <- half_hourly_fluxes |>
mutate(spurious = vec_spurious)
# visualise
ggplot(
data = half_hourly_fluxes |> slice(1:(48 * 365 * 2)), # first two years of half-hourly data
aes(x = TIMESTAMP_START, y = GPP_NT_VUT_REF)) +
geom_line() +
geom_point(aes(color = spurious), size = 0.9) +
labs(x = "Time",
y = expression(paste("GPP (", mu,"mol CO"[2], " m"^-2, "s"^-1, ")"))) +
scale_color_viridis_d() + # colorblind-friendly discrete color scale
theme_classic()
```
Then aggregate half-hourly to daily data, taking the mean of `GPP_NT_VUT_REF` and recording the proportion of underlying half-hourly data points that are "spurious". Visualise the time series of daily `GPP_NT_VUT_REF` with the color scale indicating the proportion of spurious half-hourly data that was used for determining the respective date's mean GPP.
```{r}
# aggregate
daily_fluxes <- half_hourly_fluxes |>
mutate(date = lubridate::date(TIMESTAMP_START)) |>
group_by(date) |>
summarise(frac_spurious = sum(spurious)/48,
GPP_NT_VUT_REF = mean(GPP_NT_VUT_REF))
# visualise
ggplot(
data = daily_fluxes,
aes(x = date, y = GPP_NT_VUT_REF)) +
geom_line() +
geom_point(aes(color = frac_spurious), size = 0.9) +
labs(x = "Time",
y = expression(paste("GPP (", mu,"mol CO"[2], " m"^-2, "s"^-1, ")"))) +
scale_color_viridis_c(direction = -1) + # inverse color scale is more intuitive here
theme_classic()
```
### Identifying Outliers {-}
A key part of data cleaning is to detect and understand outliers. Visualisations can help. Your task here is to find outliers in `GPP_NT_VUT_REF`.
First, using the half-hourly fluxes data, determine "outliers" as those values of `GPP_NT_VUT_REF` that fall outside the range $Q_1 - 1.5 (Q_3 - Q_1)$ to $Q_3 + 1.5 (Q_3 - Q_1)$, where $Q_1$ and $Q_3$ are the first and third quartiles. Plot `GPP_NT_VUT_REF` versus shortwave radiation and highlight outliers in red.
> Hint: Use `boxplot.stats()` to return a list containing a vector of the data points which lie beyond the extremes of the whiskers of the boxplot.
> Hint: Use `scale_color_manual()` to manually define the color scale.
```{r}
vec_outliers <- boxplot.stats(half_hourly_fluxes$GPP_NT_VUT_REF)$out
plot_data <- half_hourly_fluxes |>
mutate(outlier = GPP_NT_VUT_REF %in% vec_outliers)
plot_data |>
ggplot(aes(x = SW_IN_F, y = GPP_NT_VUT_REF, color = outlier)) +
geom_point() +
scale_color_manual("Outlier?", # Set title of legend
values = c("black", "red"), # Highlight in red
labels = c("No", "Yes") # Add labels to the legend
) +
labs(x = expression(paste("Shortwave radiation (W m"^-2, ")")),
y = expression(paste("GPP (", mu,"mol CO"[2], " m"^-2, "s"^-1, ")"))) +
theme_classic()
```
Now, we want to "control" for the influence of shortwave radiation on GPP and define outliers with respect to the distribution of *residuals* of the linear regression between the two variables. Relax the definition of an outlier to values falling outside the range $Q_1 - 5 (Q_3 - Q_1)$ to $Q_3 + 5 (Q_3 - Q_1)$. Again, plot `GPP_NT_VUT_REF` versus shortwave radiation and highlight outliers in red.
> Hint: Fit the linear regression model as `lm(GPP_NT_VUT_REF ~ SW_IN_F, data = half_hourly_fluxes)` and obtain the residuals from the object returned by the `lm()` function (see 'Value' in its help page).
> Hint: The output of `boxplot.stats(x)` is a list containing an element `out`. `out` is a named vector of the outlier values, with names referring to the row numbers of `x`. Use `as.integer(names(boxplot.stats(x)$out))` to get the row numbers.
```{r}
residuals <- lm(GPP_NT_VUT_REF ~ SW_IN_F, data = half_hourly_fluxes)$residuals
# Note: matching outlier *values* against GPP_NT_VUT_REF does not work here,
# because boxplot.stats() is applied to the residuals, so the returned outlier
# values are residuals, not GPP values:
# vec_outliers <- boxplot.stats(residuals, coef = 5)$out
# plot_data <- half_hourly_fluxes |>
#   mutate(outlier = GPP_NT_VUT_REF %in% vec_outliers)
# Instead, use the names of the returned values, which refer to row numbers:
rowindex_outliers <- as.integer(names(boxplot.stats(residuals, coef = 5)$out))
plot_data <- half_hourly_fluxes |>
  mutate(rowindex = dplyr::row_number()) |>
  mutate(outlier = rowindex %in% rowindex_outliers)
plot_data |>
ggplot(aes(x = SW_IN_F, y = GPP_NT_VUT_REF, color = outlier)) +
geom_point() +
scale_color_manual("Outlier?", # Set title of legend
values = c("black", "red"), # Highlight in red
labels = c("No", "Yes") # Add labels to the legend
) +
labs(x = expression(paste("Shortwave radiation (W m"^-2, ")")),
y = expression(paste("GPP (", mu,"mol CO"[2], " m"^-2, "s"^-1, ")"))) +
theme_classic()
```
**What do we see in this plot?** We see that the red points, the outliers, fall outside the main cloud of black points. The distribution of these outliers shows no obvious systematic pattern or deviation. Nonetheless, it is good practice to go a step further and look at these data points in detail to find out whether they should be removed from your analysis or not. In later chapters, you will learn more about the disproportionate role outliers can play and how they may affect your statistical models and analyses.
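As a starting point for such a closer look, one could, for example, tabulate the quality flag of the underlying NEE data among the flagged outliers (a minimal sketch; which columns to inspect is a judgment call, not prescribed by the exercise):
```{r}
# how are the flagged outliers distributed across NEE quality flags?
plot_data |>
  filter(outlier) |>
  count(NEE_VUT_REF_QC)
```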
### Visualising diurnal and seasonal cycles of GPP {-}
As explored in the previous Chapter's exercises, GPP varies over diurnal and seasonal cycles. Create a publication-ready figure that visualises the mean diurnal cycle of GPP for each day-of-year (mean across multiple years). Make sure that the figure is properly labelled, and legible for readers with a color vision deficiency.
> Hint: To get the diurnal and seasonal cycles, summarise the half-hourly data by the hour of the day and the day of the year simultaneously using multiple grouping variables within `group_by()` and calculate mean values for GPP for each group.
> Hint: Choose an appropriate visualisation that maps the hour-of-day to the x-axis and the day-of-year to the y-axis.
```{r message=FALSE, warning=FALSE}
# Aggregate to hours-in-day for each day-in-year
fluxes_per_hod_doy <- half_hourly_fluxes |> # df from previous exercise
dplyr::mutate(
hour_day = lubridate::hour(TIMESTAMP_START), # hour of the day
day_year = lubridate::yday(TIMESTAMP_START)) |> # day of the year
dplyr::group_by(hour_day, day_year) |> # multiple grouping
dplyr::summarise(gpp = mean(GPP_NT_VUT_REF))
# Publication-ready raster plot
fluxes_per_hod_doy |>
# Specify aesthetics
ggplot(aes(x = hour_day,
y = day_year,
fill = gpp)) + # fill color of the raster
geom_raster() +
# Use a color scale that works also for color-blind people
scale_fill_viridis_c(option = "magma") +
# adjust the aspect ratio of the plotting region
coord_fixed(ratio = 0.18) +
# labels of each mapping axis, \n is a line break
labs(title = "Gross primary production",
subtitle = "Diurnal and seasonal cycle",
x = "Hour of day",
y = "Day of year",
fill = expression(paste(mu,"mol CO"[2], " m"^-2, "s"^-1))) +
# avoid having a small padding from the lowest values to the end of axes
scale_x_continuous(expand = c(0,0)) +
scale_y_continuous(expand = c(0,0))
```
Beautiful, isn't it? On practically all days of the year, we see the diurnal cycle of GPP, which follows the sun's course. Over the year, GPP increases rapidly in spring, when the trees put out their leaves. After a highly productive summer, temperatures drop, senescence kicks in, and GPP gradually declines to its winter low.
### Trend in carbon dioxide concentrations {-}
This exercise explores the longest available atmospheric CO$_2$ record, obtained at the Mauna Loa observatory in Hawaii. Atmospheric CO$_2$ in the northern hemisphere is characterised by seasonal swings, caused by the seasonal course of CO$_2$ uptake and release by the terrestrial biosphere. We have explored the seasonality of the CO$_2$ uptake measured at one site (in Switzerland) extensively in this and previous chapters. Your task here is to calculate and visualise the long-term trend of CO$_2$. Follow these steps:
- Download the monthly mean CO$_2$ data as a CSV file from [here](https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_mm_mlo.csv) and read it into R.
```{r warnings=FALSE, message=FALSE}
# download the file
download.file(
"https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_mm_mlo.csv",
"data/co2_mm_mlo.csv"
)
# read in the data
ml_co2 <- readr::read_csv(
"data/co2_mm_mlo.csv",
skip = 40
)
# wrangle the data
# interpret missing values
# rename to avoid white space in column name
ml_co2 <- ml_co2 |>
dplyr::mutate(dplyr::across(dplyr::where(is.numeric),
~dplyr::na_if(., -9.99))) |>
dplyr::mutate(dplyr::across(dplyr::where(is.numeric),
~dplyr::na_if(., -0.99))) |>
dplyr::mutate(dplyr::across(dplyr::where(is.numeric),
~dplyr::na_if(., -1))) |>
dplyr::rename(year_dec = `decimal date`)
```
- Make a simple graph for visualizing the monthly CO$_2$ time series.
```{r}
ml_co2 |>
ggplot2::ggplot() +
ggplot2::geom_line(aes(year_dec, average))
```
- Write a function that computes a 12-month running mean of the CO$_2$ time series. The running mean for month $m$ should consider the values of months $m-5$ to $m+6$. Define arguments for the function that let the user specify the width of the running mean "box" (i.e., setting the $5$ and $6$ to any other integers of choice).
```{r}
# Write the running mean function
# variables are defined as:
# Input vector (vec)
# Number of elements to the left (left)
# Number of elements to the right (right)
running_mean <- function(
    vec,
    left,
    right
    ) {
  # Create an output vector of the same length as the input data, filled with NA
  vec_out <- rep(NA, length(vec))
  # Loop over the positions for which the full averaging window is available
  for (idx in (left+1):(length(vec)-right)){
    # Define start and end of the box to average over
    startbox <- idx - left
    endbox <- idx + right
    vec_out[idx] <- mean(vec[startbox:endbox], na.rm = TRUE)
  }
  return(vec_out)
}
ml_co2 <- ml_co2 |>
mutate(
average_12m = running_mean(average, left = 5, right = 6)
)
```
- Make a publication-ready figure that shows the monthly and the 12-month running mean time series of the CO$_2$ record.
> Hint: To automatically render the time axis with ggplot, you can create a time object by combining the year and month columns: `lubridate::ymd(paste(as.character(year), "-", as.character(month), "-15"))`
```{r}
# create a date object for nice plotting
plot_data <- ml_co2 |>
mutate(
date = lubridate::ymd(
paste(as.character(year),
"-",
as.character(month), "-15") # centering monthly mean on the 15th of each month
)
)
plot_data |>
ggplot() +
# monthly means
geom_line(
aes(
date,
average,
color = "Monthly mean"
)
) +
# running mean
geom_line(
aes(
date,
average_12m,
color = "12-month running mean"
)
) +
# Style the plot
theme_classic() +
theme(
legend.position = c(0.25, 0.75) # Move legend into the plot
) +
scale_color_manual(
"", # Omit legend title
values = c("tomato", "black"),
labels = c("12-month running mean", "Monthly mean")
) +
labs(
title = expression(
paste("Atmospheric CO"[2],
" concentrations on Manua Lao, Hawaii")
),
y = expression(paste("CO"[2], " (ppm)")),
x = "Year"
)
```
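If a single number describing the long-term trend is desired, one simple option (admittedly crude, since the increase is not strictly linear) is to regress the monthly means on the decimal year and read off the slope in ppm per year:
```{r}
# average rate of CO2 increase over the full record (ppm per year)
coef(lm(average ~ year_dec, data = ml_co2))["year_dec"]
```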
<!-- ### Telling a story from data {-} -->
<!-- *No solutions provided because part of the final report.* -->
## Data Variety
### Files and file formats {-}
#### Reading and writing human readable files {.unnumbered}
While **not leaving your R session**, download and open the files at the following locations:
> The code below shows how to read in the different demo data sets (CSV files). You will note that they all need separate settings, and that a given file extension isn't necessarily a reflection of the content of the file. Inspecting the data you read in is therefore key.
```{r}
# read in the first demo
demo_01 <- read.table(
  "https://raw.githubusercontent.com/geco-bern/agds/main/data/demo_1.csv",
  sep = ",",
  header = TRUE
)
# read in the second demo
demo_02 <- read.table(
  "https://raw.githubusercontent.com/geco-bern/agds/main/data/demo_2.csv",
  sep = " ",
  header = TRUE
)
# read in the third demo
demo_03 <- read.table(
  "https://raw.githubusercontent.com/geco-bern/agds/main/data/demo_3.csv",
  sep = ";",
  comment.char = "|",
  header = TRUE
)
```
All the demo data sets contain the same data and differ only in their formatting. We can test whether the content is identical using the `identical()` function in R.
```{r}
# compare 1 with 2
identical(demo_01, demo_02)
# compare 2 with 3
identical(demo_02, demo_03)
# Given transitive properties, demo_01 is identical to demo_03
```
Once loaded into your R environment, combine and save all data as a **temporary** `CSV` file. Read in the new temporary `CSV` file, and save it as a `JSON` file in your current working directory.
> You can combine the three datasets using the {dplyr} `bind_rows()` function.
```{r}
# combining all demo datasets
demo_all <- dplyr::bind_rows(demo_01, demo_02, demo_03)
# writing the data to a temporary CSV file
write.table(
demo_all,
file = file.path(tempdir(), "tmp_csv_file.csv"),
col.names = TRUE,
row.names = FALSE,
sep = ","
)
# or...
write.csv(
demo_all,
file.path(tempdir(), "tmp_csv_file.csv"),
row.names = FALSE
)
# read in the previous CSV file
demo_all_new <- read.table(
file.path(tempdir(), "tmp_csv_file.csv"),
header = TRUE,
sep = ","
)
# writing the data to a JSON file
jsonlite::write_json(demo_all_new, path = "./my_json_file.json")
```
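To check that the round trip worked (not required by the exercise), the JSON file can be read back in and compared to the combined data frame; a minimal sketch:
```{r}
# read the JSON file back in and compare with the combined data
demo_from_json <- jsonlite::fromJSON("./my_json_file.json")
all.equal(demo_all_new, demo_from_json)
```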
#### Reading and writing binary files {.unnumbered}
Download and open the following file:
``` bash
https://raw.githubusercontent.com/geco-bern/agds/main/data/demo_data.nc
```
1. What file format are we dealing with?
It is a NetCDF file (ending `.nc`, see Table in Chapter \@ref(datavariety))
2. What library would you use to read this kind of data?
Different libraries are available, including {terra}, {raster} (the predecessor of {terra}), and {ncdf4}, see Table in Chapter \@ref(datavariety).
3. What does this file contain?
In R:
```{r eval=FALSE}
# read unknown netcdf file using the {terra} library
library(terra)
unknown_netcdf <- terra::rast(
"https://raw.githubusercontent.com/geco-bern/agds/main/data/demo_data.nc"
)
# print the meta-data by calling the variable
unknown_netcdf
# visually plot the data
terra::plot(unknown_netcdf)
```
From your terminal (when the file is located in the current working directory):
``` bash
ncdump -h demo_data.nc
```
When printing the object in R, we get: `varname : t2m (2 metre temperature)`.
4. Write this file to disk in a different geospatial format you desire (use the R documentation of the library used to read the file and the chapter information).
```{r eval=FALSE}
# write the data as a geotiff (other options are possible as well in writeRaster)
terra::writeRaster(
unknown_netcdf,
filename = "./test.tif",
overwrite = TRUE
)
```
5. Download and open the following file: `https://raw.githubusercontent.com/geco-bern/agds/main/data/demo_data.tif`. Does this data seem familiar, and how can you tell? What are your conclusions?
```{r eval=FALSE}
# read unknown tif file using the {terra} library
library(terra)
unknown_tif <- terra::rast(
"https://raw.githubusercontent.com/geco-bern/agds/main/data/demo_data.tif"
)
# print the meta-data by calling the variable
unknown_tif
# visually plot the data
terra::plot(unknown_tif)
# Are they exactly the same?
terra::plot(unknown_tif - unknown_netcdf)
# or...
identical(unknown_netcdf, unknown_tif)
```
The data looks similar to the NetCDF data, but temperature appears to be given in degrees Celsius in the GeoTIFF file and in Kelvin in the NetCDF file.
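A quick way to check this interpretation (assuming the only difference is the unit, i.e., an offset of 273.15) is to convert one of the layers and look at the remaining difference:
```{r eval=FALSE}
# if the GeoTIFF is in degrees Celsius and the NetCDF in Kelvin,
# this difference should be (close to) zero everywhere
terra::plot(unknown_tif - (unknown_netcdf - 273.15))
```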
### API Use {-}
#### Get {.unnumbered}
1. We can get the total sand content by following the tutorial, using new coordinates outlining Switzerland.
```{r}
# set API URL endpoint
# for the total sand content
url <- "https://thredds.daac.ornl.gov/thredds/ncss/ornldaac/1247/T_SAND.nc4"
# formulate query to pass to httr
query <- list(
"var" = "T_SAND",
"south" = 45.5,
"west" = 5.9,
"east" = 10.7,
"north" = 48,
"disableProjSubset" = "on",
"horizStride" = 1,
"accept" = "netcdf4"
)
# download data using the
# API endpoint and query data
status <- httr::GET(
url = url,
query = query,
httr::write_disk(
path = file.path(tempdir(), "T_SAND.nc"),
overwrite = TRUE
)
)
# to visualize the data
# we need to load the {terra}
# library
sand <- terra::rast(file.path(tempdir(), "T_SAND.nc"))
terra::plot(sand)
```
2. Consulting the original data pages or the package help files, one can determine that the parameter "T_SAND" needs to be replaced by "T_SILT" in both the URL and the query.
```{r}
# set API URL endpoint
# for the total silt content
url <- "https://thredds.daac.ornl.gov/thredds/ncss/ornldaac/1247/T_SILT.nc4"
# formulate query to pass to httr
query <- list(
"var" = "T_SILT",
"south" = 45.5,
"west" = 5.9,
"east" = 10.7,
"north" = 48,
"disableProjSubset" = "on",
"horizStride" = 1,
"accept" = "netcdf4"
)
# download data using the
# API endpoint and query data
status <- httr::GET(
url = url,
query = query,
httr::write_disk(
path = file.path(tempdir(), "T_SILT.nc"),
overwrite = TRUE
)
)
# to visualize the data
# we need to load the {terra}
# library
silt <- terra::rast(file.path(tempdir(), "T_SILT.nc"))
terra::plot(silt)
```
#### Dedicated libraries {.unnumbered}
1. Using the {hwsdr} package, this simplifies to:
```{r}
# Download a soil fraction map
# of sand for a given bounding box
hwsdr::ws_subset(
location = c(45.5, 5.9, 48, 10.7),
param = "T_SAND",
path = tempdir()
)
# to visualize the data
# we need to load the {terra}
# library
sand <- terra::rast(file.path(tempdir(), "T_SAND.nc"))
terra::plot(sand)
```
2. You can easily list all {MODISTools} products using:
```{r}
# list all products
products <- MODISTools::mt_products()
# count the number of available products
nrow(products)
```
3. You can use the {MODISTools} package to easily query the land cover data. Use the `MODISTools::mt_products()` and `MODISTools::mt_bands()` functions to determine the product and band to use (i.e., MCD12Q1). Note that {MODISTools} does not use a bounding box but a subset defined in km left/right and above/below a point, so an approximation of the region is good enough. A possible query is sketched below.
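The following sketch is not a worked solution from the text: the band name, coordinates, dates, and subset size are assumptions to be checked against `MODISTools::mt_products()` and `MODISTools::mt_bands()` (not evaluated here to avoid a lengthy download):
```{r eval=FALSE}
# list the bands available for the MCD12Q1 land cover product
MODISTools::mt_bands(product = "MCD12Q1")
# download land cover data around a point of interest,
# approximating the region of interest by a km-based subset
land_cover <- MODISTools::mt_subset(
  product = "MCD12Q1",
  band = "LC_Type1",   # assumed band: IGBP land cover classification
  lat = 46.95,         # assumed example coordinates (Bern)
  lon = 7.45,
  start = "2019-01-01",
  end = "2019-12-31",
  km_lr = 50,          # km to the left/right of the point
  km_ab = 50           # km above/below the point
)
```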