-
Notifications
You must be signed in to change notification settings - Fork 6
/
Copy path03-data_wrangling.Rmd
843 lines (574 loc) · 66 KB
/
03-data_wrangling.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
# Data wrangling {#datawrangling}
**Chapter lead author: Benjamin Stocker**
## Learning objectives
In this chapter you will learn how to manipluate and transform data, a curcial part of the data science workflow.
You will learn how to:
- read and transform tabulated data
- understand the 'tidy' data concept
- select variables
- Aggregate data
- handle bad and/or missing data
## Tutorial
Exploratory data analysis - the transformation, visualization, and modelling of data - is the central part of any (geo-) data science workflow and typically takes up a majority of the time we spend on a research project. The transformation of data often turns out to be particularly (and often surprisingly) time-demanding. Therefore, it is key to master typical steps of data transformation, and to implement them in a transparent fashion and efficiently - both in terms of robustness against coding errors ("bugs") and in terms of code execution speed.
We refer to data *wrangling* here to encompass the steps for preparing the data set *prior* to modelling - including, the combination of variables from different data sources, the removal of bad data, and the aggregation of data to the desired resolution or granularity (e.g., averaging over all time steps in a day, or over all replicates in a sample).
In contrast, *pre-processing* refers to the additional steps that are either required by the the specific machine learning algorithm used with the data (e.g., centering and scaling for K-Nearest Neighbors or Neural Networks), the gap-filling of variables, or the transformation of variables guided by the resulting improvement of the predictive power of the machine learning model. Pre-processing is part of the modelling workflow and includes all steps that apply transformations that use parameters derived from the data. We will introduce and discuss data pre-processing in Chapter \@ref(supervisedmli).
### Example data
The example data used in this chapter are parallel time series of (gaseous) CO$_2$ and water vapor exchange fluxes between the vegetation and the atmosphere, along with various meteorological variables measured in parallel. Quasi-continuous measurements of temporally changing gas exchange fluxes are obtained with the eddy covariance technique which relies on the parallel quantification of vertical wind speeds and gas concentrations.
The data are provided at half-hourly resolution for the site [CH-Lae](https://www.swissfluxnet.ethz.ch/index.php/sites/ch-lae-laegeren/site-info-ch-lae/), located on the south slope of the Lägern mountain on the Swiss Plateau at 689 m a.s.l. in a mixed forest with a distinct seasonal course of active green leaves (a substantial portion of the trees in the measured forest are deciduous). The dataset is generated and formatted following standard protocols ([FLUXNET2015](https://fluxnet.org//data/fluxnet2015-dataset/)). For more information of the variables in the dataset, see the [FLUXNET2015 website](http://fluxnet.fluxdata.org/data/fluxnet2015-dataset/) and [Pastorello et al., 2020](https://www.nature.com/articles/s41597-020-0534-3) for a comprehensive documentation of variable definitions and methods.
For our demonstrations, the following variables are the most relevant:
- `TIMESTAMP_START`: Hour and day of the start of the measurement period for which the respective row's data are representative. Provided in a format of "YYYYMMDDhhmm".
- `TIMESTAMP_END`: Hour and day of the end of the measurement period for which the respective row's data are representative. Provided in a format of "YYYYMMDDhhmm".
- `TA_*` (°C): Air temperature.
- `SW_IN_*` (W m$^{-2}$): Shortwave incoming radiation
- `LW_IN_*` (W m$^{-2}$): Longwave incoming radiation
- `VPD_*` (hPa): Vapor pressure deficit (the difference between actual and saturation water vapor pressure)
- `PA_*` (kPa): Atmospheric pressure
- `P_*` (mm): Precipitation
- `WS_*` (m $^{-1}$): Wind speed
- `SWC_*` (%): Volumetric soil water content
- `GPP_*` ($\mu$mol CO$_2$ m$^{-2}$ s$^{-1}$): Gross primary production (the ecosystem-level gross CO$_2$ uptake flux driven by photosynthesis)
- `*_QC`: Quality control information for the variable `*`. Important for us: `NEE_*_QC` is the quality control information for the net ecosystem CO$_2$ exchange flux (`NEE_*`) and for GPP derived from the corresponding NEE estimate (`GPP_*`). 0 = measured, 1 = good quality gap-filled, 2 = medium, 3 = poor.
Suffixes `_*` indicate that multiple estimates for respective variables are available and distinguished by different suffixes. For example, variables `TA_*` contain the same information, but are derived with slightly different assumptions and gap-filling techniques. The meanings of suffixes are described in [Pastorello et al., 2020](https://www.nature.com/articles/s41597-020-0534-3).
### Tidyverse
The tidyverse is a collection of R packages and functions that share a common design philosophy, enabling a particularly efficient implementation of transformation steps on tabular data. The most important data and function design principle of the tidyverse is that each function takes a data frame as its first argument and returns a data frame as its output.
<!-- Working on a (geo-) data science project, we want to efficiently progress through the multiple circles of exploratory data analysis. The following aspects are particularly important for fast and error-free progression. First, the programming language should be conducive of fast, intuitive, and error-free coding. Second, code that we have written once should be legible by others and by our future selves (even after a long holiday or after the manuscript has been seen by all reviewers months after we finished the analysis for our initial submission). It's the daily reality of us (geo-) data scientist that the code we write is harder to read than text in a newspaper, and that it will have bugs. There is (so far) no magic solution to this. But the R tidyverse comes close. -->
From this design principles, even the most convoluted code and implementation of data transformation steps fall into place and fast and error-free progression through exploratory data analysis is facilitated. Therefore, you will be introduced to the R {tidyverse} here and we heavily rely on this *dialect* of the R language throughout the remainder of this course.
### Reading tabular data
Tabular data are organised in rows and columns. R data frames are tabular data. As introduced in Chapter \@ref(gettingstarted), each column can be regarded as a vector of a certain type. Each row contains the same number of columns and each column contains the same type of values (for example numeric, or characters). Each row can be regarded as a separate instance or data record - for example a record of simultaneously taken measurements of different variables, along with some attributes and meta information (e.g., the date). In Chapter \@ref(datavariety), you will be introduced to other types of data.
The most common format for tabular data are CSV (comma-separated-values), typically indicated by the file name suffix `.csv`. CSV is a text-based file format, readable across platforms and does not rely on proprietary software (as opposed to, for example, `.xlsx`). The first row in a CSV file typically specifies the name of the variable provided in the respective column.
Let's get started with working with our example data set and read it into R, as the variable `half_hourly_fluxes`. Note that the naming of variables can be important for keeping code legible. Chose intuitively understandable names that describe what the object represents (as done here).
To import the data into the R environment, we use the function `read_csv()` from the {readr} package (part of tidyverse). In other R code, you will also encounter the base R `read.csv()` function. However, `read_csv()` is much faster and reads data into a tidyverse-data frame (a *tibble*) which has some useful additional characteristics, on top of a common R data frame. For example, *tibbles* generate a nicely readable output when printing the object as is done below.
```{r results=TRUE, message=FALSE, warning=FALSE}
half_hourly_fluxes <- readr::read_csv("./data/FLX_CH-Lae_FLUXNET2015_FULLSET_HH_2004-2006.csv")
half_hourly_fluxes
```
> To reproduce this code chunk, you can download the file `FLX_CH-Lae_FLUXNET2015_FULLSET_HH_2004-2006.csv` from [here](https://raw.githubusercontent.com/geco-bern/agds/main/data/FLX_CH-Lae_FLUXNET2015_FULLSET_HH_2004-2006.csv) and read it from the local path where the file is stored on your machine. All data files used in this tutorials are stored [here](https://github.com/geco-bern/agds/tree/main/data).
Since the file is properly formatted, with variable names given in the first line of the file, the function `read_csv()` identifies them correctly as column names and interprets values in each column as values of a consistent type (as numeric `<dbl>`). The file is also automatically machine-readable because it has no merged cells and only one value per cell.
### Variable selection
For our further data exploration, we will reduce the data frame we are working with and select a reduced set of variables. Reducing the dataset can have the advantage of speeding up further processing steps, especially when the data are large. For the further steps in this chapter we will now subset our original data. We select the following variants of variables described above, plus some additional variables (further information in [Pastorello et al., 2020](https://www.nature.com/articles/s41597-020-0534-3)):
- All variables with names starting with `TIMESTAMP`)
- All meteorological variables derived following the "final gap-filled method", as indicated with names ending with `_F`.
- GPP estimates that are based on the nighttime decomposition method, using the "most representative" of different gap-filling versions, after having applied the variable u-star filtering method (`GPP_NT_VUT_REF`) and the corresponding quality control information (`NEE_VUT_REF_QC`)
- Soil water measured at different depths (variables starting with `SWC_F_MDS_`)
- Do not use any radiation variables derived with the "JSBACH" algorithm (not with a name that contains the string `JSB`)
- Flag indicating whether a time step is at night (`NIGHT`)
This is implemented by:
```{r}
half_hourly_fluxes <- select(
half_hourly_fluxes,
starts_with("TIMESTAMP"),
ends_with("_F"),
GPP_NT_VUT_REF,
NEE_VUT_REF_QC,
starts_with("SWC_F_MDS_"),
-contains("JSB"),
NIGHT
)
```
This reduces our dataset from 235 available variables to 20 variables. As you can see, `select()` is a powerful tool to apply multiple selection criteria on your data frame in one step. It takes many functions that make filtering the columns easier. For example, criteria can be formulated based on the variable names with `starts_with()`, `ends_with()`, `contains()`, `matches()`, etc. Using these functions within `select()` can help if several column names start with the same characters or contain the same pattern and all need to be selected. If a minus (`-`) is added in front of a column name or one of the mentioned functions within `select()`, then R will not include the stated column(s). Note that the selection criteria are evaluated in the order we write them in the `select()` function call. You can find the complete reference for selecting variables [here](https://dplyr.tidyverse.org/reference/select.html).
### Time objects
The automatic interpretation of the variables `TIMESTAMP_START` and `TIMESTAMP_END` by the function `read_csv()` is not optimal:
```{r}
class(half_hourly_fluxes$TIMESTAMP_START[[1]])
as.character(half_hourly_fluxes$TIMESTAMP_START[[1]])
```
As we can see, it is considered by R as a numeric variable with 12 digits ("double-precision", occupying 64 bits in computer memory). After printing the variable as a string, we can guess that the format is: `YYYYMMDDhhmm`. The {lubridate} package is designed to facilitate processing date and time objects. Knowing the format of the timestamp variables in our dataset, we can use `ymd_hm()` to convert them to actual date-time objects.
```{r, message=FALSE}
dates <- ymd_hm(half_hourly_fluxes$TIMESTAMP_START)
dates[1]
```
Working with such date-time objects facilitates typical operations on time series. For example, adding one day can be done by:
```{r}
nextday <- dates + days(1)
nextday[1]
```
The following returns the month of each date object:
```{r}
month(dates[1])
```
The number 1 stands for the month of the year, i.e., January. You can find more information on formatting dates and time within the {tidyverse} [here](https://r4ds.had.co.nz/dates-and-times.html), and a complete reference of the {lubridate} package is available [here](https://lubridate.tidyverse.org/).
### Variable (re-) definition
Since `read_csv()` did not interpret the `TIMESTAMP_*` variables as desired, we may convert the entire column in the data frame into a date-time object. In base-R, we would do this by:
```{r eval=FALSE}
half_hourly_fluxes$TIMESTAMP_START <- ymd_hm(half_hourly_fluxes$TIMESTAMP_START)
```
Modifying existing or creating new variables (columns) in a data frame is done in the {tidyverse} using the function `mutate()`. The equivalent statement is:
```{r eval=FALSE}
half_hourly_fluxes <- mutate(
half_hourly_fluxes,
TIMESTAMP_START = ymd_hm(TIMESTAMP_START)
)
```
> Note: Avoid using whitespaces ('Leerzeichen') to name columns in a dataframe because it can cause troubles down the line. Instead, use `_` to separate words in one name.
In the code chunk above, the function `mutate()` is from the tidyverse package {dplyr}. It takes a dataframe as its first argument (here `half_hourly_fluxes`) and returns a dataframe as its output. You will encounter an alternative, but equivalent, syntax in the following form:
```{r, eval=FALSE}
half_hourly_fluxes <- half_hourly_fluxes |>
mutate(TIMESTAMP_START = ymd_hm(TIMESTAMP_START))
```
Here, the **pipe** operator `|>` is used. It "pipes" the object evaluated on its left side into the function on its right side, where the object takes the place of (but is not spelled out as) the *first argument* of that function. Using the pipe operator can have the advantage of facilitating the separation, removal, inserting, or re-arranging of individual transformation steps. Arguably, it facilitates reading code, especially for complex data transformation workflows. Therefore, you will encounter the pipe operator frequently throughout the remainder of this course.
> Note: The pipe operator is so popular that has been recently included in the latest versions of base R (version 4.1.0 and beyond). This is the `|>` pipe we just introduced. Nevertheless, you may encounter the `%>%` operator, which is the original pipe from the {magrittr} package (part of the {tidyverse}).
> Note: If you cannot update R and have a version older than 4.1.0, just use the `magrittr` pipe `%>%` throughout. This can happen if you have an older Macbook that can't operate the latest Operating System version.
Mutating both our timestamp variables could be written as `mutate(TIMESTAMP_START = ymd_hm(TIMESTAMP_START), TIMESTAMP_END = ymd_hm(TIMESTAMP_END))`. Sometimes, such multiple-variable mutate statements can get quite long. A handy short version of this can be implemented using `across()`:
```{r}
half_hourly_fluxes <- half_hourly_fluxes |>
mutate(across(starts_with("TIMESTAMP_"), ymd_hm))
```
We will encounter more ways to use mutate later in this tutorial. A complete reference to `mutate()` is available [here](https://r4ds.had.co.nz/transform.html#add-new-variables-with-mutate).
If you only want to change the name of a variable, but not modify its values, you can do so with the {dplyr} function `rename()`.
### Axes of variation
Tabular data are two-dimensional (rows $\times$ columns), but not all two-dimensional data are tabular. For example, raster data are a two-dimensional array of data (a matrix) representing variables on an evenly spaced grid, for example pixels in remotely sensed imagery. For example the `volcano` data (provided as an example dataset in R) is a matrix - each column contains the same variable, and no variable names are provided.
```{r}
volcano[1:5, 1:5]
```
In the `volcano` dataset, rows and columns represent different geographic positions in latitude and longitude, respectively. The `volcano` data are not tabular data. Another typical example for non-tabular data are climate model outputs. They are typically given as *arrays* with more than two dimensions. Typically, this is longitude, latitude, and time, and sometimes a vertical dimension representing, for example, elevation. Such data are multi-dimensional and, as such, not tabular.
Tabular data, although formatted in two dimensions by rows and columns, may represent data that varies along multiple axes. Most environmental data are *structured*, that is, values of "nearby" observations tend to be more similar than values of "distant" observations. Here, "nearby" and "distant" may refer to a spatial distance, but not necessarily so. Structure in data arises from similarity of the subjects generating the data (e.g., evapotranspiration over two croplands may be more similar than evapotranspiration over a forest), or from temporal proximity. In biological data, there may be a genetic structure arising from evolutionary relatedness ([Roberts et al., 2016](https://onlinelibrary.wiley.com/doi/10.1111/ecog.02881)). Note also that temporal proximity is more complex than than being governed by a single dimension - time. In environmental data, time is often expressed through periodically varying conditions (the diurnal and seasonal cycles). It's often critical to understand and account for the structure in data when analysing it and using it for model fitting. Challenges are posed when structure is not apparent or not known.
Note also that some structures are *hierarchical*. For example, data may be structured by postal codes within cantons; or by hours within a day within a year. Biological data may be generated by species within *genera* within *families*. Data from experiments is typically structured as *samples* within *treatments*. You see, structure in data is rather the rule than the exception.
Our example data contains values recorded at each half-hourly time interval over the course of eleven years (check by `nrow(half_hourly_fluxes)/(2*24*365)`). The data are recorded at a site, located in the temperate climate zone, where solar radiation and therefore also other meteorological variables and ecosystem fluxes vary substantially over the course of a day and over the course of a year. Although not explicitly separated, the date-time object thus encodes information along multiple *axes of variation* in the data. For example, over the course of one day (`2*24` rows in our data), the shortwave incoming radiation `SW_IN_F` varies over a typical diurnal cycle:
```{r}
plot(
half_hourly_fluxes[1:(2*24),]$TIMESTAMP_START,
half_hourly_fluxes[1:(2*24),]$SW_IN_F,
type = "l"
)
```
> Note: `plot()` is the very basic of plotting data. In Chapter \@ref(datavis), you will get introduced to additional methods for visualising data. The argument `type = "l"` indicates that we want a line plot, rather than points.
Over the course of an entire year, shortwave incoming radiation varies with the seasons, peaking in summer:
```{r}
plot(
half_hourly_fluxes[1:(365*2*24),]$TIMESTAMP_START,
half_hourly_fluxes[1:(365*2*24),]$SW_IN_F,
type = "l"
)
```
All data frames have two dimensions, rows and columns. Our data frame is organised along half-hourly time steps in rows. As described above, these time steps belong to different days, months, and years, although these "axes of variation" are not reflected by the structure of the data frame object itself and we do not have columns that indicate the day, month or year of each half-hourly time step. This would be redundant information since the date-time objects of columns `TIMESTAMP_*` contain this information. However, for certain applications, it may be useful to separate information regarding these axes of variation more explicitly. For example by:
```{r}
half_hourly_fluxes |>
mutate(year = year(TIMESTAMP_START),
month = month(TIMESTAMP_START),
doy = yday(TIMESTAMP_START) # day of year
) |>
select(TIMESTAMP_START, TIMESTAMP_END, year, month, doy) # for displaying
```
Note that we used `mutate()` here to create a new variable (column) in the data frame, as opposed to above where we overwrote an existing variable with the same function.
### Tidy data {#tidydata}
Data comes in many forms and shapes. For example, Excel provides a playground for even the wildest layouts of information in some remotely tabular form and merged cells as we will see in the [Exercises](#exerciseswrangling). A data frame imposes a relatively strict formatting in named columns of equal length. But even data frames can come in various shapes - even if the information they contain is the same.
```{r echo=FALSE}
## this is hidden - used to create the objects printed in next chunk
## using the ts object 'co2'
co2_concentration <- tibble(co2_concentration = as.vector(co2),
year_dec = time(co2)
) |>
dplyr::mutate(month_dec = year_dec - as.integer(year_dec)) |>
dplyr::mutate(year = as.integer(year_dec - month_dec)) |>
dplyr::mutate(month = as.integer(month_dec * 12)) |>
dplyr::mutate(date = lubridate::ymd(paste0(as.character(year), as.character(month + 1), "-15"))) |>
dplyr::mutate(month = month(date, label = TRUE)) |>
dplyr::select(year, month, co2_concentration) |>
dplyr::slice(1:36)
co2_concentration_monthly <- co2_concentration |>
pivot_wider(names_from = month, values_from = co2_concentration)
co2_concentration_yearly <- co2_concentration |>
# mutate(year = as.character(year)) |>
pivot_wider(names_from = year, values_from = co2_concentration)
```
```{r}
co2_concentration
co2_concentration_monthly
co2_concentration_yearly
```
There are advantages for interoperability and ease of use when data frames come with consistent layouts, adhering to certain design principles. We have learned that in tabular data, each row contains the same number of columns and each column contains the same type of values (for example numeric, or characters). And that each row can be regarded as a separate instance of the same type, for example a record of simultaneously taken measurements, along with some attributes. Following these principles strictly leads to *tidy data*. In essence, quoting [Wickham and Grolemund (2017)](https://r4ds.had.co.nz/tidy-data.html), data are tidy if:
- Each variable has its own column.
- Each observation has its own row.
- Each value has its own cell.
```{r label="tidy", echo=FALSE, fig.cap="Rules for tidy data. Figure from [Wickham and Grolemund (2017)](https://r4ds.had.co.nz/tidy-data.html)", out.width="100%", fig.align='center'}
knitr::include_graphics("./figures/tidy_data.png")
```
The {tidyr} package provides powerful functions to make data tidy. In the examples above, `co2_concentration_monthly` and and `co2_concentration_yearly` are not tidy. In `co2_concentration_monthly`, the same variable (CO2 concentration) appears in multiple columns. Organising columns by months leads to a "wide" table format. We can convert it to a "long" format by:
```{r}
co2_concentration_monthly |>
pivot_longer(cols = 2:13, names_to = "month", values_to = "co2")
```
This corresponds to the format of `co2_concentration` and is tidy. A long format of data frames is required to visualise data using the plotting functions of the {ggplot2} package which will be introduced in Chapter \@ref(datavis).
Either way, for certain applications, it may be advantageous to work with a wide format. We can convert from a long to a wide format by:
```{r}
co2_concentration |>
pivot_wider(names_from = year, values_from = co2_concentration)
```
When seeking, for example, the average CO2 concentration for each month, you may be tempted to work with a wide data frame and treat it as a matrix to calculate a mean by rows. You can do so, but then, you leave the `tidyverse`. This will complicate your life. You'll learn how to perform tidy data aggregation below.
The concept of tidy data can even be taken further by understanding a "value" as any object type, e.g., a list or a data frame. This leads to a list or data frame "nested" within a data frame. You will learn more about this below.
### Aggregating data
Aggregating data refers to collapsing a larger set of values into a smaller set of values that are derived from the larger set. For example, we can aggregate over all $N$ rows in a data frame ($N\times M$), calculating the sum for each of the $M$ columns. This returns a data frame ($1 \times M$) with the same number of columns as the initial data frame, but only one row. Often, aggregations are done not across all rows but for rows within $G$ groups of rows. This yields a data frame ($G \times M$) with the number of rows corresponding to the number of groups.
Let's say we want to calculate the mean of half-hourly shortwave radiation within each day. We thus have $N$ half-hourly time steps in $G$ days. That is, to aggregate our half-hourly data to daily data by taking a mean. There are two pieces of information needed for an aggregation step: The factor (or "axis of variation"), here days, that groups a vector of values for collapsing it into a single value, and the function used for collapsing values, here, the `mean()` function. This function should take a vector as an argument and return a single value as an output. These two steps are implemented by the {dplyr} functions `group_by()` and `summarise()`. The entire aggregation workflow is implemented by the following code:
```{r}
daily_fluxes <- half_hourly_fluxes |>
mutate(date = as_date(TIMESTAMP_START)) |> # converts the ymd_hm-formatted date-time object to a date-only object (ymd)
group_by(date) |>
summarise(SW_IN_F = mean(SW_IN_F))
```
The seasonal course can now be more clearly be visualized with the data aggregated to daily values.
```{r}
plot(daily_fluxes[1:365,]$date, daily_fluxes[1:365,]$SW_IN_F, type = "l")
```
We can also apply multiple aggregation functions to different variables simultaneously. In the example below, we aggregate half-hourly data to daily data by...
- taking the daily mean GPP
- counting the number of half-hourly data points by day
- counting the number of measured (not gap-filled) data points
- taking the mean shortwave radiation
Finally, we calculate the fraction of measured underlying half-hourly data from which the aggregation is calculated and we save the daily data frame as a CSV file for later use.
```{r}
daily_fluxes <- half_hourly_fluxes |>
mutate(date = as_date(TIMESTAMP_START)) |> # converts time object to a date object
group_by(date) |>
summarise(GPP_NT_VUT_REF = mean(GPP_NT_VUT_REF, na.rm = TRUE),
n_datapoints = n(), # counts the number of observations per day
n_measured = sum(NEE_VUT_REF_QC == 0), # counts the number of actually measured data (excluding gap-filled and poor quality data)
SW_IN_F = mean(SW_IN_F, na.rm = TRUE), # we will use this later
.groups = 'drop' # to un-group the resulting data frame
) |>
mutate(f_measured = n_measured / n_datapoints) # calculate the fraction of measured values over total observations
write_csv(daily_fluxes, file = "data/daily_fluxes.csv")
daily_fluxes
```
Note that above, we specified the argument `.groups = 'drop'` to "un-group" the resulting data frame. The same can also be achieved by a separate function call `ungroup()` after the `summarise()` step.
More info on how to group values using summarise functions [here](https://r4ds.had.co.nz/transform.html#grouped-summaries-with-summarise), or a summary on the inputs the function [group_by()](https://dplyr.tidyverse.org/reference/group_by.html) and [summarise()](https://dplyr.tidyverse.org/reference/summarise.html) take.
Aggregating is related to *nesting* performed by the {tidyr} function `nest()`:
```{r message=FALSE}
half_hourly_fluxes |>
mutate(date = as_date(TIMESTAMP_START)) |>
group_by(date) |>
nest()
```
Here, the data frame has one row per date and therefore the same number of rows as the data frame `daily_fluxes`, but the data itself is not reduced by a summarising function. Instead, the data are kept at the half-hourly level, but it's nested inside the new column `data`, which now contains a list of half-hourly data frames for each day. This is just a brief perspective of what nesting is about. More is explained in Section \@ref(extramaterialwrangling). More comprehensive tutorials on nesting and functional programming are provided in [Altman, Behrman and Wickham (2021)](https://dcl-prog.stanford.edu/) or in [Wickham and Grolemund (2017), Chapter 21](https://r4ds.had.co.nz/iteration.html).
### Data cleaning
Data cleaning is often a time-consuming task and decisions taken during data cleaning may be critical for analyses and modelling. In the following, we distinguish between cleaning formats, the identification (and removal) of "bad" data, and the gap-filling of missing or removed data. An excellent resource for further reading is the [Quartz Guide to Bad Data](https://github.com/Quartz/bad-data-guide) which provides an overview of how to deal with different types of bad data.
#### Cleaning formats
As a general principle, we want to have *machine readable* data. Key for achieving machine-readability is that a cell should only contain one value of one type. Hence, for example, character strings should be kept in separate columns (as separate variables) from numeric data. Character strings can impose particular challenges for achieving machine-readability. Typically, they encode categorical or ordinal information, but are prone to spelling inconsistencies or errors that undermine the ordering or categorization. Here are typical examples for challenges working with character strings and lessons for avoiding problems:
- Often, character strings encode the units of a measurement, and entries may be `c("kg m-2", "kg/m2", "Kg / m2", "1000 g m-2")` . They are all equivalent, but "the machine" treats them as non-identical. To clean such data, one may compile a lookup-table to identify equivalent (but not identical) strings. Much better is to specify a consistent treatment of units before data collection.
- Even if the data are clean and contain a consistently spelled categorical variable in the form of a character string, R doesn't necessarily treat it as categorical. For certain downstream steps of the workflow, it may be necessary to transform such a variable to one of type `factor`. For example, as entries of an unordered categorical variable, we have `unique(df$gender) = c("female", "male", "non-binary")`. To treat them as categorical and not just mere character strings, we would have to do:
```{r, eval=FALSE}
df <- df |> dplyr::mutate(gender = as.factor(gender))
```
- Character strings may encode ordinal information. For example, entries specify quality control information and are one of `c("good quality", "fair quality, "poor quality")`. A challenge could be that the spelling is inconsistent (`c("Good quality", "good quality", …)`). Using integers (positive natural numbers) instead of character strings avoids such challenges and enforces an order. The quality control variable `NEE_VUT_REF_QC` in our example dataset `half_hourly_fluxes` follows this approach:
```{r}
unique(half_hourly_fluxes$NEE_VUT_REF_QC)
```
- An entry like `>10 m` is not a friend of a data scientist. Here, we have three pieces of information: `>` as in "greater than", `10`, and `m` indicating the units. A machine-readable format would be obtained by creating separate columns for each piece of information. The `>` should be avoided already at the stage of recording the data. Here, we may have to find a solution for encoding it in a machine readable manner (see [Exercises](#exerciseswrangling)).
<!-- - Can you think of more such examples? (-\> Exercises) -->
String manipulations are usually required for cleaning data. The Section \@ref(#strings) below demonstrates some simple examples.
Note that a majority of machine learning algorithms and other statistical model types require all data to be numeric. Methods exist to convert categorical data into numeric data, as we will learn later. We re-visit data cleaning in the form of data *pre-processing* as part of the modelling workflow in Chapter \@ref(supervisedmli).
#### Bad data {#baddata}
Data may be "bad" for different reasons, including sensor error, human error, a data point representing a different population, or unsuitable measurement conditions. In this sense, data are "bad" if they don't represent what they are assumed by the user to represent. Its presence in analyses and modelling may undermine the model skill or even lead to spurious results. A goal of data cleaning typically is to remove bad data. But how to detect them? And how safe is it to remove them?
A diversity of processes may generate bad data and it is often not possible to formulate rules and criteria for their identification *a priori*. Therefore, an understanding of the data and the data generation processes is important for the identification and treatment of bad data. Often, such an understanding is gained by repeated exploratory data analysis cycles, involving the visualization, transformation, and analysis of the data.
Ideally, information about the quality of the data are provided as part of the dataset. Also other meta-information (e.g., sensor type, human recording the data, environmental conditions during the data collection) may be valuable for data cleaning purposes. In our example dataset, the column with suffices `_QC` provide such information (see 'Example data' section above) and an example for their use in data cleaning is given further below.
Bad data may come in the form of *outliers*, which are commonly defined based on their value with respect to the *distribution* of all values of the same variable in a dataset. Hence, their identification most commonly relies on quantifying their distance from the center of the variable's empirical distribution. The default `boxplot()` plotting function in R (which we will learn about more in Chapter \@ref(datavis)) shows the median (bold line in the center), the upper and lower quartiles (corresponding to the 25% and the 75% quantiles, often referred to as $Q_1$ and $Q_3$ , given by the upper and lower edge of the box plot) and the range of $( Q_1 - 1.5 (Q_3 - Q_1), Q_3 + 1.5 (Q_3 - Q_1))$. Any point outside this range is plotted by a circle and labeled an "outlier". However, this definition is very restrictive and may lead to a false labeling of outliers, in particular if they are drawn from a distribution with a fat tail or from asymmetrical distributions.
Outliers may also be identified via multivariate distributions. We will re-visit such methods later, in Chapter \@ref(regressionclassification). For certain applications, outliers or *anomalies* may be the target of the investigation, not the noise in the data. This has spurred the field of [*anomaly detection*](https://en.wikipedia.org/wiki/Anomaly_detection) which relies on machine learning algorithms for determining whether a value is anomalous, given a set of covariates.
Sensor error or algorithm error may generate spurious values, identified, for example when a continuous variable attains the numerically identical value with a spuriously high frequency.
```{r}
half_hourly_fluxes$GPP_NT_VUT_REF |>
table() |>
sort(decreasing = TRUE) |>
head()
```
The probability of a certain numeric value of a continuous variable to appear twice in a dataset is practically zero. Here, several values appear multiple times - the value `5.18422` even 32 times! This must be bad data.
Other processes may lead to spurious trends or *drift* in the data, for example caused by sensor degradation. Spurious *step changes* or *change points* in time series or in (multivariate) regressions may be related to the replacement or deplacement of the measuring device. Different methods and R libraries help identifying such cases (see for example [this](https://www.marinedatascience.co/blog/2019/09/28/comparison-of-change-point-detection-methods/) tutorial). Solutions have to be found for the remediation of such spurious patterns in the data on a case-by-case basis.
#### Handling missing data
The question about when data are "bad" and whether to remove it is often critical. Such decisions are important to keep track of and should be reported as transparently as possible in publications. In reality, where the data generation process may start in the field with actual human beings writing notes in a lab book, and where the human collecting the data is often not the same as the human analyzing the data or writing the paper, it's often more difficult to keep track of such decisions. As a general principle, it is advisable to design data records such that decisions made during the data collection process remain transparent throughout all stages of the workflow and that sufficient information be collected to enable later revisions of particularly critical decisions. In practice, this means that the removal of data and entire rows should be avoided and implemented only at the very last step if necessary (e.g., when passing the data into a model fitting function). Instead, information about whether data are bad should be kept in a separate, categorical, variable (a *quality control* variable, like `*_QC` variables in our example data `half_hourly_fluxes`).
Data may be missing for several reasons. Some yield random patterns of missing data, others not. In the latter case, we can speak of *informative missingness* ([Kuhn and Johnson, 2019](http://www.feat.engineering/)) and its information can be used for modelling. For categorical data, we may replace such data with `"none"` (instead of `NA`). Some machine learning algorithms (mainly tree-based methods, e.g., Random Forest) can handle missing values. However, when comparing the performance of alternative ML algorithms, they should be tested with the same data and removing missing data should be done beforehand.
Most machine learning algorithms require missing values to be removed. That is, if any of the cells in one row has a missing value, the entire cell gets removed. This generally leads to a loss of information contained in the remaining variables. Methods exist to *impute* missing values in order to avoid this information loss. However, the gain of data imputation has to be traded off against effects of associating the available variables with the imputed (knowingly wrong) values, and effects of *data leakage* have to be considered. Data imputation as part of the modelling process will be dealt with in Chapter \@ref(supervisedmli).
In our example dataset, some values of `SWC_F_MDS_*` are given as `-9999`.
```{r}
half_hourly_fluxes |>
select(TIMESTAMP_START, starts_with("SWC_F_MDS_")) |>
head()
```
When reading the documentation of this specific dataset, we learn that `-9999` is the code for *missing data*. The {dplyr} functions help us to clarify these missing values by mutating across all numeric variables and overwrite entries with `NA` if they hold a `-9999`.
```{r}
half_hourly_fluxes <- half_hourly_fluxes |>
mutate(across(where(is.numeric), ~na_if(., -9999)))
half_hourly_fluxes |>
select(TIMESTAMP_START, starts_with("SWC_F_MDS_")) |>
head()
```
This lets us visualise the data and its gaps with `vis_miss()` from the {visdat} package. Visualising missing data can be informative for making decisions about dropping rows with missing data versus removing predictors from the analysis (which would imply too much data removal).
```{r warning=FALSE, message=FALSE}
visdat::vis_miss(
half_hourly_fluxes |> slice(1:10000),
cluster = FALSE,
warn_large_data = FALSE
)
```
For many applications, we want to filter the data so that the values of particular variables satisfy certain conditions. The {dplyr} function used for such tasks is `filter()`. As argument, it takes the expressions that specify the criterion for filtering using logical operators (`>, >=, <, ==, !-, ...`, see Chapter \@ref(gettingstarted)). Multiple filtering criteria can be combined with logical (boolean) operators:
- `&`: logical AND
- `|`: logical OR
- `!` logical NOT
For example, if we wanted only those rows in our data where NEE is based on measured or good quality gap-filled NEE data, we write:
```{r eval=FALSE}
half_hourly_fluxes |>
filter(NEE_VUT_REF_QC == 0 | NEE_VUT_REF_QC == 1)
```
For evaluating multiple OR operations simultaneously, we can write alternatively and equivalently:
```{r eval=FALSE}
half_hourly_fluxes |>
filter(NEE_VUT_REF_QC %in% c(0,1))
```
Note that `filter()` removes entire rows. In some cases this is undesired and it is preferred to replace bad quality values with `NA`. It is important to note that specifying a value as missing is information itself. Dropping an entire row leads to the loss of this information. For cases where we do not want to drop entire rows when applying `filter()`, we can just replace certain values with `NA`. In our case, where we want to retain only data where NEE is based on actual measurements or good quality gap-filling, we can do this by:
```{r eval=FALSE}
half_hourly_fluxes |>
mutate(GPP_NT_VUT_REF = ifelse(NEE_VUT_REF_QC %in% c(0,1), GPP_NT_VUT_REF, NA))
```
If we decide to drop a row containing `NA` in any of the variables later during the workflow, we can do this, for example using the useful {tidyr} function `drop_na()`.
```{r eval=FALSE}
half_hourly_fluxes |>
drop_na()
```
An excellent source for a more comprehensive introduction to missing data handling is given in [Kuhn and Johnson (2019)](https://bookdown.org/max/FES/handling-missing-data.html).
### Writing data to CSV
After having applied some data reduction and cleaning steps above, let's save the data frame in the form of a CSV file for use in later chapters.
```{r}
write_csv(half_hourly_fluxes, file = "data/FLX_CH-Lae_FLUXNET2015_FULLSET_HH_2004-2006_CLEAN.csv")
```
> Note: Making a file publicly available as a `.rds` or `.RData` file (explained in Chapter \@ref(gettingstarted)) violates the open science principles (introduced in Chapter \@ref(openscience)). These two formats make it very easy to save R objects related to your analysis project, but are not adequate to save *data*. Therefore, whenever possible, save your data in a format that is readable across platforms without requiring proprietary software. Hence use `write_csv()` from {readr} whenever possible. We will encounter other non-proprietary formats that let you save and share more complex data structures in Chapter \@ref(datavariety).
### Combining relational data
Often, data are spread across multiple files and tables and need to be combined for the planned analysis. In the simplest case, data frames have an identical number of columns, arranged in the same order, and we can "stack" them along rows:
```{r echo=FALSE}
co2_conc_subset_1 <- co2_concentration_yearly[1:6,]
co2_conc_subset_2 <- co2_concentration_yearly[7:12,]
```
```{r}
co2_conc_subset_1
co2_conc_subset_2
bind_rows(co2_conc_subset_1, co2_conc_subset_2)
```
In other cases, data frames may have an identical set of rows (and arranged in the same order) and we can "stack" them along columns.
```{r echo=FALSE}
co2_conc_subset_3 <- co2_concentration_yearly[, c(1,2,3)]
co2_conc_subset_4 <- co2_concentration_yearly[, c(1,4)]
```
```{r}
co2_conc_subset_3
co2_conc_subset_4
bind_cols(co2_conc_subset_3, co2_conc_subset_4)
```
But beware! In particular the stacking along columns (`bind_cols()`) is very error-prone and should be avoided. Since a *tidy* data frame regards each row as an instance of associated measurements, the rows of the two data frames and their order must match exactly. Otherwise, an error is raised or (even worse) rows get associated when they shouldn't be. In such cases, where information about a common set of observations is distributed across multiple data objects, we are dealing with *relational data*. The key for their combination (or "merging") is the *join variable* - the column that is present in both data frames and which contains values along which the merging of the two data frames is performed. In our example from above, this is `month`, and we can use the {dplyr} function `left_join()`.
```{r}
co2_conc_subset_3 |>
slice(sample(1:n(), replace = FALSE)) |> # re-shuffling rows
left_join(co2_conc_subset_4, by = "month") |>
arrange(month) # sort in ascending order
```
Note that here, we first re-shuffled (permuted) the rows of `df6` for demonstration purposes, and arranged the output data frame again by `month` - an ordinal variable. `left_join()` is not compromised by the order of the rows, but instead relies on the join variable, specified by the argument `by = "month"`, for associating (merging, joining) the two data frames. In some cases, multiple columns may act as the joining variables in their combination (for example `by = c("year", "month")`).
Other variants of `*_join()` are available as described [here](https://r4ds.had.co.nz/relational-data.html).
## Extra material {#extramaterialwrangling}
### Functional programming I
Above, we read a CSV table into R and applied several data transformation steps. In practice, we often have to apply the same data transformation steps repeatedly over a set of similar objects. This *extra material* section outlines an example workflow for demonstrating how to efficiently work with lists of similar objects - in particular, lists of data frames.
Our aim is to read a set of files into R data frames and apply transformation steps to each data frame separately. Here, we will work with daily data, not half-hourly data. The daily data contains largely identical variables with consistent naming and units as in the half-hourly data (description above). Let's start by creating a list of paths that point to the files with daily data. They are all located in the directory `"./data"` and share a certain string of characters in their file names `"_FLUXNET2015_FULLSET_DD_"`.
```{r}
vec_files <- list.files("./data", pattern = "_FLUXNET2015_FULLSET_DD_", full.names = TRUE)
print(vec_files)
```
> To reproduce this code chunk, you can download the files `FLX_CH-Lae_FLUXNET2015_FULLSET_HH_2004-2006.csv` from [here](https://github.com/geco-bern/agds/tree/main/data) and read it from the local path where the file is stored on your machine.
`vec_files` is now a vector of three files paths as character strings. To read in the three files and combine the three data frames (`list_df` below) into a list of data frames, we could use a `for` loop:
```{r message=FALSE, eval=FALSE}
list_df <- list()
for (ifil in vec_files){
list_df[[ifil]] <- read_csv(ifil)
}
```
Repeatedly applying a function (here `read_csv()`) over a list similar objects is facilitated by the `map*()` family of functions from the {purrr} package. An (almost) equivalent statement is:
```{r message=FALSE}
list_df <- purrr::map(as.list(vec_files), ~read_csv(.))
```
Here, `purrr::map()` applies the function `read_csv()` to elements of a *list*. Hence, we first have to convert the vector `vec_files` to a list. A list is always the first argument within the `purrr::map()` function. Note two new symbols (`~` and `.`). The `~` always goes before the function that is repeatedly applied (or "mapped") to elements of the list. The `.` indicates where the elements of the list would go if spelled out (e.g., here, `read_csv(.)` would be `read_csv("./data/FLX_CH-Dav_FLUXNET2015_FULLSET_DD_1997-2014_1-3.csv")` for the first iteration). The output of `purrr::map()` is again a list. There are many variants of the function `purrr::map()` that each have a specific use. A complete reference for all {purrr} functions is available [here](https://purrr.tidyverse.org/reference/index.html). A useful and more extensive tutorial on {purrr} is available [here](https://www.r-bloggers.com/one-stop-tutorial-on-purrr-package-in-r/).
The above `purrr::map()` call does not return a *named* list as our `for` loop created. But we can give each element of the returned list of data frames different names by:
```{r}
names(list_df) <- vec_files # this makes it a named list
```
Next, we will apply a similar data cleaning procedure to this data set as we did above for half-hourly data. To do so, we "package" the individual cleaning steps into a function ...
```{r}
# function definition
clean_data_dd <- function(df){
df <- df |>
# select only the variables we are interested in
dplyr::select(
TIMESTAMP,
ends_with("_F"),
GPP_NT_VUT_REF,
NEE_VUT_REF_QC,
starts_with("SWC_F_MDS_"),
-contains("JSB")) |>
# convert to a nice date object
dplyr::mutate(TIMESTAMP = lubridate::ymd(TIMESTAMP)) |>
# set all -9999 to NA
dplyr::mutate(across(where(is.numeric), ~na_if(., -9999)))
return(df)
}
```
... and apply this function to each data frame within our list of data frames:
```{r}
list_df <- purrr::map(list_df, ~clean_data_dd(.))
```
Having different data frames as elements of a list may be impractical. Since we read in similarly formatted files and selected always the same variables in each data frame, all elements of the list of data frames `list_df` share the same columns. This suggests that we can collapse our list of data frames and "stack" data frames along rows. As described above, this can be done using `bind_rows()` and we can automatically create a new column `"siteid"` in the stacked data frame that takes the name of the corresponding list element.
```{r}
daily_fluxes_allsites <- bind_rows(list_df, .id = "siteid")
daily_fluxes_allsites
```
A visualisation of missing data indicates that soil water content data (`SWC_F_MDS_*`) are often missing.
```{r }
# create a subset of the data
daily_fluxex_subset <- daily_fluxes_allsites |>
slice(1:10000)
# visualize missing data
visdat::vis_miss(
daily_fluxex_subset,
cluster = FALSE,
warn_large_data = FALSE
)
```
### Strings {#strings}
The column `siteid` currently contains strings specifying the full paths of the files that were read in earlier. The next task is to extract the site name from these strings. The file names follow a clear pattern (this also highlights why naming files wisely can often make life a lot simpler).
```{r}
daily_fluxes_allsites$siteid |> head()
```
The paths each start with the subdirectory where they are located (`"./data/"`), then `"FLX_"`, followed by the site name (the first three entries of the table containing data from all sites are for the site `"CH-Dav"`), and then some more specifications, including the years that respective files' data cover.
The [{stringr}](https://stringr.tidyverse.org/) package (part of tidyverse) offers a set of functions for working with strings. [Wikham and Grolemund (2017)](https://r4ds.had.co.nz/strings.html) provide a more comprehensive introduction to working with strings. Here, we would like to extract the six characters, starting at position 12. The function `str_sub()` does that job.
```{r}
vec_sites <- str_sub(vec_files, start = 12, end = 17)
head(vec_sites)
```
We can use this function to mutate all values of column `"siteid"`, overwriting it with just these six characters.
```{r}
daily_fluxes_allsites <- daily_fluxes_allsites |>
mutate(siteid = str_sub(siteid, start = 12, end = 17))
daily_fluxes_allsites
```
### Functional programming II
Functions can be applied to a list of objects of any type. Therefore, `purrr::map()` is a powerful approach to "iterating" over multiple instances of the same object type and can be used for all sorts of tasks. In the following, list elements are data frames of daily data and the function `lm()` fits a linear regression model of GPP versus shortwave radiation to each sites' data. We'll learn more about fitting statistical models in R in Chapter \@ref(regressionclassification).
```{r}
list_linmod <- purrr::map(list_df, ~lm(GPP_NT_VUT_REF ~ SW_IN_F, data = .))
```
Note how the `.` indicates where the elements of `list_df` go when evaluating the `lm()` function. This returns a list of linear model objects (the type of objects returned by the `lm()` function call).
We can spin the functional programming concept further and apply (or map) the `summary()` function to the `lm`-model objects to get a list of useful statistics and metrics, and then further extract the element `"r.squared"` from that list as:
```{r}
list_linmod |>
purrr::map(summary) |> # apply the summary() function to each list element
map_dbl("r.squared") # extract R-squared from the list generated by summary()
```
`map_dbl()` is a variant of the `purrr::map()` function that returns not a list, but a vector of numeric values of class "double" (hence, the name `_dbl`). Note further, that providing a character (`"r.squared"`) as an argument instead of an (unquoted) function name, `purrr::map()` extracts the correspondingly named list element, instead of applying a function to a list element.
When writing code for an analysis, it's useful, if not essential, to understand the objects we're working with, understand its type and shape, and make sense of the results of simple `print <object>` statements. Data frames are particularly handy as they provide an organisation of data that is particularly intuitive (variables along columns, observations along rows, values in cells). Here, we're dealing with a list of linear model objects. Can such a list fit into the paradigm of *tidy* data frames?
Yes, they can. Think of the linear model objects as 'values'. Values don't necessarily have to be scalars, but they can be of any type (class).
```{r}
tibble::tibble(
siteid = vec_sites,
linmod = list_linmod
)
```
The fact that cells can contain any type of object offers a powerful concept. Instead of a linear model object as in the example above, each cell may even contain another data frame. In such a case, we say that the data frame is no longer *flat*, but *nested*.
The following creates a nested data frame, where the column `data` is defined by the list of data frames read from files above (`list_df`).
```{r}
tibble::tibble(
siteid = vec_sites,
data = list_df
)
```
We can achieve the same result by directly nesting the flat data frame holding all sites' data (`daily_fluxes_allsites`). This is done by combining the `group_by()`, which we have encountered above when aggregating using `summarise()`, with the function `nest()` from the {tidyr} package.
```{r}
daily_fluxes_allsites |>
group_by(siteid) |>
nest()
```
The function `nest()` names the nested data column automatically `"data"`.
This structure is very useful. For example, for applying functions over sites' data frames separately (and not over the entire data frame). By combining `purrr::map()` and `mutate()`, we can fit linear models on each site's data frame individually in one go.
```{r eval=FALSE}
daily_fluxes_allsites |>
group_by(siteid) |>
nest() |>
dplyr::mutate(linmod = purrr::map(data, ~lm(GPP_NT_VUT_REF ~ SW_IN_F, data = .)))
```
This approach is extremely powerful and lets you stick to working with tidy data frames and use the rows-dimension flexibly. Here, rows are sites and no longer time steps, while the nested data frames in column `"data"` have time steps along their rows. The power of nesting is also to facilitate complex aggregation steps over a specified dimension (or axis of variation, here given by `siteid`), where the aggregating function is not limited to taking a vector as input and returning a scalar, as is the case for applications of `summarise()` (see above).
Combining the steps described above into a single workflow, we have:
```{r}
daily_fluxes_allsites_nested <- daily_fluxes_allsites |>
group_by(siteid) |>
nest() |>
dplyr::mutate(linmod = purrr::map(data, ~lm(GPP_NT_VUT_REF ~ SW_IN_F, data = .))) |>
dplyr::mutate(summ = purrr::map(linmod, ~summary(.))) |>
dplyr::mutate(rsq = map_dbl(summ, "r.squared")) |>
arrange(desc(rsq)) # to arrange output, with highest r-squared on top
daily_fluxes_allsites_nested
```
This code is a demonstration of the power of tidy and nested data frames and for the clarity of the {tidyverse} syntax.
Nesting is useful also for avoiding value duplication when joining relational data objects. Above, we nested time series data objects (where time steps and sites are both organised along rows) by sites and got a data frame where only sites are organised along rows, while time steps are nested inside the column `"data"`. This now fits the structure of a relational data object (`siteinfo_fluxnet2015`) containing site-specific meta information (also with only sites along rows).
```{r}
base::load("data/siteinfo_fluxnet2015.rda") # loads siteinfo_fluxnet2015
```
Joining the nested data frame with site meta information results in a substantially smaller and much handier data frame compared to an alternative, where the site meta information is joined into the un-nested (daily) data frame, and therefore duplicated for each day within sites.
```{r}
daily_fluxes_allsites_nested_joined <- siteinfo_fluxnet2015 |>
rename(siteid = sitename) |>
right_join(
select(daily_fluxes_allsites_nested, -linmod, -summ, -rsq),
by = "siteid"
)
daily_fluxes_allsites_joined <- siteinfo_fluxnet2015 |>
rename(siteid = sitename) |>
right_join(
daily_fluxes_allsites,
by = "siteid"
)
print(paste("Flat and joined:",
format(object.size(daily_fluxes_allsites_joined),
units = "auto",
standard = "SI")))
print(paste("Nested and joined:",
format(object.size(daily_fluxes_allsites_nested_joined),
units = "auto",
standard = "SI")))
# save for later use
write_rds(
daily_fluxes_allsites_nested_joined,
file = "data/daily_fluxes_allsites_nested_joined.rds"
)
```
## Exercises {#exerciseswrangling}
> *Hint*: For all exercises remember the resources we provided on finding help in section \@ref(findinghelp).
### Star wars {.unnumbered}
{dplyr} comes with a toy dataset `dplyr::starwars` (just type it into the console to see its content). Have a look at the dataset with `View()`. Play around with the dataset to get familiar with the {tidyverse} coding style. Use (possibly among others) the functions `dplyr::filter()`, `dplyr::arrange()`, `dplyr::pull()`, `dplyr::select()`, `dplyr::desc()` and `dplyr::slice()` to answer the following questions:
- How many pale characters come from the planets Ryloth and Naboo?
- Who is the oldest among the tallest thirty characters?
- What is the name of the smallest character and their starship in "Return of the Jedi"
> Hint: Use `unnest()` to expand columns that contain lists inside cells. The expansion of such columns creates additional rows in the data frame if the cell contained a list with more than one element.
### Aggregating {.unnumbered}
You have learned about aggregating in the {tidyverse}. Let's put it in practice.
- Reuse the code in the tutorial to read, reduce, and aggregate the `half_hourly_fluxes` dataset to the daily scale, calculating the following metrics across half-hourly `VPD_F` values within each day: mean, 25% quantile, and 75% quantile.
- Retain only the daily data for which the daily mean VPD is among the upper or the lower 10% quantiles.
- Calculate the mean of the 25% and the mean of the 75% quantiles of half-hourly VPD within the upper and lower 10% quantiles of mean daily VPD.
### Patterns in data quality {.unnumbered}
The uncleaned dataset `FLX_CH-Lae_FLUXNET2015_FULLSET_HH_2004-2006.csv` holds half-hourly data that is sometimes of poor quality. Investigate whether NEE data quality is randomly spread across hours in a day by calculating the proportion of (i) actually measured data, (ii) good quality gap-filled data, (iii) medium quality data, and (iv) poor quality data within each hour-of-day (24 hours per day).
> Hint: `summarise(total = n())` aggregates by counting the number of values.
> Hint: `summarise(count_0 = sum(x == 0))` aggregates by counting the number of values for which the evaluation is `TRUE`.
Interpret your findings: Are the proportions evenly spread across hours in a day?
Perform an aggregation of the half-hourly GPP data (variable `GPP_NT_VUT_REF`) to daily means of the unmodified data read from file `FLX_CH-Lae_FLUXNET2015_FULLSET_HH_2004-2006.csv`, and from cleaned data where only measured (not gap-filled) data is kept. This yields two data frames with daily GPP data. Calculate the overall mean GPP for the two data frames (across all days in the data frame). Are the overall mean GPP values equal? If not, why?
## Report Exercise {#retidy}
### Analyzing changes in soil organic matter during elevated CO$_2$ experiments {.unnumbered}
Open Science requires that data underlying published research articles is made available upon publication of the article. A separate aspect is the format of the shared data. Is it provided in an open-access data format? How easy is it to use the data for your own analyses? In this exercise, you will encounter data that was made freely available, but not in an open access format. Although it is "nice-looking" data, you will encounter that it is not tidy.
In this exercise, you will investigate the data published by Groeningen et al. (2014), where they looked at how soil carbon stocks may change due to the anthropogenic increase in $CO_2$. They gathered data on changes in the soil organic matter content from experiments where ecosystems were exposed to elevated CO$_2$ concentrations and your task is to have a high-level look at this dataset. So, perform the following steps:
1. Download the data file (`.xlsx`) from the Supplementary Material of the following paper: *Groenigen, Kees Jan van, Xuan Qi, Craig W. Osenberg, Yiqi Luo, and Bruce A. Hungate. "Faster Decomposition Under Increased Atmospheric CO2 Limits Soil Carbon Storage." Science 344, no. 6183 (May 2, 2014): 508--9. <https://doi.org/10.1126/science.1249534>.*
2. Manually clean the data in the tab "Database S1" and save it as a CSV file that can be read into R.
- "Database S1" contains data of soil organic carbon measurements in experiments, where ecosystems are exposed to ambient (low) and elevated (high) CO$_2$ concentrations. The mean soil organic carbon of multiple samples ("n") is recorded within each experiment for different sample dates. Information is provided for the time in years since the start of the experiment ("Time (years)").
<!-- - _Bonus (not part of the assessment): If you feel like challenging yourself, you could try cleaning the data directly within R and never touching Excel! Can you find a way to do this with `readxl::read_xlsx()`?_ -->
3. In RStudio, create RMarkdown file. Then, write your code into the R chunks of the file to aggregate the data per experiment and calculate the log-response ratio within each experiment, as specified below.
- A log-response ratio can be used to quantify the effect that a treatment (e.g., elevated CO$_2$) can have on your target variable $x$ (e.g., soil organic matter content). The log-response ratio can be calculated as: $\text{RR} = \ln \left( \frac{x_\text{elevated}}{x_\text{ambient}} \right)$
- Aggregate data across all experiments for different years since the start of the experiment, distinguishing an early phase (\<3 years since start), a mid-phase (3-6 years since start), and a late phase (\>6 years since start). Calculate the log-response ratio for each phase. Calculate the log-response ratio for each parallel observation of SOC under ambient and elevated CO$_2$, and then aggregate log response ratios by taking their mean (and not the other way round).
- Present your results as tables using the `knitr::kable()` function.
- _Tip: Depending on your Excel settings, your exported csv is separated using `,` or `;`. The `read_csv()` function only works with `,`. So, if your file is separated with `;`, either change the export settings or use `read_csv2()`.Where would it have been useful if more information had been available._
4. Answer the following questions:
- What are the data that you are looking at?
- What do you expect your analysis to show, what is your hypothesis? How should soil organic matter content change under elevated CO$_2$?
- Interpret your results after aggregating the data: What do your final numbers mean? Do they support your initial hypothesis? Why so, why not?
- _Tip: Skim the paper and its references to better understand the data and processes you are looking at!``_
### Deliverables for the report {.unnumbered}
A key learning of this course is that you know how to create reproducible workflows and we pay a lot of attention on this during the assessment. You will learn more about the specific aspects to do so in Chapters \@ref(openscience) and \@ref(codemgmt). So, if the following instructions for submitting this report exercise may sound cryptic, hang tight! It will make more sense as you work through this course.
You have to submit all report exercises via your GitHub repository that you created in \@ref(setupreport). For now, save this report exercise as a RMarkdown file with the name `re_tidy.Rmd` and place that file in a sub-directory called `vignettes` (a directory is simply a folder), which is located in your project directory (in full notation that means save your RMarkdown as `./vignettes/re_tidy.Rmd`). Additionally, produce a HTML version of your solution by knitting your RMarkdown file (see the `Knit` command in the panel below the files tabs in RStudio). Save this HTML file in the same `vignettes` directory (`./vignettes/re_tidy.html`). For your cleaned dataset, pick a sensible name and put it into a sub-direcory called `./data`. As with all other report exercises, make sure that your work is reproducible, for example, by letting another student download your repository from GitHub and knit your RMarkdown files on their computer.
> **Important**: Structure your RMarkdown so that you first load necessary libraries, load your data and then write down your solution. **Remember to access files with relative paths and not with hard-coded paths**. So, access your file via `here::here('data/tidy_data.csv`), where `here` starts at the position of your `.Rproj` file. **Never** write hard-coded paths that looks like this: `"~/Desktop/Studium/.../tidy_data.csv"` or `"C:/Users/.../tidy_data.csv"` (see Chapter \@ref(wspmgmt)).
> Tip: Your results do not have to match the results from the paper. It is important that you make your code reproducible, aggregate the data as instructed, and interpret your results.
> Tip: If you are new to using RMarkdown, check out this [guide](https://www.dataquest.io/blog/r-markdown-guide-cheatsheet/).