% \documentclass[12pt,twocolumn]{article}
% Copernicus stuff
\documentclass[gmd,manuscript]{copernicus}
%\documentclass[gmd,manuscript]{../171128_Copernicus_LaTeX_Package/copernicus} %durack
% page/line labeling and referencing
% from http://goo.gl/HvS9BK
\newcommand{\pllabel}[1]{\label{p-#1}\linelabel{l-#1}}
\newcommand{\plref}[1]{see page~\pageref{p-#1}, line~\lineref{l-#1}.}
% answer environment for reviewer responses
\newenvironment{answer}{\color{blue}}{}
\usepackage{enumitem}
\hypersetup{colorlinks=true,urlcolor=blue,citecolor=red}
% \hypersetup{colorlinks=false}
% \newcommand{\degree}{\ensuremath{^\circ}}
% \newcommand{\order}{\ensuremath{\mathcal{O}}}
% \newcommand{\bibref}[1] { \cite{ref:#1}}
% \newcommand{\pipref}[1] {\citep{ref:#1}}
% \newcommand{\ceqref}[1] {\mbox{CodeBlock \ref{code:#1}}}
% \newcommand{\charef}[1] {\mbox{Chapter \ref{cha:#1}}}
% \newcommand{\eqnref}[1] {\mbox{Eq. \ref{eq:#1}}}
% \newcommand{\figref}[1] {\mbox{Figure \ref{fig:#1}}}
% \newcommand{\secref}[1] {\mbox{Section \ref{sec:#1}}}
% \newcommand{\appref}[1] {\mbox{Appendix \ref{sec:#1}}}
% \newcommand{\tabref}[1] {\mbox{Table \ref{tab:#1}}}
\newcommand{\urlref}[2] {\href{#1}{#2}\footnote{\url{#1}, retrieved \today.}}
\newcommand{\editorial}[1]{\protect{\color{red}#1}}
\runningtitle{WIP Paper Draft \today}
\runningauthor{Balaji et al.}
\begin{document}
\title{Requirements for a global data infrastructure in support of CMIP6}
\Author[1,2]{Venkatramani}{Balaji}
\Author[3]{Karl E.}{Taylor}
\Author[4]{Martin}{Juckes}
\Author[5,4]{Bryan N.}{Lawrence}
\Author[6]{Michael}{Lautenschlager}
\Author[7,2]{Chris}{Blanton}
\Author[8]{Luca}{Cinquini}
\Author[9]{S\'ebastien}{Denvil}
\Author[3]{Paul J.}{Durack}
\Author[10]{Mark}{Elkington}
\Author[9]{Francesca}{Guglielmo}
\Author[9,4]{Eric}{Guilyardi}
\Author[4]{David}{Hassell}
\Author[11]{Slava}{Kharin}
\Author[6]{Stefan}{Kindermann}
\Author[1,2]{Sergey}{Nikonov}
\Author[7,2]{Aparna}{Radhakrishnan}
\Author[6]{Martina}{Stockhause}
\Author[6]{Tobias}{Weigel}
\Author[3]{Dean}{Williams}
\affil[1]{Princeton University, Cooperative Institute of Climate
Science, Princeton NJ, USA}
\affil[2]{NOAA/Geophysical Fluid Dynamics Laboratory, Princeton NJ,
USA}
\affil[3]{PCMDI, Lawrence Livermore National Laboratory, Livermore, CA, USA}
\affil[4]{Science and Technology Facilities Council, Abingdon, UK}
\affil[5]{National Center for Atmospheric Science and University of
Reading, UK}
\affil[6]{Deutsches KlimaRechenZentrum GmbH, Hamburg, Germany}
\affil[7]{Engility Inc., NJ, USA}
\affil[8]{Jet Propulsion Laboratory (JPL), 4800 Oak Grove Drive,
Pasadena, CA 91109, USA}
\affil[9]{Institut Pierre-Simon Laplace, CNRS/UPMC, Paris, France}
\affil[10]{Met Office, FitzRoy Road, Exeter, EX1 3PB, UK}
\affil[11]{Canadian Centre for Climate Modelling and Analysis, Atmospheric Environment Service, University of Victoria, BC, Canada}
% \affil[10]{NCAR}
\correspondence{V. Balaji (\texttt{[email protected]})}
\received{}
\pubdiscuss{} %% only important for two-stage journals
\revised{}
\accepted{}
\published{}
%% These dates will be inserted by Copernicus Publications during the typesetting process.
\firstpage{1}
\maketitle
% \pagebreak
\abstract{The World Climate Research Programme (WCRP)'s Working Group
on Coupled Modelling (WGCM) Infrastructure Panel (WIP) was formed in
2014 in response to the explosive growth in size and complexity of
Coupled Model Intercomparison Projects (CMIPs) between CMIP3
(2005--06) and CMIP5 (2011--12). This article presents the WIP
recommendations for the global data infrastructure needed to support
CMIP design, future growth and evolution. Developed in close
coordination with those who build and run the existing
infrastructure (the Earth System Grid Federation), the
recommendations are based on several principles beginning with the
need to separate requirements, implementation, and operations. Other
important principles include the consideration of
\pllabel{RC2-2}
the diversity of community needs around data -- a \emph{data
ecosystem} -- the importance of provenance, the need for
automation, and the obligation to measure costs and benefits.
This paper concentrates on requirements, recognising the diversity
of communities involved (modelers, analysts, software developers,
and downstream users). Such requirements include the need for
scientific reproducibility and accountability alongside the need
to record and track data usage.
\pllabel{RC1-1}
One key element is to generate a dataset-centric rather than
system-centric focus, with an aim to making the infrastructure less
prone to systemic failure.
With these overarching principles and requirements, the WIP has
produced a set of position papers, which are summarized here. They
provide specifications for managing and delivering model output,
including strategies for replication and versioning, licensing, data
quality assurance, citation, long-term archival, and dataset
tracking. They also describe a new and more formal approach for
specifying what data, and associated metadata, should be saved,
which enables future data volumes to be estimated.
The paper concludes with a future-facing consideration of the global
data infrastructure evolution that follows from the blurring of
boundaries between climate and weather, and the changing nature of
published scientific results in the digital age. }
% \pagebreak
\introduction
\label{sec:intro}
CMIP6 \citep{ref:eyringetal2016a}, the latest Coupled Model
Intercomparison Project (CMIP), can trace its genealogy back to the
Charney Report \citep{ref:charneyetal1979}. This seminal report on the
links between CO$_2$ and climate was an authoritative summary of the
state of the science at the time, and produced findings that have
stood the test of time \citep{ref:bonyetal2013}. It is often noted
\citep[see, e.g.,][]{ref:andrewsetal2012}
\pllabel{RC1-2}
that the range and uncertainty bounds on equilibrium climate
sensitivity generated in this report have not fundamentally changed,
despite the enormous increase in resources devoted to analysing the
problem in the decades since.
Beyond its
\pllabel{RC2-4}
enduring findings on climate sensitivity, the Charney Report also gave
rise to a methodology for the treatment of uncertainties and gaps in
understanding, which has been equally influential, and is in fact the
basis of CMIP itself. The Report can be seen as one of the first uses
of the \emph{multi-model ensemble}. At the time, there were two models
available \pllabel{RC1-3} representing the equilibrium response of the
climate system to a change in CO$_2$ forcing, one from Syukuro
Manabe's group at NOAA's Geophysical Fluid Dynamics Laboratory, and
the other from James Hansen's group at NASA's Goddard Institute for
Space Studies. Then as now, these groups marshaled vast
state-of-the-art computing and data resources to run very challenging
simulations of the Earth system. The Report's results were based on an
ensemble of
\pllabel{RC2-5}
three runs from the Manabe group, \pllabel{RC1-4} labeled M1-M3, and two
from the Hansen group, \pllabel{RC1-5} labeled H1-H2.
The Atmospheric Model Intercomparison Project
\citep[AMIP:][]{ref:gates1992} was one of the first systematic
cross-model comparisons open to anyone who wished to participate.
\pllabel{RC1-6}
By the time of the Intergovernmental Panel on Climate Change (IPCC)'s
First Assessment Report (FAR) in 1990 \citep{ref:houghtonetal1992},
\pllabel{RC1-9}
the process had been formalized. At this stage, there were
\pllabel{RC2-6}
five models participating in the exercise, and some of what
\pllabel{RC2-7}
is now called the ``Diagnosis, Evaluation, and Characterization of
Klima'' \citep[DECK, see][]{ref:eyringetal2016a}
experiments\footnote{``Klima'' is German for ``climate''.} had been
standardized (AMIP, a pre-industrial control, 1\% per year CO$_2$
increase to doubling, etc). The ``scenarios'' had emerged as well, for
a total of
\pllabel{RC2-6b}
five different experimental protocols. Fast-forwarding to today, CMIP6
expects more than 75 models from around 35 modeling centers \citep[in
14 countries, a stark contrast to the US monopoly
in][]{ref:charneyetal1979} to participate in the DECK and historical
experiments \citep[Table~2 of][]{ref:eyringetal2016a}, and some subset
of these to participate in one or more of the 21 MIPs endorsed by the
CMIP Panel \citep[Table~3 of][, now 23 with two new endorsed MIPs
since]{ref:eyringetal2016a}. \pllabel{RC1-7} The MIPs call for over
200 experiments, a considerable expansion over CMIP5.
Alongside the experiments themselves is the data request which
defines, for each CMIP experiment, what output each model should
provide for analysis. The complexity of this data request has also
grown tremendously over the CMIP era. A typical dataset from the FAR
archive (\urlref{https://goo.gl/M1WSJy}{from the GFDL R15 model}) lists
climatologies and time series of two variables, and the dataset size
is about 200~MB. The CMIP6 Data Request \citep{ref:juckesetal2015}
lists literally thousands of variables from the hundreds of
experiments mentioned above. This growth in complexity is testament to
the modern understanding of many physical, chemical and biological
processes which were simply absent from the Charney Report era models.
The simulation output is now a primary scientific resource for
researchers the world over, rivaling the volume of observed weather
and climate data from the global array of sensors and satellites
\citep{ref:overpecketal2011}. Climate science, and observed and simulated
climate data in particular, have now become primary elements in the
``vast machine'' \citep{ref:edwards2010} serving the global climate and
weather enterprise.
% It could be worthwhile to quantify (in $USD) the impact, as forecasting
% in particular has yielded considerable social and economic gains
Managing and sharing this huge amount of data is an enterprise in its
own right -- and the solution established for CMIP5 was the global
Earth System Grid Federation
\citep[ESGF,][]{ref:williamsetal2011a,ref:williamsetal2015}. ESGF was
identified by the WCRP Joint Scientific Committee in 2013 as the
recommended infrastructure for data archiving and dissemination for
the Programme.
\pllabel{RC2-12}
A map of sites participating in the ESGF is shown in
\pllabel{RC2-8}
Figure~\ref{fig:esgf} drawn from
\urlref{https://portal.enes.org/data/is-enes-data-infrastructure/esgf}{IS-ENES
Data Portal}. The sites are diverse and responsive to many national
and institutional missions. With multiple agencies and institutions,
and many uncoordinated and possibly conflicting requirements, the ESGF
itself is a complex and delicate
\pllabel{RC2-10}
artifact to manage.
\begin{figure*}
\begin{center}
\includegraphics[width=175mm]{images/esgf-map-2017.png}
\end{center}
\caption{Sites participating in the Earth System Grid Federation in
May 2017. Figure courtesy IS-ENES Data Portal. }
\label{fig:esgf}
\end{figure*}
The sheer size and complexity of this infrastructure emerged as a
matter of great concern at the end of CMIP5, when the growth in data
volume relative to CMIP3 (from 40~TB to 2~PB, a 50-fold increase in 6
years) suggested the community was on an unsustainable path. These
concerns led to the 2014 recommendation of the WGCM to form an
\emph{infrastructure panel} (based upon
\pllabel{RC2-11}
\urlref{https://goo.gl/FHqbNN}{a proposal} at the 2013 annual
meeting). The WGCM Infrastructure Panel (WIP) was tasked with
examining the global computational and data infrastructure
underpinning CMIP, and improving communication between the teams
overseeing the scientific and experimental design of these globally
coordinated experiments, and the teams providing resources and
designing that infrastructure. The communication was intended to be
two-way: providing input both to the provisioning of infrastructure
appropriate to the experimental design, and informing the scientific
design of the technical (and financial) limits of that infrastructure.
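
As a rough illustration of the pace implied by the CMIP3 and CMIP5
volumes quoted above, and assuming (purely for the sake of arithmetic)
steady exponential growth over the six years, with 1~PB taken as
1000~TB, the annual growth factor $g$ satisfies
\[
  g^{6} = \frac{2000\,\mathrm{TB}}{40\,\mathrm{TB}} = 50,
  \qquad
  g = 50^{1/6} \approx 1.92 ,
\]
so that the archive volume roughly doubled every
$6 \ln 2 / \ln 50 \approx 1.06$ years. The constant-rate assumption is
ours and is only a sketch; actual growth was tied to the discrete CMIP
phases rather than being continuous.
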
This paper provides a summary of the findings by the WIP in the first
three years of activity since its formation in 2014, and the
consequent recommendations -- in the context of existing
organisational and funding constraints.
\pllabel{RC1-Overview-2}
In the text below, we refer to \emph{findings}, \emph{requirements},
and \emph{recommendations}. Findings refer to observations about the
state of affairs: technologies, resource constraints, and the like,
based upon our analysis. Requirements are design goals that have been
shared with those building the infrastructure, such as the ESGF
software stack. Recommendations are our guidance to the community:
experiment designers, modeling centres, and the users of climate data.
\pllabel{RC1-Overview-1}
The intended audience for the paper is primarily the scientific
community around CMIP6. In particular, we aim to show how the
scientific design of CMIP6 as outlined in \cite{ref:eyringetal2016a}
translates into infrastructural requirements. We hope this will be
instructive to creators of multi-model experiments as to the resource
implications of their experimental design, and that it will explain to
data providers (modeling centres) the sometimes opaque requirements
imposed upon them as a requisite for participation. We believe an
explanation may also be useful to those who find data acquisition and
analysis a technical challenge, helping them understand the design of
infrastructure in a
resource-constrained environment. Finally, we hope this will be of
interest to general readers of the journal from other geoscience
fields, illuminating the particular character of global data
infrastructure for climate data, where the community of users far
outstrips, in numbers and diversity, the Earth system modeling community
itself.
In Section~\ref{sec:principles}, the principles and scientific
rationale underlying the requirements for global data infrastructure
are articulated. In Section~\ref{sec:dreq} the CMIP6 Data Request is
covered: standards and conventions, requirements for modeling centers
to process a complex data request, and projections of data volume. In
Section~\ref{sec:licensing}, recent evolution in how data are archived
is reviewed alongside a licensing strategy consistent with current
practice and scientific principle. In Section~\ref{sec:cite} issues
surrounding data as a citable resource are discussed, including the
technical infrastructure for the creation of citable data, and the
documentation and other standards required to make data a first-class
scientific entity. In Section~\ref{sec:replica} the implications of
data replicas and in Section~\ref{sec:version} issues surrounding data
versioning, retraction, and errata are addressed.
Section~\ref{sec:summary} provides an outlook for the future of global
data infrastructure, looking beyond CMIP6 towards a unified view of
the ``vast machine'' for weather and climate computation and data.
\section{Principles and Constraints}
\label{sec:principles}
This section lays out some of the principles and constraints which
have resulted from the evolution of infrastructure requirements since
the first CMIP experiment -- beginning with the historical context.
\subsection{Historical Context}
\label{sec:history}
In the pioneering days of CMIP, the community of participants was
small and well-knit, and all the issues involved in generating
datasets for common analysis from different modeling groups could be
settled by mutual agreement (Ron Stouffer, personal communication).
Analysis was performed by the same community that performed the
simulations. The Program for Climate Model Diagnostics and
Intercomparison (PCMDI), established in 1989, had championed the idea
of more systematic analysis of models, and in close cooperation with
the climate modeling centers, PCMDI assumed responsibility for much of
the day-to-day coordination of CMIP. Until CMIP3, the hosting of
datasets from different modeling groups could be managed at a single
archival site; PCMDI alone hosted the entire 40~TB archive.
From its earliest phases, CMIP grew in importance, and its results
provided a major pillar supporting the periodic Intergovernmental
Panel on Climate Change (IPCC) assessment activity. However, the
explosive growth in the scope of CMIP, especially between CMIP3 and
CMIP5, represented a tipping point in the supporting infrastructure.
Not only was it clear that no one site could manage all the data, the
necessary infrastructure software and operational principles could no
longer be delivered and managed by PCMDI alone.
For CMIP5, PCMDI sought help from a number of partners under the
auspices of the Global Organisation of Earth System Science Portals
(GO-ESSP). In the main, the GO-ESSP partners who became the foundation
members and developers of the Earth System Grid Federation retargeted
existing research funding to develop ESGF. The primary heritage was
the original U.S. Earth System Grid Federation project, but major
components came from new international partners. This meant that many
aspects of the ESGF system began from work which was designed in the
context of different requirements, collaborations, and objectives. At
the beginning, none of the partners had funds for operational support
for the fledgling international federation, and even after the end of
CMIP5 proper, the ongoing ESGF has been sustained primarily by small
amounts of funding at a handful of the ESGF sites. Most ESGF sites
have had little or no formal operational support. Many of the known
limitations of the CMIP5 ESGF -- both in terms of functionality and
performance -- were a direct consequence of this heritage.
With the advent of CMIP6, it was clear that
\pllabel{RC2-14}
a fundamental reassessment would be needed to address the evolving
scientific and operational requirements. That clarity led to the
establishment of the WIP, but it has yet to lead to any formal joint
funding arrangement -- the ESGF and the data nodes within it remain
funded (if at all; many data nodes are marginal activities supported
on a best-effort basis) by national agencies with disparate timescales and
objectives. Several critical software elements also are being
developed on volunteer efforts and shoestring budgets. This finding
has been noted in the US National Academies Report on ``A National
Strategy for Advancing Climate Modeling'' \citep{ref:nasem2012}, which
warned of the consequences of inadequate infrastructure funding.
\subsection{Infrastructural Principles}
\label{sec:infra-principles}
\begin{enumerate}
\item With greater complexity and a globally distributed data
resource, it has become clear that in the design of globally
coordinated scientific experiments, the global computational and
data infrastructure needs to be formally examined as an integrated
element.
The membership of the WIP, drawn as it is from experts in various
aspects of the infrastructure, is a direct consequence of this
requirement for integration. Representatives of modeling centers,
infrastructure developers, and stakeholders in the scientific design
of CMIP and its output comprise the panel membership. One of the
WIP's first acts was to consider three phases in the process of
infrastructure development: \emph{requirements},
\emph{implementation}, and \emph{operations}, all informed by the
builders of workflows at the modeling centers.
\begin{itemize}
\item The WIP, in concert with the CMIP Panel, takes responsibility
to articulate \emph{requirements} for the infrastructure.
\item The \emph{implementation} is in the hands of the
infrastructure developers, principally ESGF for the federated
archive \citep{ref:williamsetal2015}, but also related projects
like Earth System Documentation
\citep[\urlref{https://goo.gl/WNwKD9}{ES-DOC},][]{ref:guilyardietal2013}.
\item In 2016 at the WIP's request, the CMIP6 Data Node
\emph{Operations} Team (CDNOT) was formed.
\pllabel{RC3-22}
It is charged with ensuring that all the infrastructure elements
needed by CMIP6 are properly deployed and actually working as
intended at the sites hosting CMIP6 data. It is also responsible
for the operational aspects of the federation itself, including
specifying what versions of the toolchain are run at every site at
any given time, and organizing coordinated version upgrades across
the federation.
\end{itemize}
Although there is now a clear separation of concerns into
requirements, implementation, and operations, close links are
maintained by cross-membership between the key bodies, including the
WIP itself, the CMIP Panel, the ESGF Executive Committee, and the
CDNOT.
\item\label{broad} With the basic fact of anthropogenic climate change
now well established \citep[see, e.g.,][]{ref:stockeretal2013} the
scientific communities with an interest in CMIP are expanding. For
example, a substantial body of work has begun to emerge to examine
climate impacts. In addition to the specialists in Earth system
science -- who also design and run the experiments and produce the
model output -- those relying on CMIP output now include those
developing and providing climate services, as well as
\emph{consumers} from allied fields studying the impacts of climate
change on health, agriculture, natural resources, human migration,
and similar issues \citep{ref:mossetal2010}. This confronts us with
a \emph{scientific scalability} issue (the data during its lifetime
will be consumed by a community much larger, both in sheer numbers,
and also in breadth of interest and perspective than the Earth
system modeling community itself), which needs to be addressed.
Accordingly, we note the requirement that infrastructure should
ensure maximum transparency and usability for user (consumer)
communities at some distance from the modeling (producer)
communities.
\item\label{repro} While CMIP and the IPCC are formally independent,
the CMIP archive is increasingly a reference in formulating climate
policy. Hence the \emph{scientific reproducibility}
\citep{ref:collinstabak2014} and the underlying \emph{durability}
and \emph{provenance} of data have now become matters of central
importance: being able to trace
\pllabel{RC2-15}
back, long after the fact, from model output to the configuration of
models, and procedures and choices made along the way. This led the
IPCC to require data distribution centers (DDCs) to attempt to
guarantee the archival and dissemination of this data in perpetuity,
and consequently to a requirement in the CMIP context of
achieving reproducibility. Given the use of multi-model ensembles
for both consensus estimates and uncertainty bounds on climate
projections, it is important to document -- as precisely as
possible, given the independent genealogy and structure of many
models -- the details and differences among model configurations and
analysis methods, to deliver both the requisite provenance and the
routes to reproduction.
\item\label{analysis} With the expectation that CMIP DECK experiment
results should be routinely contributed to CMIP, opportunities now
exist for engaging in a more systematic and routine evaluation of
Earth System Models (ESMs). This has led to community efforts to
develop standard metrics of model ``quality''
\citep{ref:eyringetal2016,ref:gleckleretal2016}.
\pllabel{RC2-16}
Typical multi-model analysis has hitherto taken the multi-model
average, assigning equal weight to each model, as the most likely
estimate of climate response. This ``model democracy''
\citep{ref:knutti2010} has been called into question and there is
now a considerable literature exploring the potential of weighting
models by quality \citep{ref:knuttietal2017}. The development of
standard metrics would aid this kind of research.
To that end, there is now a requirement to enable through the ESGF a
framework for accommodating quasi-operational evaluation tools that
could routinely execute a series of standardized evaluation tasks.
This would provide data consumers with a characterization of models
that becomes increasingly systematic over time. It may be some time before a
fully operational system of this kind can be implemented, but
planning must start now.
\pllabel{SC1-1}
In addition, there is an increased interest in climate analytics as
a service \citep{ref:balajietal2011,ref:schnaseetal2017}. This
follows the principle of placing analysis close to the data. Some
centres plan to add resources that combine archival and analysis
capabilities, e.g., NCAR's \urlref{https://goo.gl/sYTxC2}{CMIP
Analysis Platform}, or the UK's JASMIN
\citep{ref:lawrenceetal2013}. There are also new efforts to bring
climate data storage and analysis to the cloud era
\citep[e.g.,][]{ref:duffyetal2015}. Platforms such as
\urlref{http://pangeo-data.org/}{Pangeo} show much promise in this
realm, and widespread experimentation and adoption is encouraged.
\item As the experimental design of CMIP has grown in complexity,
costs both in time and money have become a matter of great concern,
particularly for those designing, carrying out, and storing
simulations. In order to justify commitment of resources to CMIP,
mechanisms to identify costs and benefits in developing new models,
performing CMIP simulations, and disseminating the model output need
to be developed.
To quantify the scientific impact of CMIP, measures are needed to
\emph{track} the use of model output and its value to consumers. In
addition to quantifying usage, it is important to assign credit and to
trace data usage in the literature via data citation. Current practice
is, at best, to cite large data collections provided by a CMIP
participant, or all of CMIP. Accordingly, we note the need for a mechanism to
identify and \emph{cite} data provided by each modeling center.
Alongside the intellectual contribution to model development, which
can be recognized by citation, there is a material cost to centers
in computing and data processing, which is both burdensome
\pllabel{RC1-11}
and poorly understood by those requesting, designing and using the
results from
\pllabel{RC1-12}
CMIP experiments, who might not be in the business of model
development. The criteria for endorsement introduced in CMIP6
\citep[see Table~1 in][]{ref:eyringetal2016a} begin to grapple with
this issue, but the costs still need to be measured and recorded. To
begin documenting these costs for CMIP6, the ``Computational
Performance'' MIP project (CPMIP) \citep{ref:balajietal2017} has
been established, which will \pllabel{RC1-13} measure, among other
things, throughput (simulated years per day) and cost (core-hours
and joules per simulated year) as a function of model resolution and
complexity. Tools for estimating data volumes have also been developed,
see Section~\ref{sec:data-request} below.
\item\label{cmplx} Experimental specifications have become ever more
complex, making it difficult to verify that experiment
configurations conform to those specifications.
\pllabel{RC2-17}
Several modeling centers have encountered this problem in preparing
for CMIP6, noting, for example, the challenging intricacies in
dealing with input forcing data \citep[see][]{ref:duracketal2017},
output variable lists \citep{ref:juckesetal2015}, and crossover
requirements between the endorsed MIPs and the DECK
\citep{ref:eyringetal2016a}. Moreover, these protocols inevitably
evolve over time, as errors are discovered or enhancements proposed,
and centers need to adapt their workflows accordingly.
We note therefore a requirement to encode the protocols to be
directly ingested by workflows, in other words,
\emph{machine-readable experiment design}.
\pllabel{RC1-14}
The intent is to avoid, as far as possible, errors in conformance to
design requirements introduced by the need for humans to transcribe
and implement the protocols, for instance, deciding what variables
to save from what experiments. This is accomplished by encoding most
of the specifications in structured text formats which can be
directly read by the scripts running the model and post-processing,
as explained further below in Section~\ref{sec:dreq}. The
requirement spans all of the \emph{controlled vocabularies} (CVs:
for instance the names assigned to models, experiments, and output
variables) used in the CMIP protocols as well as the CMIP6 Data
Request \citep{ref:juckesetal2015}, which must be stored in
version-controlled, machine-readable formats. Precisely documenting
the \emph{conformance} of experiments to the protocols
\citep{ref:lawrenceetal2012} is an additional requirement.
\item\label{snap} The transition from a unitary archive at PCMDI in
CMIP3 to a globally federated archive in CMIP5 led to many changes
in the way users interact with the archive, which impacts management
of information about users and complicates communications with them.
In particular, a growing number of data users no longer register or
interact directly with the ESGF. Rather they rely on secondary
repositories, often ``snapshots'' of the state of some portion of
the ESGF archive created by others at a particular time (see for
instance the \urlref{https://goo.gl/34AtW6}{IPCC CMIP5 Data
Factsheet}
\pllabel{RC1-15}
for a discussion of the snapshots and their coverage). This meant
that reliance on the ESGF's inventory of registered users for any
aspect of the infrastructure -- such as tracking usage, compliance
with licensing requirements, or informing users about errata or
retractions -- could at best ensure partial coverage of the user
base.
This key finding implies a more distributed design for several
features outlined below, devolving many of them to the
datasets themselves rather than the archives. One may think of this
as a \emph{dataset-centric rather than system-centric} design (in
software terms, a \emph{pull} rather than \emph{push} design):
information is made available upon request at the user/dataset
level, relieving the ESGF implementation of an impossible burden.
\end{enumerate}
Based upon these considerations, the WIP produced a set of position
papers (see Appendix~\ref{sec:wip}) encapsulating specifications and
recommendations for CMIP6 and beyond. These papers, summarized below,
are available from the
\urlref{https://www.earthsystemcog.org/projects/wip/}{WIP website}. As
the WIP continues to develop additional recommendations, they too will
be made available. As requirements evolve, a modified document will
be released with a new version number.
\section{A structured approach to data production}
\label{sec:dreq}
The CMIP6 data framework has evolved considerably since CMIP5. It
follows the principles of scientific reproducibility (Item~\ref{repro}
in Section~\ref{sec:principles}) and recognizes that the
complexity of the experimental design (Item~\ref{cmplx}) requires a far
greater degree of automation and embedding in workflows. This
requires that all elements in the specification be recorded in
structured text formats (XML and JSON, for example) and be subject to
rigorous version control. \emph{Machine-readable} specification of as
many aspects of the model output configuration as possible is a
design goal, as noted earlier.
The data request spans several elements discussed in sub-sections
below.
\subsection{CMIP6 Data Request}
\label{sec:data-request}
\pllabel{RC2-18}
The CMIP6 Data Request is one of the most complex elements of the
CMIP6 infrastructure. It is a direct response to the complexity of the
new design outlined in \cite{ref:eyringetal2016a}. The experimental
design now involves three tiers of experiments, where an individual
modeling group may choose which ones to perform, and variables grouped
by scientific goals and priorities, where again centers may choose
which sets to publish, based on interests and resource constraints.
There are also cross-experiment data requests, where for instance the
design may require a variable in one experiment to be compared against
the same variable from a different experiment. The modeling groups
will then need to take this into account before beginning their
simulations. The CMIP6 Data Request is a codification of the entire
experimental design into a structured set of machine-readable
documents, which can in principle be directly ingested in data
workflows.
The \urlref{https://goo.gl/iNBQ9m}{CMIP6 Data Request}
\citep{ref:juckesetal2015} combines definitions of variables and their
output format with specifications of the objectives they support and
the experiments that they are required for. The entire request is
encoded in an XML database with rigorous type constraints. Important
elements of the request, such as units, cell methods (expressing the
subgrid processing implicit in the variable definition), and
frequencies and time ``slices'' (subsets of the entire simulation
period as defined in the experimental design) for required output, are
defined as controlled vocabularies within the request to ensure
consistency of usage. The request is designed to enable flexibility,
allowing modeling centers to make informed decisions about the
variables they should submit to the CMIP6 archive from each
experiment.
% The data request spans several elements.
% \begin{enumerate}
% \item specification of the parameter to be calculated in terms of a CF
% standard name and units,
% \item an output frequency,
% \item a structural specification which includes specification of
% dimensions and of subgrid processing.
% \end{enumerate}
In order to facilitate the cross-linking between the 2100 variables
from 248 experiments, the request database allows MIPs to aggregate
variables and experiments into groups. This allows MIPs to designate
variable groups by priority, and allows queries that return a
\emph{Request}, informing the modeling groups of the variables needed
from any given experiment, at the specified time slices and
frequencies.
% The link between variables and
% experiments is then made through the following chain:
% \begin{itemize}
% \item A \emph{variable group}, aggregating variables with priorities
% specific to the MIP defining the group;
% \item A \emph{request link} associating a variable group with an
% objective and a set of request items;
% \item \emph{Request} items associating a particular time slice with a
% request link and a set of experiments.
% \end{itemize}
This formulation takes into account the complexities that arise when a
particular MIP requests that variables needed for their own
experiments should also be saved from a DECK experiment or from an
experiment proposed by a different MIP.
The data request supports a broad range of users, who are
provided with a range of different access points. These include the
entire codification in the form of a structured (XML) document, web
pages, or spreadsheets, as well as a Python API and command-line tools
to satisfy a wide variety of usage patterns for accessing and using
the data request.
% \begin{enumerate}
% \item The XML database provides the reference document;
% \item Web pages provide a direct representation of the database
% content;
% \item Excel workbooks provide selected overviews for specific MIPs and
% experiments;
% \item A python library provides an interface to the database with some
% built-in support functions;
% \item A command line tool based on the python library allows quick
% access to simple queries.
% \end{enumerate}
The data request's machine-readable database has been an extraordinary
resource for the modeling centers. They can, for example, directly
integrate the request specifications with their workflows to ensure
that the correct set of variables is saved for each experiment they
plan to run. In addition, it has given them a new-found ability to
estimate the data volume associated with meeting a MIP's requirements,
a feature exploited below in Section~\ref{sec:dvol}.
\subsection{Model inputs}
\label{sec:data-inputs}
Datasets used by the model for configuration of model inputs
\citep[\texttt{input4MIPs}, see][]{ref:duracketal2017} as well as
observations for comparison with models \citep[\texttt{obs4MIPs},
see][]{ref:teixeiraetal2014} are both now organized in the same way,
and share many of the same naming and metadata conventions as the CMIP
model output itself.
\pllabel{RC3-9}
The coherence of standards across model inputs, outputs, and
observational datasets is a development that will enable the community
to build a rich toolset across all of these datasets. The datasets
follow the versioning methodologies described below in Section~\ref{sec:version}.
\subsection{Data Reference Syntax}
\label{sec:data-drs}
The organization of the model output follows the
\urlref{http://goo.gl/v1drZl}{Data Reference Syntax (DRS)} first used
in CMIP5, and now in somewhat modified form in CMIP6. The DRS depends
on pre-defined \emph{controlled vocabularies} (CVs) for various terms
including: the names of institutions, models, experiments, time
frequencies, etc. The CVs are now recorded as a version-controlled set
of structured text documents, which satisfies the requirement that there
is a \urlref{https://goo.gl/HGafnJ}{single authoritative source for
any CV}, on which all elements in the toolchain will rely. The DRS
elements that rely on these controlled vocabularies appear as netCDF
attributes and are used in constructing file names, directory names,
and unique identifiers of datasets that are essential throughout the
CMIP6 infrastructure. These aspects are covered in detail in the
\urlref{https://goo.gl/mSe4rf}{CMIP6 Global Attributes, DRS,
Filenames, Directory Structure, and CVs} position paper. A new
element in the DRS indicates whether data has been stored on a native
grid or has been regridded (see discussion below in
Section~\ref{sec:dvol} on the potentially critical role of regridded
output). This element of the DRS will allow us to track the usage of
the \emph{regridded subset} of data, and assess the relative
popularity of native-grid vs. standard-grid output.
\subsection{CMIP6 data volumes}
\label{sec:dvol}
As noted, extrapolations based on CMIP3 and CMIP5 lead to some
alarming trends in data volume \citep[see
e.g.,][]{ref:overpecketal2011}.
\pllabel{RC3-10}
As seen in their Figure~2, model output such as that from CMIP is
beginning to rival observational data in volume. As noted in the
Introduction, a particular problem for our community is the diverse
and very large user base for the data, many of whom are not climate
specialists, but downstream users of climate data studying the impacts
of climate change. This stands in contrast to other fields with
comparably large data holdings: data from the Large Hadron Collider
\citep[e.g.,][]{ref:aadetal2008}, for example, is primarily consumed by
high energy physicists and is not of direct interest to scientists in
unrelated fields.
A rigorous approach to the estimation of future data volumes is
needed, rather than simple extrapolation. Contributions to the
increase in data volume include the systematic increase in model
resolution and the growing complexity of the experimental protocol and data
request. We consider these separately:
\begin{description}
\item[Resolution] The median horizontal resolution of a CMIP model
tends to grow with time, and is expected to be more typically 100~km
in CMIP6, compared to 200~km in CMIP5. The vertical resolution grows
in a more controlled fashion, at least as far as the data is
concerned, as often the requested output is reported on a standard
set of atmospheric levels that has not changed much over the years.
Similarly the temporal resolution of the data request does not
increase at the same rate as the model timestep: monthly averages
remain monthly averages. A doubling of horizontal resolution therefore
leads, in principle, to a quadrupling of the data volume. But
typically the temporal resolution of the model (though not the data)
is doubled as well, for reasons of numerical stability. Thus, for an
$N$-fold increase in horizontal resolution, we require an $N^3$-fold
increase in computational capacity, which will result in an $N^2$-fold
increase in data volume. We argue, therefore, that data volume $V$
and computational capacity $C$ are related as $V \sim C^{2/3}$,
purely from the point of view of resolution. The exponent is even
smaller if vertical resolution increases are assumed.
\pllabel{RC1-18}
This is because most 3D model output is requested on sets of
``standard levels'' and thus the output fields do not scale with the
number of model levels (see discussion in the
\urlref{https://goo.gl/wVtm5t}{CMIP6 Output Grid Guidance
document}).
If we then assume that centers will experience an 8-fold increase in
$C$ between CMIPs (which is optimistic in an era of tight budgets),
we can expect a 4-fold increase in data volume. However, this is not
what we experienced between CMIP3 and CMIP5. What caused that
extraordinary 50-fold increase in data volume?
\item[Complexity] The answer lies in the complexity of CMIP: the
complexity of the data request, and of the experimental protocol.
The first component, the
\pllabel{RC1-19}
data request complexity, is related to that of the science: the
number of processes being studied, and the physical variables
required for the study. In CPMIP \citep{ref:balajietal2017}, we have
attempted a rigorous definition of this complexity, measured by the
number of physical variables simulated by the model. This, we argue,
grows not smoothly like resolution, but in very distinct
generational step transitions, such as the one from atmosphere-ocean
models to Earth system models, which involved a substantial jump in
complexity, that is, in the number of physical, chemical, and biological species
being modeled, as shown in \cite{ref:balajietal2017}.
\pllabel{RC1-29a}
The dramatic increase in data volume between CMIP3 and CMIP5 was
also due to these causes. Many models of the CMIP5 era added
atmospheric chemistry and aerosol-cloud feedbacks, sometimes with
$\mathcal{O}(100)$ species. CMIP5 also marked the first time in CMIP
that ESMs were used to simulate changes in the carbon cycle and
modeling groups performed many more simulations than in CMIP3 with a
corresponding increase in years simulated.
% the following increase in complexity doesn't help explain the 50-fold increase
% which is what this paragraph is supposed to address
% the number of experiments (or number of years simulated) are
% primarily controlled by $C$, which you say is limited to 8-fold increase.
% need to restructure the argument.
The second component of complexity is the experimental protocol, and
the number of experiments themselves when comparing CMIP5 and CMIP6.
With the new structure of CMIP6, with a DECK and 23 endorsed MIPs,
this
\pllabel{RC3-11}
has grown tremendously. We propose, as a measure of experimental
complexity, the \emph{total number of simulated years (SYs)}
conforming to a given protocol. Note that this too is gated by $C$:
modeling centers usually make tradeoffs between experimental
complexity and resolution in deciding their level of participation
in CMIP6, as discussed in \cite{ref:balajietal2017}.
\end{description}
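As a worked sketch of the resolution argument above (an illustration
under the stated assumptions that output is saved on fixed vertical
levels and at fixed output frequency; it is not an additional result),
an $N$-fold refinement in horizontal resolution gives
\[
  V \propto N^2, \qquad C \propto N^2 \times N = N^3
  \quad\Longrightarrow\quad V \propto C^{2/3} .
\]
An 8-fold increase in $C$ then corresponds to $N = 8^{1/3} = 2$ and a
data volume increase of $8^{2/3} = 4$. Conversely, if resolution alone
were responsible, a 50-fold increase in $V$ would require roughly a
$50^{3/2} \approx 350$-fold increase in $C$, which is consistent with
attributing most of the growth between CMIP3 and CMIP5 to complexity
rather than resolution.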
Two further steps have been proposed toward ensuring sustainable
growth in data volumes.
% Given the earlier arguments, it seems $C$ will limit growth of volume by itself
% Why are additional steps necessary?
\pllabel{RC2-21}
The first of these is the consideration of standard horizontal
resolutions for saving data, as is already done for vertical and
temporal resolution in the data request. Cross-model analyses already
cast all data to a common grid in order to evaluate it as an ensemble,
typically at fairly low resolution. The studies of Knutti and
colleagues \citep[e.g.,][]{ref:knuttietal2017} are typically performed
on relatively coarse grids. Accordingly, for most purposes,
atmospheric data on the ERA-40 grid ($2^\circ\times 2.5^\circ$) would
suffice, with of course exceptions for experiments like those called
for by HighResMIP \citep{ref:haarsmaetal2016}. A similar
conclusion applies for ocean data (the World Ocean Atlas
$1^\circ\times 1^\circ$ grid); for an extended discussion of the
benefits and losses due to regridding, see
\cite{ref:griffiesetal2014,ref:griffiesetal2016}.
\pllabel{RC3-14}
This has not been mandated for CMIP6 for a number of reasons. First,
regridding is burdensome on many grounds: it requires considerable
expertise to choose appropriate algorithms for particular variables
(for instance, exact conservation may be required for scalars, or
preservation of streamlines for vector fields), and it can be
expensive in terms of computation and
storage. Second, regridding is irreversible (thus amounting to
``lossy'' data reduction) and non-commutative with certain basic
arithmetic operations such as multiplication (i.e., the product of
regridded variables does not in general equal the regridded output of
the product computed on the native grid). This can be problematic for
budget studies. However, the same issues would apply for
time-averaging and other operations long used in the field: much
analysis of CMIP output is performed on monthly-averaged data, which
is ``lossy'' compression along the time axis relative to the model's
time resolution.
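The non-commutativity with multiplication can be seen in a minimal
example. Assume, purely for illustration, that regridding is a simple
average $\mathcal{R}$ of two native-grid cells into one target cell,
and take hypothetical field values $a = (1, 3)$ and $b = (1, 3)$. Then
\[
  \mathcal{R}(a)\,\mathcal{R}(b) = 2 \times 2 = 4 ,
  \qquad
  \mathcal{R}(ab) = \frac{1 \cdot 1 + 3 \cdot 3}{2} = 5 ,
\]
so the product of the regridded fields differs from the regridded
product, the difference being the subgrid covariance of the two
fields. Regridding schemes used in practice are more elaborate than
this two-cell average, but the same effect arises whenever the fields
co-vary within a target cell.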
These issues have contributed to a lack of consensus in moving forward,
and the recommendations on regridding remain in flux. The
\urlref{https://goo.gl/wVtm5t}{CMIP6 Output Grid Guidance document}
outlines a number of possible recommendations, including the provision
of ``weights'' to a target grid. Many of the considerations around
regridding, particularly for ocean data in CMIP6, are discussed at
length in \cite{ref:griffiesetal2016}.
There is a similar lack of consensus around a common \emph{calendar} for
particular experiments.
\pllabel{RC3-13}
In cases such as a long-running control simulation where all years are
equivalent and of no historical significance, it is customary in this
community to use simplified calendars -- such as a Julian, a
``noleap'' (365-day), or an ``equal-month'' (360-day) calendar -- rather
than the Gregorian, as this can vastly simplify analysis. However,
comparison across datasets using incommensurate calendars can be a
frustrating burden on the end-user. There is no consensus at this
point on this issue.
As outlined below in Section~\ref{sec:replica}, both ESGF data nodes
and the creators of secondary repositories are given considerable
leeway in choosing data subsets for replication, based on their own
interests. The tracking mechanisms outlined in Section~\ref{sec:pid}
below will allow us to ascertain, after the fact, how widely used the
native grid data may be \emph{vis-\`a-vis} the regridded subset, and
allow us to recalibrate the replicas, as usage data becomes available.
We note also that the providers of at least one of the standard
metrics packages \citep[ESMValTool,][]{ref:eyringetal2016a} have
expressed a preference for standard-grid data for their analysis, as
regridding from disparate grids increases the complexity of their
already overburdened infrastructure.
A second method of data reduction, for the purposes of storage and
transmission, is data compression. netCDF4, which is the
recommended format for CMIP6 data, includes an option for lossless
compression or deflation \citep{ref:zivlempel1977} that relies on the
same technique used in standard tools such as \texttt{gzip}. In
practice, the reduction in data volume will depend upon the
``entropy'' or randomness in the data, with smoother data being
compressed more.
Deflation entails computational costs, not only during creation of the
compressed data, but also every time the data are re-inflated. There
is also a subtle interplay with precision: for instance temperatures
usually seen in climate models appear to deflate better when expressed
in Kelvin, rather than Celsius, but that is due to the fact that the
leading order bits are always the same, and thus the data is actually
less precise. Deflation is also enhanced by reorganizing
(``shuffling'') the data internally into chunks that have spatial and
temporal coherence.
Some in the community argue for the use of more aggressive
\emph{lossy} compression methods \citep{ref:bakeretal2016}, but the
prevailing view, after consideration, is that the loss of precision
entailed by such methods, and the consequences for scientific results,
require considerably more evaluation before such
methods can be accepted as common practice. However, as noted above,
some lossy methods of data reduction such as time-averaging, have long
been common practice.
Given the options above, we undertook a systematic study of the
behavior of typical model output files under lossless compression, the
results of which are \urlref{https://goo.gl/qkdDnn}{publicly available}.
The study indicates that standard \texttt{zlib} compression in the
netCDF4 library with the settings of \texttt{deflate=2} (relatively
modest, and computationally inexpensive) and \texttt{shuffle} (which
ensures better spatiotemporal homogeneity) offers the best compromise
between increased computational cost and reduced data volume. For an
ESM,
\pllabel{RC1-25}
we expect a total saving of about 50\%, with the ocean, ice, and land realms
seeing the largest savings (owing to large areas of the globe that are
masked), and atmospheric data the least. This 50\% estimate has been
verified with sample output from some models preparing for CMIP6.
The \urlref{https://goo.gl/iNBQ9m}{DREQ} alluded to above in
Section~\ref{sec:dreq} allows us to make a systematic assessment of
these considerations. The tool expects one to input a model's
resolution along with the experiments that will be performed and the
data one intends to save (using DREQ's \emph{priority} attribute).
With this information
% We are actually capturing this information in the registered content
% for the model source_id entries - see http://rawgit.com/WCRP-CMIP/CMIP6_CVs/master/src/CMIP6_source_id.html
% The json entry contains resolutions for each active model realm
% https://github.com/WCRP-CMIP/CMIP6_CVs/blob/master/CMIP6_source_id.json
% "unprecedented" is incorrect.
% In CMIP5 we had a sophisticated capability of estimating data volume
% We polled the groups to determine which experiments they planned
% to run and how large their ensembles would be.
% We also asked what resolution they would report output.
% From this we estimated in Nov. 2010 a total data volume of 2.5 petabytes
% (2.1 petabytes if only high-priority variables were reported), not too
% far from the actual volume. I'll send you the analysis if you like.
% The modeling groups had access to this information.
\pllabel{RC2-23}
one may calculate the data volume that will be produced. For instance,
analyses available on the
\urlref{http://clipc-services.ceda.ac.uk/dreq/tab01_3_3.html}{DREQ
site} indicate that if a center were to undertake every single
experiment (all tiers) and save every single variable requested (all
priorities) at a ``typical'' resolution, it would generate about
800~TB of data, using the guidelines above. Given 75 participating
models, this translates to an upper bound of 60~PB for the entire
CMIP6 archive, though in practice most centers are planning to perform
a subset of experiments, and save a subset of variables, based on
their scientific priorities and available computational and storage
resources. The WIP carried out a survey of modeling centers in 2016,
asking them for their expected model resolutions, and intentions of
participating in various experiments. Based on that survey, we
initially forecast a
\pllabel{RC1-27}
compressed data volume of 18~PB for CMIP6. This number, 18~PB, is
about 6 times the CMIP
\pllabel{RC1-28}