ESGF_Data_Download_Strategies
One very common requirement for climate scientists or data managers is to be able to efficiently find and download datasets containing specific variables, from distributed data centers located all around the globe, and produced by different models under the same forcing conditions. This is, in essence, what a Model Intercomparison Project (MIP) is all about. To be specific, let's consider the following use case in the context of the ongoing IPCC AR5 activities:
**Find and download the monthly mean atmospheric variables for the decadal experiments for all CMIP5 models and ensembles.**
Because of the very large number of files to be downloaded, it is impractical to execute this use case from a traditional web browser user interface (although it is still useful to start the data search there, just to figure out what's available). Luckily, the new ESGF Peer-To-Peer system allows the whole process to be completely scripted, so that it can be executed efficiently and repeatedly by a computer agent, as opposed to a human pointing and clicking...
The first step in realizing the use case is understanding the volume of the data involved, so that a proper download strategy can be devised. This can be done by interacting with the search service deployed on any of the federated ESGF nodes. Users can type URLs into the browser's URL bar, or write a script in their language of choice to execute HTTP requests to those URLs.
The following query will find all datasets in the system (i.e. across all ESGF sites) that are the latest version, are not replicas, belong to the experiment family "Decadal" (which includes all CMIP5 decadal experiments), and contain monthly mean atmospheric variables. It will return an XML document that lists no individual results (limit=0) but reports the total number of matching datasets, together with the values of model, experiment, ensemble, and id (the search _facets_, or search categories) for the datasets that match the search criteria:
The above query matches (at the time of this writing) 2852 datasets. For each of the requested facets (the "facets=..." parameter), the returned values and counts indicate how many of the matching datasets fall in each category. This information can be extremely useful in deciding how to partition the total data volume into smaller subsets that are easier to manage and download.
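The query URL itself is not reproduced here. As a minimal sketch, a facet-count query of this kind might be assembled as shown below. The host name is a placeholder, and the parameter names (`experiment_family`, `time_frequency`, `realm`, `latest`, `replica`, `distrib`) are assumptions based on the ESGF Search API rather than values taken from this page, so check them against the search documentation of your node.

```python
from urllib.parse import urlencode

# Assumption: placeholder host; substitute the search service URL of a real ESGF node.
SEARCH_URL = "https://esgf-node.example.org/esg-search/search"


def build_facet_count_url(base_url=SEARCH_URL):
    """Build a search URL that asks for facet counts only (limit=0)."""
    params = [
        ("type", "Dataset"),
        ("distrib", "true"),               # assumed: search across all federated sites
        ("latest", "true"),                # latest version only
        ("replica", "false"),              # exclude replicas
        ("experiment_family", "Decadal"),  # all CMIP5 decadal experiments
        ("time_frequency", "mon"),         # assumed facet value for monthly means
        ("realm", "atmos"),                # assumed facet value for atmospheric variables
        ("facets", "model,experiment,ensemble,id"),
        ("limit", "0"),                    # return counts, not individual results
    ]
    # safe="," keeps the comma-separated facet list readable in the URL.
    return base_url + "?" + urlencode(params, safe=",")


if __name__ == "__main__":
    print(build_facet_count_url())
```

The printed URL can be pasted into a browser or fetched from a script. Keeping the constraints in one list makes it easy to reuse them for the later file and wget queries.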
The user could adopt one of many possible strategies to download the data: one dataset at a time, all datasets belonging to the same experiment and model, all datasets with the same model across all experiments, by variable, etc... As a practical consideration, it is a good idea to generate download scripts that contain no more than a few hundred files, so that the script can be easily monitored for successful completion, and restarted in case of failure (network problems, lack of proper authorization, expired certificate, etc.).
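The "no more than a few hundred files per script" guideline can be applied mechanically once a list of file identifiers is in hand. The sketch below splits any list into batches. The batch size of 200 and the file names are illustrative assumptions, not values prescribed by ESGF.

```python
def split_into_batches(items, max_size=200):
    """Split a list into consecutive batches of at most max_size items."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    return [items[i:i + max_size] for i in range(0, len(items), max_size)]


if __name__ == "__main__":
    # Hypothetical file names, used only to show the batching behaviour.
    files = [f"file_{n:04d}.nc" for n in range(450)]
    for number, batch in enumerate(split_into_batches(files), start=1):
        print(f"batch {number}: {len(batch)} files")
```

Each batch can then drive its own download script, so a failure in one batch does not force a restart of the others.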
For example, the following query will identify a single dataset with a given value of the "id" facet (as retrieved from the previous query):
and the following query will select all _files_ that compose that dataset (currently, 47 files):
(Note that the query for datasets is typically very fast, while the query for files is slower, since the number of files in the system is much larger.)
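The two query URLs are not reproduced here. As a sketch of how the dataset and file queries relate, the helpers below build both from one dataset identifier. The host and the identifier are placeholders, and the use of `id` for the dataset query and `dataset_id` for the file query is an assumption based on the ESGF Search API.

```python
from urllib.parse import urlencode

# Assumption: placeholder host; substitute the search service URL of a real ESGF node.
SEARCH_URL = "https://esgf-node.example.org/esg-search/search"


def dataset_query_url(dataset_id, base_url=SEARCH_URL):
    """URL that selects the single dataset with the given "id" facet value."""
    return base_url + "?" + urlencode({"type": "Dataset", "id": dataset_id})


def file_query_url(dataset_id, base_url=SEARCH_URL):
    """URL that selects all files belonging to the given dataset."""
    return base_url + "?" + urlencode({"type": "File", "dataset_id": dataset_id})


if __name__ == "__main__":
    # Hypothetical identifier; real values come from the "id" facet of an earlier query.
    example_id = "example.dataset.v1|datanode.example.org"
    print(dataset_query_url(example_id))
    print(file_query_url(example_id))
```

`urlencode` percent-encodes characters such as `|` in the identifier, so the value survives being sent as a query parameter.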
As a possible alternative, the following query:
will find all datasets that match the use case criteria, for the specified variable (specific humidity) and model - currently 286 results. The corresponding query for files is:
which also yields 286 results, because in this case each variable is contained in one file only.
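Result counts such as those quoted above can be read from the response instead of being counted by eye. The sketch below assumes the search service returns Solr-style XML, with a `numFound` attribute on the `result` element and facet counts nested under `facet_counts/facet_fields`. The sample document is hand-written for illustration and is not real server output.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the assumed Solr XML layout (not real ESGF output).
SAMPLE_RESPONSE = """
<response>
  <result name="response" numFound="3" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="model">
        <int name="MODEL-A">2</int>
        <int name="MODEL-B">1</int>
      </lst>
    </lst>
  </lst>
</response>
"""


def parse_counts(xml_text):
    """Return (total matches, {facet name: {value: count}}) from a response."""
    root = ET.fromstring(xml_text)
    result = root.find("result[@name='response']")
    if result is None:
        raise ValueError("no <result name='response'> element found")
    num_found = int(result.get("numFound"))
    facets = {}
    path = "lst[@name='facet_counts']/lst[@name='facet_fields']/lst"
    for facet in root.findall(path):
        facets[facet.get("name")] = {
            entry.get("name"): int(entry.text) for entry in facet.findall("int")
        }
    return num_found, facets


if __name__ == "__main__":
    total, facets = parse_counts(SAMPLE_RESPONSE)
    print(total, facets)
```

Comparing the total from a dataset query with the total from the matching file query is a quick way to confirm whether each dataset holds a single file.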
These are only two possible download strategies - the user should feel free to come up with whatever strategy is best suited to the current availability of the data, the security setup, the science analysis they want to perform, etc.
Once the user has identified the collection of files they want to download, generating a corresponding wget script is as easy as changing the request URL from ".../search?type=File..." to ".../wget?...", keeping all other constraints unchanged (the "type=File" constraint is implicit in the generation of a wget script). For example, the following URLs will generate wget scripts corresponding to the collection of files identified before:
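The URLs in question are not reproduced here. As a sketch of the rewrite described above, the helper below turns a file search URL into the corresponding wget URL by swapping the final path segment and dropping the `type` constraint. The host, the `variable=hus` constraint, and the model name in the example are placeholders, and dropping `type` is a choice based on the statement that it is implicit.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def search_url_to_wget_url(search_url):
    """Rewrite ".../search?type=File&..." as ".../wget?..." with the same constraints."""
    parts = urlsplit(search_url)
    if not parts.path.endswith("/search"):
        raise ValueError("expected a URL whose path ends in /search")
    wget_path = parts.path[: -len("search")] + "wget"
    # Keep every constraint except "type", which is implicit for wget scripts.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "type"
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, wget_path, urlencode(params), parts.fragment)
    )


if __name__ == "__main__":
    # Placeholder host and model name, for illustration only.
    url = ("https://esgf-node.example.org/esg-search/search"
           "?type=File&variable=hus&model=MODEL-NAME")
    print(search_url_to_wget_url(url))
```

The response to the resulting URL is the wget script itself, so it can be saved to a file and run from a shell.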
At this point, the user simply needs to run the scripts to download the data automatically, provided they have previously registered with one of the ESGF Nodes and have been granted access to CMIP5 data by accepting the _CMIP5 Research_ license agreement.
Happy downloads...
For more information on how to search and generate wget scripts within ESGF, please see:
- **ESGF Search API**: search syntax and examples
- **Search Controlled Vocabulary**: list of facets, reserved keywords and other constants used in searching
- **Wget Scripts FAQ**: practical guide to running wget scripts generated by the web portal user interface
- **Wget Scripting**: how to create wget scripts directly from the browser URL bar, bypassing the web portal user interface