Merge pull request #79 from ACDguide/paola_create
Paola create
chloemackallah authored Oct 26, 2023
2 parents 83ba744 + bb03e43 commit d183ba9
Showing 8 changed files with 191 additions and 197 deletions.
4 changes: 1 addition & 3 deletions Governance/_toc.yml
@@ -91,9 +91,7 @@ parts:
- file: tech/backup-checklist
- file: tech/cf-checker
- file: tech/contributors
- file: tech/drs
sections:
- file: tech/filenames
- file: tech/drs-names
- file: tech/keywords
- file: tech/coding
- file: tech/data_formats
2 changes: 1 addition & 1 deletion Governance/concepts/license-qa.md
@@ -11,7 +11,7 @@ The license is enforceable in court, but clearly that's an extreme step. Usually
* <ins>How can my license be valid if a project or myself act as licensor when the copyright belongs to my institution?</ins><br>
If you are the creator of the data/code then you can apply a license on behalf of your institution. They won't mind as long as the license you are using is in line with their recommendations. Most Australian universities and the ARC, which funds most projects, require open access for any research product (unless there is a valid reason not to).<br>

* <ins>How can I license data partly derived from a "commercial" product?</ins><br>
You should first check if there is an agreement allowing you to use the data, and if this agreement covers publishing derived data. If this is not in place, a way around it could be to leave out the commercial data used in the project and substitute it with a derived quantity.
In this [example](https://zenodo.org/record/4448518#.Y322MuxBz0o) the authors removed the wind speed measurements they used to identify a “severe wind event” and introduced a variable indicating whether such an event occurred, to ensure at least partial reproducibility.<br>
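The substitution step can be sketched in a few lines. The threshold, variable names, and numbers below are hypothetical, purely for illustration:

```python
# Sketch: replacing restricted raw measurements with a derived quantity.
# The 25 m/s threshold and all values are hypothetical examples.
wind_speed = [12.0, 18.5, 27.3, 9.8, 31.1]  # raw (restricted) measurements

THRESHOLD = 25.0  # hypothetical "severe wind event" threshold (m/s)

# Derived variable: did a severe wind event occur at each time step?
severe_event = [speed >= THRESHOLD for speed in wind_speed]

# Publish only the derived flags; the raw wind speeds are dropped.
published = {"severe_event": severe_event}
print(published["severe_event"])  # [False, False, True, False, True]
```

The published dataset remains partially reproducible (the event flags can be verified against any re-obtained copy of the commercial data) without redistributing the restricted measurements themselves.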

164 changes: 112 additions & 52 deletions Governance/create/create-basics.md

Large diffs are not rendered by default.

24 changes: 10 additions & 14 deletions Governance/create/create-intro.md
@@ -1,24 +1,20 @@
# Guidelines to create a climate dataset

## UNDER DEVELOPMENT

## Scope of the guidelines

These guidelines cover the various aspects of creating robust and well-described climate data for reuse, analysis, sharing, and publication.

We have identified five primary use cases that guide the recommendations and requirements to follow when creating climate datasets:
1. Own reuse and analysis: basic dataset needs.
2. Sharing with colleagues for collaboration: minimum sharing recommendations, no citation necessary.
3. Publication alongside a research paper: journal requirements apply.
4. Publication into a specific project: project standards apply.
5. Productisation, including market-readiness and commercialisation: standards depend on audience and intended use.

We will mostly discuss creating datasets from scratch, i.e. from 'raw' data that is currently undescribed and in a format that is not analysis-ready. Datasets can also be derived from existing data, as a result of analysis or of deriving metrics and indices from a reference dataset. We provide specific recommendations for this second situation later in the section.


## Index
* [Dataset creation basics](create-basics.md)
An overview of the landscape of climate datasets, including the various components of netCDF files and their storage in POSIX systems, and best-practice recommendations for the backup of data and management of the creation process.

* File formats, metadata & coordinates
* File & directory organisation
@@ -36,7 +32,7 @@ This is the more practical description of how to create climate datasets (genera

&nbsp;
* [Requirements for publication & productisation](create-publishing.md)
This chapter outlines the standards for publication data that either accompanies a journal article or is submitted to an intercomparison project (e.g., CMIP), and some recommendations for tools to aid this process.

* Publishing in a journal
* Submitting to an intercomparison project
8 changes: 7 additions & 1 deletion Governance/create/create-new-derived.md
@@ -1,6 +1,12 @@
# New, modified, and derived datasets

## Creating new datasets from raw data
Paola (new comments following meeting Sep23):

We discussed mentioning here the tools to generate/modify a netCDF file (ncdump/ncgen, NCO to modify attributes, how xarray/MATLAB "create" a netCDF file), rather than trying to re-create every possible workflow.
As well as the things a user should check to make sure they're following the recommendations listed in create-basics. For example, are the attributes still relevant, both at global and variable level?


Paola:
however rare, we could cover starting from a template, such as a CDL file (i.e. an `ncdump`-output-style text file)
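As a minimal illustration of such a template, a hypothetical CDL sketch (dimension, variable, and attribute names invented for the example) could be turned into a netCDF file with `ncgen -o template.nc template.cdl`:

```
netcdf template {
dimensions:
    time = UNLIMITED ;
    lat = 2 ;
    lon = 3 ;
variables:
    double time(time) ;
        time:units = "days since 2000-01-01" ;
        time:calendar = "standard" ;
    float tas(time, lat, lon) ;
        tas:standard_name = "air_temperature" ;
        tas:units = "K" ;

// global attributes:
        :Conventions = "CF-1.8" ;
        :title = "Example template dataset" ;
}
```

Running `ncdump -h` on an existing well-formed file produces exactly this kind of text, so a trusted file can serve as the starting template.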
@@ -31,4 +37,4 @@ Make sure original attributes/documentation are still relevant
be careful particularly with units, cell_methods and coordinates that might have changed

Chloe:
Provenance: https://acdguide.github.io/Governance/concepts/provenance.html
125 changes: 0 additions & 125 deletions Governance/markdown.md

This file was deleted.

59 changes: 59 additions & 0 deletions Governance/tech/drs-names.md
@@ -0,0 +1,59 @@
# Choosing a directory structure and filenames

The names you choose for files and directories, and generally the way you organise your data (i.e. your directory structure), can help users navigate the data, provide extra information, and avoid confusion or accidentally accessing the wrong data. In many cases the best file organisation will depend on the specific research project and the actual server where the data is stored. The global Coupled Model Intercomparison Project (CMIP) has adopted a **Data Reference Syntax (DRS)**, based on the **controlled vocabularies (CVs)** used in model metadata, to define its file names and directory structures.
Here we list a few guidelines and tips to help you decide.

## General considerations
* Familiarise yourself with the storage system: make sure you are storing the files in the most appropriate place, find out whether the storage is backed up, check what your allocation is, and learn what rules or best practices apply.
* Take into account how you or others might want to use the data. This is particularly important when deciding on the DRS, but also when dividing data across files for big datasets such as model output. Doing so at the start of the project will spare you lots of time you might otherwise spend re-processing all your files.
* Be consistent. This applies both to the organisation and the naming: consistency is essential for the data to be machine-readable, i.e. easy to access programmatically. Use community standards and/or controlled vocabularies wherever possible.
* Consider adding a `readme` file in the main directory, explaining the DRS and the naming conventions, abbreviations and/or codes you used. If you used standards and controlled vocabularies, all you have to do is include a link to them.

## Directory structure

![Example of directory structure](../images/example_drs.png)

The figure above shows an example of an organised working directory for a model output.

**Things to consider:**

* Try to organise files in directories based on type and how you process them for the final output. Also consider how others might use them: are they going to be used for analysis, or could they be used as forcing or restart files for a model?
* If there is an existing DRS defined for an analogous data product (e.g. input data to a climate analysis workflow), would it help you and others to structure your output following a similar convention?
* Think about how you would access these directories in code; for example, give the variable directories exactly the same name as the actual variable.
* Keep your code separate from your data: you want to be able to use something like git to version-control it, and possibly GitHub to back it up easily.
* Have at least one `readme` file with detailed metadata, possibly more if you have a lot of directories/files. You cannot realistically use git to manage versions of data, but you can use git to version-control your `readme` files.
* Review at regular intervals what you are keeping, what needs to be removed, and how things are organised.
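As a sketch of why a consistent structure pays off in code, a DRS-style path can be assembled programmatically from named components. The component names and values below are hypothetical, chosen only for illustration:

```python
from pathlib import Path

# Sketch: building a DRS-like path from named components, so that the same
# function locates any variable's directory without hard-coded strings.
def drs_path(root: str, project: str, experiment: str,
             frequency: str, variable: str) -> Path:
    """Assemble a predictable, machine-readable directory path."""
    return Path(root) / project / experiment / frequency / variable

path = drs_path("/data", "my-project", "historical", "mon", "tas")
print(path)  # /data/my-project/historical/mon/tas
```

Because every level of the hierarchy is a controlled component, globbing for "all monthly variables of the historical experiment" becomes a one-line pattern rather than a manual search.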

**Reference examples:**

* The CMIP6 DRS is defined in the [CMIP6 Controlled Vocabularies document](https://docs.google.com/document/d/1h0r8RZr_f3-8egBMMh7aqLwy3snpD6_MrDz1q8n5XUk/edit), starting on p.13.
* The [CORDEX DRS](http://is-enes-data.github.io/CORDEX_adjust_drs.pdf) builds on the CMIP DRS to apply to regional climate models.

## File naming
You can use filenames to include information such as:

* project, simulation and/or experiment acronyms; you might have to use a combination of them
* spatial coverage: the region or coordinate range covered by the data; this could also be a specific domain for climate model data, e.g. ocean, land etc.
* grid: could be either a grid label or spatial resolution
* temporal coverage: a specific year/date or a temporal range
* temporal frequency: monthly, daily etc.
* type of data: again, this depends on context. If the same directory contains data from different instruments, it is important to specify the instrument in the name; for coupled model output this could be the model component; if you are using one file per variable, the variable name
* version: this is really important if you are sharing the data, even if only one version exists at the time
* the correct file extension
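As an illustration of how a controlled naming convention makes filenames machine-readable, the sketch below parses a CMIP6-style filename with a regular expression; the filename itself is only an example:

```python
import re

# Sketch: parsing a CMIP6-style filename into its components.
# The pattern follows the CMIP6 template
# <variable>_<table>_<model>_<experiment>_<member>_<grid>_<time-range>.nc
pattern = re.compile(
    r"(?P<variable>[^_]+)_(?P<table>[^_]+)_(?P<model>[^_]+)_"
    r"(?P<experiment>[^_]+)_(?P<member>[^_]+)_(?P<grid>[^_]+)_"
    r"(?P<time_range>\d+-\d+)\.nc"
)

name = "tas_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_185001-201412.nc"
parts = pattern.fullmatch(name).groupdict()
print(parts["variable"], parts["time_range"])  # tas 185001-201412
```

Because each component comes from a controlled vocabulary and components are joined with a fixed separator, a single pattern like this can extract metadata from thousands of files.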

## Tips for machine-readable files
* avoid special characters: ~ ! @ # $ % ^ & * ( ) ` ; < > ? , [ ] { } ‘ “
* do not use spaces to separate words; use underscores "_", dashes "-", or CamelCase
* use YYYYMMDD for dates: it will sort your files in chronological order. Absolutely avoid "Jan, Feb, .." for months, as they are much harder to code for
* for number sequences, use leading zeros: 001, 002, .. 020, .. 103 rather than 1, 2, .. 20, .. 103
* try to avoid overly long names: keep a single file or directory name under 255 characters, and be aware that full paths also have system-dependent length limits
* avoid having a large number of files in a single directory, but also an excessive number of directories with one file each
* always include the file extension: some software can recognise files from their header, but this is not always the case
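The date and zero-padding tips can be verified directly, since both rely on lexical (string) sorting matching the intended order:

```python
# YYYYMMDD dates sort lexically in chronological order:
dated = ["tas_20210103.nc", "tas_20201231.nc", "tas_20210102.nc"]
print(sorted(dated))
# ['tas_20201231.nc', 'tas_20210102.nc', 'tas_20210103.nc']

# Zero-padded sequence numbers keep lexical order equal to numeric order:
padded = [f"run_{i:03d}" for i in (2, 10, 1)]
print(sorted(padded))  # ['run_001', 'run_002', 'run_010']

# Without padding, lexical sorting misorders the numbers:
unpadded = [f"run_{i}" for i in (2, 10, 1)]
print(sorted(unpadded))  # ['run_1', 'run_10', 'run_2']
```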

## Online Resources
We partially based this page on the resources listed below, and recommend checking them for more insight and advice.

* [Best practice to organise your data](https://www.earthdatascience.org/courses/intro-to-earth-data-science/open-reproducible-science/get-started-open-reproducible-science/best-practices-for-organizing-open-reproducible-science/) - part of an Open Reproducible Science course from the University of Colorado
* [Software Carpentry video covering DRS best practices](https://youtu.be/3MEJ38BO6Mo)
* [Best file naming practice handout (pdf) from Stanford University](https://stanford.box.com/shared/static/yl5a04udc7hff6a61rc0egmed8xol5yd.pdf)
2 changes: 1 addition & 1 deletion Governance/tech/drs.md
@@ -1,4 +1,4 @@
# Choosing a directory structure: DRS and filenames

The names you choose for files and directories and generally the way you organise your data, i.e. your directory structure, can help navigating the data, provide extra information, avoid confusion and avoid the user ending up accessing the wrong data. In many cases the best file organisation will depend on the specific research project and the actual server where the data is stored. The global Coupled Model Intercomparison Project (CMIP) has adopted a **Data Reference Syntax (DRS)**, based on the **controlled vocabularies (CVs)** used in model metadata, to define its file names and directory structures.
Here we list a few guidelines and tips to help you decide.
