diff --git a/Governance/_toc.yml b/Governance/_toc.yml index ac0c70e..e157e51 100644 --- a/Governance/_toc.yml +++ b/Governance/_toc.yml @@ -91,9 +91,7 @@ parts: - file: tech/backup-checklist - file: tech/cf-checker - file: tech/contributors - - file: tech/drs - sections: - - file: tech/filenames + - file: tech/drs-names - file: tech/keywords - file: tech/coding - file: tech/data_formats diff --git a/Governance/concepts/license-qa.md b/Governance/concepts/license-qa.md index d561091..0551e27 100644 --- a/Governance/concepts/license-qa.md +++ b/Governance/concepts/license-qa.md @@ -11,7 +11,7 @@ The license is enforceable in court, but clearly that's an extreme step. Usually * How can my license be valid if a project or myself act as licensor when the copyright belongs to my institution?
If you are the creator of the data/code then you can apply a license on behalf of your institution. They won't mind as long as the license you are using is in line with their recommendations. Most Australian universities and the ARC, which funds most projects, require open access for any research product (unless there is a valid reason not to).
-*How can I license data partly derived from a "commercial" product?
+* How can I license data partly derived from a "commercial" product?
You should first check whether there is an agreement allowing you to use the data and whether this agreement covers publishing derived data. If this is not in place, a way around it could be to leave out the commercial data used in the project and substitute it with a derived quantity. In this [example](https://zenodo.org/record/4448518#.Y322MuxBz0o) the authors removed the wind speed measurements they used to identify a “severe wind event” and introduced a variable indicating whether such an event occurred, to ensure at least partial reproducibility.
diff --git a/Governance/create/create-basics.md b/Governance/create/create-basics.md index 406a072..91ea402 100644 --- a/Governance/create/create-basics.md +++ b/Governance/create/create-basics.md @@ -1,82 +1,126 @@ -# Dataset creation basics & sharing recommendations +# Dataset creation basics -See https://github.com/ACDguide/Governance/issues/7 for discussion and suggestions +Climate data is a highly specialised field of data science, due to the size and complexity which often require computing and scientific coding skills. A variety of metadata fields are required to adequately describe the data and its dimensions. Domain-specific scientific knowledge is required to make informed decisions about its creation and use, and technical knowledge is required to produce robust datasets that can be reused by others. +Many of the terms and concepts used are described in more detail in the [Concepts](../concepts/concept-intro.md) and [Technical tips](../tech/tech-intro.md) appendices. - -## File formats, metadata & coordinates - -Climate data is a highly specialised field of data science, due to the size and complexity which often require computing and scientific coding skills, the variety of metadata fields required to adequately describe the data and its dimensions, the domain-specific scientific knowledge required to make informed decisions about its creation and use, and the technical knoweldge required to produce robust datasets that can be reused by others. (Note that we will use terms and concepts described in the appendix [Concepts](../concepts/concept-intro.md).) - -By far the most commonly used format in the climate science community is [netCDF](https://www.unidata.ucar.edu/software/netcdf/), an open (i.e. not proprietary), self-describing (i.e. metadata is an in-built feature) array-oriented (i.e. 
highly structured data such as climate model data) format that is typically used on POSIX systems (Unix-based computing system with a directory structure and command-line input; standard for high-performance computing systems such as [NCI](https://nci.org.au/)). See the [technical note on data formats](../tech/data_formats.md). - -### Components of a NetCDF file +## NetCDF format +By far the most commonly used format in the climate science community is [netCDF](../tech/data_formats.md): (maybe we should cross reference big data guide here) +* open, +* self-describing: metadata is an in-built feature, +* array-oriented, + format. NetCDF files contain three main components: -* Dimensions describe the overall array structure of the data stored in the file, though not every variable must use all dimensions that exist in the file. Dimensions can be ‘real’ dimensions (such as time, latitude, longitude), or ‘pseudo’ dimensions (such as land-use tiles, or spectral bands). NetCDF dimensions, however, contain no metadata or actual values, which are described using variables with the same name. The dimensions are the base architecture of the file. +* Dimensions describe the overall array structure of the data stored in the file, though variables can have different dimensions. Dimensions can be ‘real’ dimensions (such as time, latitude, longitude), or ‘pseudo’ dimensions (such as land-use tiles, or spectral bands). NetCDF dimensions, however, contain no metadata or actual values, which are instead described using variables with the same name. The dimensions are the base architecture of the file. ```{note} -Technically, many dimensions can be created in a NetCDF file, including multiple time (e.g. time0, time 1, etc) or lat/lon (e.g. lat_a, lat_c) dimensions if you choose. 
-However, it is recommended to minimise the use of multiple highly similar dimensions; particularly 'time', as there is often a hard-coded expectation in analysis/visualisation packages that expect one and only one time axis. +Technically, many dimensions can be created in a netCDF file, including multiple time or lat/lon dimensions. +However, it is recommended to minimise the use of multiple highly similar dimensions, particularly 'time', as analysis/visualisation packages often cannot handle multiple time axes, and having more than one might produce errors or unexpected results. ``` -* Variables (usually represented with floating point values) contain the actual geospatial data that you are interested in storing and sharing. NetCDF variables can be either your specific scientific information (e.g. surface temperatures on a lat/lon grid), or the value description of the array dimensions (e.g. timestamps for a time dimension). Each variable is defined along one or more dimension, and has associated attributes in the form of key:value pairs. These attributes can be titled using any string or value, however there are some common standards (e.g. [CF conventions](../concepts/cf-conventions.md)) that we highly recommend using. +* Variables contain the actual value arrays and metadata used to describe them. NetCDF variables can be used for actual geospatial data (e.g., surface temperatures on a lat/lon grid), or to store the dimension arrays and definitions (e.g., timestamps for a time dimension). Each variable is defined along one or more dimensions and has associated attributes in the form of {key: value} pairs. Attribute and variable names must be strings. While there are only a few restrictions on names, there are common standards (e.g., [CF conventions](../concepts/cf-conventions.md)) that we highly recommend using. + -* Global attributes are key:value pairs that descibe the file at the top-level.
While these are typically chosen according to the use case of the data and can vary significantly between modelling realms or scientific need, standards also exist for these. Common global attributes include dataset title, provenance information (i.e. where the data came from), license, and contact information, as well as naming any metadata conventions implemented in the file. +* Global attributes are {key: value} pairs that describe the overall file. These are typically chosen according to the kind of data, the way it was generated and its potential uses. However, standards such as the [ACDD conventions](../concepts/acdd-conventions.md) also exist to cover common global attributes. These include dataset title, [provenance information](../concepts/provenance.md), license, and contact information, as well as naming any metadata conventions implemented in the file. -It is likely that the raw scientific data that you are building your datasets around are already highly structured (even in netCDF format already), so your main effort here will be ensuring that the metadata correctly and adequately describes your data. +```{note} +Raw scientific data is usually highly structured and likely in netCDF format already, so often the main effort required is to ensure that the attributes describe the data correctly and adequately. +``` ### Attributes NetCDF metadata attributes are generally in one of two categories: machine-readable and human-readable (though these overlap significantly). -* Machine-readable attributes (e.g., `units`, `standard_name`, `missing_value` and `calendar`) typically describe the data itself, are usually variable-level, and can be automatically interpreted by standard plotting and analysis tools if set according to common [conventions](../concepts/conventions.md) and [controlled vocabularies](../concepts/controlled-vocab.md). 
These conventions typically contain suggestions for the attribute key and value using commonly understood terms, and are highly recommended to enable analysis and visualisation with standard software packages. - -* Human-readable attributes (e.g. `title`, `institution`, `license` and `long_name`) are fields that contain free strings and are interpreted by the user. Often global-level, these tend to describe the larger context in which the dataset sits, and enable the user to understand where the data came from, how it was generated, and enables both reuse and reproduction. These attributes could document the climate model used to generate the data, the project in which the data generation was conducted, the contact information of the dataset creator or manager, or a list of [keywords](../tech/keywords.md) similar to those in a journal publication. Conventions usually contain suggested keys to use, with the values defined according to your use case. - -For a list of recommended attributes to include and define in most climate datasets and how to apply them, see the [conventions page in this book](../tech/conventions.md). We recommend implementing these metadata fields into your post-processing workflow so that these are automatically applied by your data creation code/script. - -For a more technical description of the netCDF format and metadata, see https://acdguide.github.io/BigData/data/data-netcdf.html. - -### Example metadata using ncdump - -Ncdump from a simple-ish file. Not CMIP6, coz I want to save that for later; prefer something less complex. +* Machine-readable attributes (e.g., `units`, `standard_name`, `missing_value` and `calendar`) typically describe the data itself, are usually variable-level, and can be automatically interpreted by standard plotting and analysis tools if set according to common [conventions](../concepts/conventions.md) and [controlled vocabularies](../concepts/controlled-vocab.md). 
These conventions typically contain suggestions for the attribute key and value using commonly understood terms and are highly recommended to enable analysis and visualisation with standard software packages. + +* Human-readable attributes (e.g., `title`, `institution`, `license` and `long_name`) are fields that contain free strings and are interpreted by the user. Often global-level, these tend to describe the larger context in which the dataset sits, and enable the user to understand where the data came from, how it was generated, and enables both reuse and reproduction. These attributes could document the climate model used to generate the data, the project in which the data generation was conducted, the contact information of the dataset creator or manager, or a list of [keywords](../tech/keywords.md) similar to those in a journal publication. + +CF conventions cover both variable-level and global attributes, while the ACDD conventions are an extension covering mostly 'human-readable' information. + +````{note} +We give a detailed overview of how to write CF compliant files in the [Technical tips](../tech/conventions.md) appendix. This includes known issues that can be caused by not following the standards, as CF conventions are used by developers of tools that access and analyse netCDF data to make assumptions on the data structure. We recommend implementing these metadata fields in the post-processing workflow so that these are automatically generated when possible. For a more technical description of the netCDF format and metadata, see the [ACDG guidelines on BigData](https://acdguide.github.io/BigData/data/data-netcdf.html). +```` + +:::{dropdown} Example of netCDF file which adheres to CF and ACDD conventions +netcdf heatflux {
+    dimensions:
+        lon = 1440 ;
+        lat = 720 ;
+        time = 12 ;
+    variables:
+        double lon(lon) ;
+            lon:units = "degrees_east" ;
+            lon:long_name = "longitude" ;
+            lon:standard_name = "longitude" ;
+        double lat(lat) ;
+            lat:units = "degrees_north" ;
+            lat:long_name = "latitude" ;
+            lat:standard_name = "latitude" ;
+        double time(time) ;
+            time:units = "days since 1990-1-1 0:0:0" ;
+            time:long_name = "time" ;
+            time:calendar = "gregorian" ;
+            time:standard_name = "time" ;
+        float hfls(time, lat, lon) ;
+            hfls:units = "W m-2" ;
+            hfls:_FillValue = NaNf ;
+            hfls:long_name = "latent heat flux" ;
+            hfls:standard_name = "surface_upward_latent_heat_flux" ;
+            hfls:ALMA_short_name = "Qle" ;
+        float hfls_sd(time, lat, lon) ;
+            hfls_sd:units = "W m-2" ;
+            hfls_sd:_FillValue = NaNf ;
+            hfls_sd:long_name = "error (standard deviation) of latent heat flux" ;
+            hfls_sd:standard_name = "surface_upward_latent_heat_flux" ;
+            hfls_sd:cell_methods = "area: standard_deviation" ;
+
+// global attributes:
+    :Conventions = "CF-1.7, ACDD-1.3" ;
+    :title = "Global surface latent heat flux from reanalysis and observations" ;
+    :product_version = "1.0" ;
+    :summary = "Surface latent heatflux dataset with error estimates derived from reanalysis and observations" ;
+    :source = "Reanalysis: ...; Observations: ...";
+    :creator_name = "author" ;
+    :contact = "author@uni.edu" ;
+    :contributor_name = "data manager" ;
+    :contributor_role = "curator" ;
+    :contributor_email = "curator@uni.edu" ;
+    :institution = "University of ..." ;
+    :organisation = "Centre for ..." ;
+    :id = "https://doi.org/10.12345/dfg56th7" ;
+    :date_created = "2023-04-15" ;
+    :license = "http://creativecommons.org/licenses/by/4.0/" ;
+    :keywords = "040105 Climatology (excl. Climate Change Processes) and 040608 Surfacewater Hydrology" ;
+    :references = "Author, 2023. Global surface latent heat flux from reanalysis and observations v1.0. Publisher, (Dataset), doi:10.12345/dfg56th7" ;
+    :time_coverage_start = "1990-01-01" ;
+    :time_coverage_end = "2022-12-31" ;
+    :geospatial_lat_min = "-90" ;
+    :geospatial_lat_max = "90" ;
+    :geospatial_lon_min = "-180" ;
+    :geospatial_lon_max = "180" ;
+    :history = "nccat hfls*.nc heatflux.nc" ;
+}
+::: ## File & directory organisation -Climate datasets are complex and can be chopped up and stored in many different ways. For example, datasets can broken up into separate files that contain full timeseries of a single variable, or each file could contain one month of data for all relevant variables. Data files should be structured into a navigable directory structure that sorts the files into some high-level dimensions that make it easier to access and understand what different data the set contains, with a directory tree that is meaningful and interpretable, and keeps the number of individual files in a given directory to below ~1000. Depending on how many files are produced, implementing a directory structure early before the number of files become hard to track, is recommended. Some suggestions for directory tree components/dimensions are: variable, frequency, modelling realm, experiment name, or governing project. - -There is a type of complex directory structure known as a '[Directory Reference Syntax](../tech/drs.md)' (DRS), which is typically standardised through intercomparison projects such as CMIP and CORDEX. The directory tree components are usually very broad and cover many contextual aspects, especially when multiple models or modelling institutions are using the same directory structure and controlled vocabulary. - -Filenaming is an important consideration here, as confusion can easily arise if you have not named your files with enough verbosity to disentangle similar, but critically different files. A common recommendation is to name the files in a similar way to the directory structure (e.g. if you have both monthly and daily data, put these into two separate sub-directories and include the frequency in the filenames). Other provenance details, such as experiment name and model, should be considered in the filename itself to reduce the risk of confusing different outputs. 
For example, a file called 'ocean_2014.nc' can be misunderstood very easily, but a file called 'ACCESS-ESM_historical-pacemaker_ocean_monthly_2014.nc' is much clearer and will reduce the risk of having to rerun models, or misplacing irreplaceable observational data. -See [this page](../tech/filenames.md) for some tips to creating a robust filenaming convention for your datasets. - - -## Backups & archiving - -Climate data can often be difficult (or impossible) to regenerate, due to large compute costs and non-repeatable conditions. A good backup strategy is vital to ensuring that the risk of data loss is minimised, and storage/compute resources are used efficiently. It is also important to note that NCI's `/g/data` storage system is NOT backed up. - -Our general recommendations are: -* keep only data intended for sharing in common areas. -* working data should be restricted to your personal space, or a defined collaborative working space. -* ancillary data (model input/config files, and other data that is not being actively used) tarred and archived into longer-term storage. -* raw data (e.g. unprocessed or semi-processed model output) should be backed up onto a tape system (e.g. NCI's [MDSS](../tech/massdata.md)) to enable regeneration of processed datasets from the raw data, without having to rerun models. -* a backup strategy should be set-up and implented early (ideally as part of a data management plan; see next section). - -For more detailed guidance on backing up data, see our [guide to creating a backup strategy](../concepts/backup.md) and [backup checklist](../tech/backup-checklist.md). +Climate datasets can be complex and often too big to be stored in a single file. +Raw output refers to the files generated by a model, analysis workflow, or instrument. Raw output is optimised to the tool that produced it. For example, models often output all variables at a single timestep in one file. 
This is optimal for the model but not necessarily for analysis and long-term storage. +Data files should be structured into a navigable directory structure that describes what different data the set contains. Ideally a directory tree should be meaningful and interpretable, with fewer than ~1000 individual files in each directory. Depending on how many files are produced, implementing a directory structure early, before the number of files becomes hard to track, is recommended. +Data Reference Syntax (DRS) is a naming system to be used within files, directories, and metadata to identify data sets. DRS were first established by intercomparison projects such as CMIP and CORDEX based on Controlled Vocabularies (CV). While aspects of these DRS don't apply to smaller datasets, they offer a useful framework to organise climate data and choose names that are meaningful and recognisable. -Moving data between disks (e.g. from NCI's `/scratch` to `/g/data/`) and systems (e.g. from NCI to public cloud) can be challenging, especially for datasets at the TB-scale. We recommend using [rsync](https://rsync.samba.org/) wherever possible, because it contains a large amount of flexibility (useful for the variety of use cases when moving data), and is generally very stable (stability is a major issue when moving data between systems). For more guidance on moving large data, see the [Moving Data page](../tech/moving-data.md). +File naming is also important, as confusion can easily arise if names are not sufficiently descriptive. A common recommendation is to name the files in a similar way to the directory structure. For example, if you have both monthly and daily data, put these into two separate sub-directories and include the frequency in the filenames. Other provenance details, such as experiment name and model, should be considered in the filename itself to reduce the risk of confusing different outputs.
For example, a file called 'ocean_2014.nc' can be misunderstood very easily, but a file called 'ACCESS-ESM_historical-pacemaker_ocean_monthly_2014.nc' is much clearer and will reduce the risk of having to rerun models or misplacing irreplaceable observational data. +See [this page](../tech/drs-names.md) for some tips to creating a robust directory structure and filenaming convention for your datasets. ## Data management plans & documentation A **Data Management Plan** (DMP) is a general term for a document that describes the intended methods of the creation, storage, management and distribution of a given collection of data, and defines or cites the rules, policies or principles that govern the dataset. DMPs can vary greatly depending on context, the type of data, or intended audience. A DMP is also a living document, one that evolves through the various stages of the project in which the data is created. -Generally, however, a DMP should provide guidance to data managers, in order to help inform decision making. E.g., where should a new dataset be stored, who should have access to it, when should it be deleted. In the case where decisions are not clearly indicated from the DMP, it should indicate who is responsible for making the decision. -Ideally, a DMP is prepared as an integral part of project planning, with a data custodian also responsible for it's continued development. An initial DMP can be as simple as notes in a text file, and include basic information such as backup locations, input files, tools used, and the intended use of the final dataset. Additionally, file metadata such as licences (see https://acdguide.github.io/Governance/concepts/license.html) and contact information are regularly included in DMPs. +Generally, however, a DMP should provide guidance to data managers to inform decision making. E.g., where should a new dataset be stored, who should have access to it, when should it be deleted. 
In the case where decisions are not clearly indicated from the DMP, it should indicate who is responsible for making the decision. +Ideally, a DMP is prepared as an integral part of project planning, with a data custodian also responsible for its continued development. An initial DMP can be as simple as notes in a text file, and include basic information such as backup locations, input files, tools used, and the intended use of the final dataset. Additionally, file metadata such as licences (see https://acdguide.github.io/Governance/concepts/license.html) and contact information are regularly included in DMPs. For more information on Data Management Plans, see https://acdguide.github.io/Governance/concepts/dmp.html -While similar to DMP in many ways, **data documentation** is a distinct purpose in that it provides guidance to users of the data (rather than managers of the data), including those who intend to reproduce it. Data documentation will include many of the same information as a DMP, such as the method of data generation (input files, software used), distribution details, and project context. Data documentation are typically kept alongside the dataset as in a README file at the top level directory, which provide high-level information about the dataset (e.g., when it was created, who to contact, and how to use it). However, data documentation is a general term for 'user guidance of a dataset', and can also be prepared in the form of journal articles that provide much more detail. In cases where the data itself is not self-describing (e.g. CSV files), data documentation will need to provide low-level metadata such as dimensions and units. - +While similar to a DMP in many ways, **data documentation** is a distinct purpose in that it provides guidance to users of the data (rather than managers of the data), including those who intend to reproduce it. 
Data documentation will include much of the same information as a DMP, such as the method of data generation (input files, software used), distribution details, and project context. Data documentation is typically kept alongside the dataset in a README file in the top-level directory, which provides high-level information about the dataset (e.g., when it was created, who to contact, and how to use it). However, data documentation is a general term for 'user guidance of a dataset' and can also be prepared in the form of journal articles that provide much more detail. In cases where the data itself is not self-describing (e.g., CSV files), data documentation will need to provide low-level metadata such as dimensions and units. ## Code management & version control @@ -89,3 +133,19 @@ Of course, code cannot exclusively exist in a repository. It is suggested (parti Data should also be versioned, especially for more underpinning datasets such as model output & data products, however best practice in this domain is still evolving. CMIP data includes versioning at the variable level that uses date of file creation, however this is just one method. For more information on versioning, see https://acdguide.github.io/Governance/tech/versioning.html + +## Backups & archiving + +Climate data can often be difficult (or impossible) to regenerate, due to large compute costs and non-repeatable conditions. A good backup strategy is vital to ensuring that the risk of data loss is minimised, and storage/compute resources are used efficiently. It is also important to note that NCI's `/g/data` storage system is **not** backed up. + +Our general recommendations are: +* keep only data intended for sharing in common areas. +* working data should be restricted to your personal space, or a defined collaborative working space. +* ancillary data (model input/config files, and other data that is not being actively used) should be tarred and archived into longer-term storage.
+* raw data (e.g., unprocessed or semi-processed model output) should be backed up onto a tape system (e.g., NCI's [MDSS](../tech/massdata.md)) to enable regeneration of processed datasets from the raw data, without having to rerun models. +* a backup strategy should be set-up and implemented early (ideally as part of a data management plan; see next section). + +For more detailed guidance on backing up data, see our [guide to creating a backup strategy](../concepts/backup.md) and [backup checklist](../tech/backup-checklist.md). + +Moving data between disks (e.g., from NCI's `/scratch` to `/g/data/`) and systems (e.g., from NCI to public cloud) can be challenging, especially for datasets at the TB-scale. We recommend using [rsync](https://rsync.samba.org/) wherever possible, because it contains a large amount of flexibility (useful for the variety of use cases when moving data), and is generally very stable (stability is a major issue when moving data between systems). For more guidance on moving large data, see the [Moving Data page](../tech/moving-data.md). + diff --git a/Governance/create/create-intro.md b/Governance/create/create-intro.md index ea1a14d..7ca4158 100644 --- a/Governance/create/create-intro.md +++ b/Governance/create/create-intro.md @@ -1,24 +1,20 @@ # Guidelines to create a climate dataset -## UNDER DEVELOPMENT - -## Scope of the guidelines - These guidelines cover the various aspects of creating robust and well-described climate data for reuse, analysis, sharing, and publication. -We have identified five primary use cases that guide the recommendations and requirements you should follow when creating your climate datasets: -1. for your own reuse and analysis (basic dataset needs) -2. sharing with colleagues for collaboration (minimum sharing recommendations, no citation necessary) -3. for publication alongside a research paper (journal requirements apply) -4. 
for publication into a large multi-institutional intercomparison project like CMIP (strict standards apply) -5. for productisation, including market-readiness and commercialisation (standards to be defined) +We have identified five primary use cases that guide the recommendations and requirements to follow when creating climate datasets: +1. Own reuse and analysis: basic dataset needs. +2. Sharing with colleagues for collaboration: minimum sharing recommendations, no citation necessary. +3. Publication alongside a research paper: journal requirements apply. +4. Publication into a specific project: project standards apply. +5. Productisation, including market-readiness and commercialisation: standards depend on audience and intended use. -Additionally, we have identified two main situations you may find yourself in: i) preparing your datasets from scratch (i.e. you have 'raw' data that is currently undescribed, and in a format that is not analysis-ready); or ii) deriving metrics or indices from a reference dataset (e.g. performing an analysis on CMIP data for a research publication). We will mostly be discussing the first situation where you are creating climate data from scratch, with specific recommendatations for the second situation later in the section. +We will mostly be discussing starting datasets from scratch from 'raw' data that is currently undescribed, and in a format that is not analysis-ready. Datasets can also be derived from existing data, as result of analysis or deriving metrics and indices from a reference dataset. We provide specific recommendations for the second situation later in the section. ## Index -* [Dataset creation basics & sharing recommendations](create-basics.md) -This is an overview of the landscape of climate datasets, including the various components of netCDF files and their storage in POSIX systems, and some best practice recommendations for the back up of data and management of the creation process. 
+* [Dataset creation basics](create-basics.md) +An overview of the landscape of climate datasets, including the various components of netCDF files and their storage in POSIX systems, and best practice recommendations for the backup of data and management of the creation process. * File formats, metadata & coordinates * File & directory organisation @@ -36,7 +32,7 @@ This is the more practical description of how to create climate datasets (genera   * [Requirements for publication & productisation](create-publishing.md) -This chapter outlines the standards for publication data that either accompanies a journal article or is submitted to an intercomparison project (e.g. CMIP), and some recommendations for tools to aid this process. +This chapter outlines the standards for publication data that either accompanies a journal article or is submitted to an intercomparison project (e.g., CMIP), and some recommendations for tools to aid this process. * Publishing in a journal * Submitting to an intercomparison project diff --git a/Governance/create/create-new-derived.md b/Governance/create/create-new-derived.md index 8e3579b..6e51882 100644 --- a/Governance/create/create-new-derived.md +++ b/Governance/create/create-new-derived.md @@ -1,6 +1,12 @@ # New, modified, and derived datasets ## Creating new datasets from raw data +Paola (new comments following meeting Sep23): + +We discussed here mentioning tools to generate/modify a netcdf file (ncdump/ncgen, nco to modify attributes, how xarray/matlab "create" netcdf file) +rather than trying to re-create every possible workflow. +As well as things a user should check to make sure they're following the recommendations listed in create-basics. For example, are the attributes still relevant both at global and variable level? + Paola: however rare, we could cover starting from a template, as for a cdl file (i.e.
a ncdump output style file) @@ -31,4 +37,4 @@ Make sure original attributes/documentation are still relevant be careful particularly with units, cell_methods and coordinates that might have changed Chloe: -Provenance: https://acdguide.github.io/Governance/concepts/provenance.html \ No newline at end of file +Provenance: https://acdguide.github.io/Governance/concepts/provenance.html diff --git a/Governance/markdown.md b/Governance/markdown.md deleted file mode 100644 index 1cc9c34..0000000 --- a/Governance/markdown.md +++ /dev/null @@ -1,125 +0,0 @@ -# Markdown Files - -Whether you write your book's content in Jupyter Notebooks (`.ipynb`) or -in regular markdown files (`.md`), you'll write in the same flavor of markdown -called **MyST Markdown**. - -## What is MyST? - -MyST stands for "Markedly Structured Text". It -is a slight variation on a flavor of markdown called "CommonMark" markdown, -with small syntax extensions to allow you to write **roles** and **directives** -in the Sphinx ecosystem. - -## What are roles and directives? - -Roles and directives are two of the most powerful tools in Jupyter Book. They -are kind of like functions, but written in a markup language. They both -serve a similar purpose, but **roles are written in one line**, whereas -**directives span many lines**. They both accept different kinds of inputs, -and what they do with those inputs depends on the specific role or directive -that is being called. - -### Using a directive - -At its simplest, you can insert a directive into your book's content like so: - -```` -```{mydirectivename} -My directive content -``` -```` - -This will only work if a directive with name `mydirectivename` already exists -(which it doesn't). There are many pre-defined directives associated with -Jupyter Book. 
For example, to insert a note box into your content, you can -use the following directive: - -```` -```{note} -Here is a note -``` -```` - -This results in: - -```{note} -Here is a note -``` - -In your built book. - -For more information on writing directives, see the -[MyST documentation](https://myst-parser.readthedocs.io/). - - -### Using a role - -Roles are very similar to directives, but they are less-complex and written -entirely on one line. You can insert a role into your book's content with -this pattern: - -``` -Some content {rolename}`and here is my role's content!` -``` - -Again, roles will only work if `rolename` is a valid role's name. For example, -the `doc` role can be used to refer to another page in your book. You can -refer directly to another page by its relative path. For example, the -role syntax `` {doc}`intro` `` will result in: {doc}`intro`. - -For more information on writing roles, see the -[MyST documentation](https://myst-parser.readthedocs.io/). - - -### Adding a citation - -You can also cite references that are stored in a `bibtex` file. For example, -the following syntax: `` {cite}`holdgraf_evidence_2014` `` will render like -this: {cite}`holdgraf_evidence_2014`. - -Moreoever, you can insert a bibliography into your page with this syntax: -The `{bibliography}` directive must be used for all the `{cite}` roles to -render properly. -For example, if the references for your book are stored in `references.bib`, -then the bibliography is inserted with: - -```` -```{bibliography} -``` -```` - -Resulting in a rendered bibliography that looks like: - -```{bibliography} -``` - - -### Executing code in your markdown files - -If you'd like to include computational content inside these markdown files, -you can use MyST Markdown to define cells that will be executed when your -book is built. Jupyter Book uses *jupytext* to do this. - -First, add Jupytext metadata to the file. 
For example, to add Jupytext metadata -to this markdown page, run this command: - -``` -jupyter-book myst init markdown.md -``` - -Once a markdown file has Jupytext metadata in it, you can add the following -directive to run the code at build time: - -```` -```{code-cell} -print("Here is some code to execute") -``` -```` - -When your book is built, the contents of any `{code-cell}` blocks will be -executed with your default Jupyter kernel, and their outputs will be displayed -in-line with the rest of your content. - -For more information about executing computational content with Jupyter Book, -see [The MyST-NB documentation](https://myst-nb.readthedocs.io/). diff --git a/Governance/tech/drs-names.md b/Governance/tech/drs-names.md new file mode 100644 index 0000000..4457ae4 --- /dev/null +++ b/Governance/tech/drs-names.md @@ -0,0 +1,59 @@ +# Choosing a directory structure and filenames + +The names you choose for files and directories and generally the way you organise your data, i.e. your directory structure, can help navigating the data, provide extra information, avoid confusion and avoid the user ending up accessing the wrong data. In many cases the best file organisation will depend on the specific research project and the actual server where the data is stored. The global Coupled Model Intercomparison Project (CMIP) has adopted a **Data Reference Syntax (DRS)**, based on the **controlled vocabularies (CVs)** used in model metadata, to define their file names and directory structures. +Here we list a few guidelines and tips to help you decide. + +## General considerations +* Familiarise yourself with the storage system: make sure you are storing the files in the most appropriate place, find out whether the storage is backed up, check what your allocation is, and learn what rules or best practices apply. 
+* Take into account how yourself or others might want to use the data, this is particularly important when deciding the DRS but also how to divide data across files for big datasets as model output. Doing so at the start of the project will spare you lots of time you might otherwise spend re-processing all your files. +* Be consistent, this applies both to the organisation and the naming, consistency is essential for the data to be machine-readable, i.e. data which is easy to access by coding. In fact, use community standards and/or controlled vocabularies wherever possible. +* Consider adding a `readme` file in the main directory, including an explanation of the DRS and the naming conventions, abbreviation and/or codes you used. If you used standards and controlled vocabularies all you have to do is to include a link to them. + +## Directory structure + +![Example of directory structure](../images/example_drs.png) + +The figure above shows an example of an organised working directory for a model output. + +**Things to consider:** + +* Try to organise files in directories based on type and how you process them +for the final output. Also consider how others might use them: are they going to be used for analysis or they could be used as forcing or restart files for a model? +* If there is an existing DRS defined for an analogous data product (e.g. input data to a climate analysis workflow), would it help yourself and others to structure your output following a similar convention? +* Think of the way you would access these directories in a code, as an example having the variable directories using exactly the same name as the actual variable. +* Make sure your code is separate from your data, you want to be able to use something like git to version control it and possibly GitHub to back it up easily. +* Have at least one `readme` file with detailed metadata, possibly more if you have a lot of directories/files. 
You cannot realistically use git for managing versions of data but you can use git to version control your `readme` files. +* Review at regular intervals what you are keeping, what needs to be removed and how things are organised. + +**Reference examples:** + +* The CMIP6 DRS is defined in the [CMIP6 Controlled Vocabularies document](https://docs.google.com/document/d/1h0r8RZr_f3-8egBMMh7aqLwy3snpD6_MrDz1q8n5XUk/edit), starting on p.13. +* The [CORDEX DRS](http://is-enes-data.github.io/CORDEX_adjust_drs.pdf) builds on the CMIP DRS to apply to regional climate models. + +## File naming +You can use filenames to include information as: + +* project, simulation and/or experiment acronyms, you might have to use a combination of them +* spatial coverage: the region or coordinates range covered by the data, could also be a specific domain for climate model data, e.g., ocean, land etc. +* grid: could be either a grid label or spatial resolution +* temporal coverage: a specific year/date or a temporal range +* temporal frequency: monthly, daily etc. +* type of data: again this depends on context, if the same directory contains data from different instrumentations it is important to specify the instrument in the name. For coupled model output this could be the model component, if you are using one file per variable, the variable name +* version: this is really important if you are sharing the data even if only 1 version exists at the time +* correct file extension + +# Tips for machine-readable files +* avoid special characters: ~ ! @ # $ % ^ & * ( ) ` ; < > ? , [ ] { } ‘ “ +* do not use spaces to separate words; use underscores "_" or dashes "-" or CamelCase +* use YYYYMMDD for dates, it will sort your files in chronological order, absolutely avoid "Jan, Feb, .." for months as they are much harder to code for. +* for number sequences, use leading zeros: so 001, 002,.. 020,.. 103 rather than 1, 2,.. 20, .. 
103 +* try to avoid overly long names: keep a single file or directory name under 255 characters and a full path under 30000 characters. +* avoid having a large number of files in a single directory, but also an excessive number of directories with one file each +* always include the file extension; some software can recognise files from their header, but this is not always the case + +## Online Resources +We partially based this page on the resources listed below, and recommend checking them for more insight and advice. + +* [Best practice to organise your data](https://www.earthdatascience.org/courses/intro-to-earth-data-science/open-reproducible-science/get-started-open-reproducible-science/best-practices-for-organizing-open-reproducible-science/) - part of an Open reproducible science course from the University of Colorado +* [Software Carpentry video covering DRS best practices](https://youtu.be/3MEJ38BO6Mo) +* [Best file naming practice handout (pdf) from Stanford University](https://stanford.box.com/shared/static/yl5a04udc7hff6a61rc0egmed8xol5yd.pdf) diff --git a/Governance/tech/drs.md b/Governance/tech/drs.md index aa76cc1..49b5002 100644 --- a/Governance/tech/drs.md +++ b/Governance/tech/drs.md @@ -1,4 +1,4 @@ -# Choosing a directory structure (DRS) and filenames +# Choosing a directory structure: DRS and filenames The names you choose for files and directories and generally the way you organise your data, i.e. your directory structure, can help navigating the data, provide extra information, avoid confusion and avoid the user ending up accessing the wrong data. In many cases the best file organisation will depend on the specific research project and the actual server where the data is stored. The global climate modelling intercomparison project (CMIP) has adopted a **Data Reference Syntax (DRS)**, based on the **controlled vocabularies (CVs)** used in model metadata, to define their file names and directory structures. Here we list a few guidelines and tips to help you decide.
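
The file naming tips for machine-readable files in `drs-names.md` (no spaces or special characters, `YYYYMMDD` dates, leading zeros, a version, a file extension) could be sketched in Python roughly as follows. This is a minimal illustration only: the helper names `safe_token` and `build_filename`, the field order, the `v<N>` version label and the three-digit member number are all hypothetical choices made for this example, not part of the CMIP DRS or any other standard.

```python
import re
from datetime import date

# Characters the tips suggest avoiding, plus any whitespace.
_UNSAFE = re.compile(r"[~!@#$%^&*()`;<>?,\[\]{}'\"‘’“”\s]")


def safe_token(text: str) -> str:
    """Replace spaces and special characters with single dashes."""
    cleaned = _UNSAFE.sub("-", text.strip())
    return re.sub(r"-{2,}", "-", cleaned).strip("-")


def build_filename(variable, project, frequency, start, end,
                   version, member=None, ext="nc"):
    """Join name components with underscores (hypothetical field order)."""
    parts = [safe_token(variable), safe_token(project), safe_token(frequency)]
    if member is not None:
        # Leading zeros keep numbered files in the right sort order.
        parts.append(f"{member:03d}")
    # YYYYMMDD dates sort chronologically when sorted alphabetically.
    parts.append(f"{start:%Y%m%d}-{end:%Y%m%d}")
    parts.append(f"v{version}")
    return "_".join(parts) + f".{ext}"


name = build_filename("tas", "my project", "mon",
                      date(1990, 1, 1), date(1999, 12, 31),
                      version=1, member=2)
print(name)  # tas_my-project_mon_002_19900101-19991231_v1.nc
```

Underscores separate the fields and dashes are used inside a field, so a name can be split back into its components with `name.split("_")`. If a project already follows a community DRS, its own field order and vocabulary should take precedence over this sketch.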