Stephen Pascoe edited this page Apr 9, 2014 · 5 revisions
Wiki Reorganisation
This page has been classified for reorganisation. It has been given the category REVISE.
This page contains useful content but needs revision. It may contain out of date or inaccurate content.

Why we use symbolic links in drslib

This document explains the reasoning behind one aspect of drslib's implementation of the Data Reference Syntax (DRS) [1] that has proved controversial, namely the use of symbolic links to maintain multiple dataset versions on the filesystem. In my view the current implementation is a reasonable compromise between conflicting requirements, and any other structure would either reduce functionality or replace one set of challenges with more onerous ones. However, I recognise that symbolic links add an administrative burden to the management of CMIP5 data, so the case for their use has to be made. Here I seek to make that case.

DRS Directory Structure overview

For the purpose of this discussion we will ignore most of the DRS directory structure and focus on the structure below the publication-level dataset. In general terms a publication-level dataset is a bag of related variables from a particular simulation.

The DRS says a publication-level dataset should be structured as follows:

DATASET
|-- vYYYYMMDD/
|   |-- var1
|   |   `-- var1_*.nc
|   |...
|   `-- var_n/
|-- v...
`-- latest

That is, a series of subdirectories, one per dataset version. Each version directory contains a subdirectory for each variable, which in turn contains the NetCDF files for that variable at that version. The latest directory should also contain variable directories, holding the files of the latest version.

Implementation Requirements

The way drslib implements the DRS directory structure was chosen with four main requirements in mind:

  1. It should allow data from multiple versions to be kept on disk simultaneously.
  2. It should avoid storing multiple copies of files that are present in more than one version.
  3. It should be straightforward to copy dataset changes (i.e. differences between versions) between nodes to allow efficient replication.
  4. It should rely only on the filesystem so that generic tools like FTP could be used to expose the structure if necessary.

Requirements #1 and #2 conflict unless there is some form of indirection, and #4 points us towards hard or symbolic links (unless we resort to filesystem abstractions such as FUSE, a custom NFS server or the like). I argue that #3 also constrains what counts as a reasonable solution.

Solution implemented in drslib

Drslib adds another subdirectory to the dataset, named files, which holds the files added or changed in each version. It then populates each version directory with relative symbolic links that point into files.

This is best illustrated with an example. Below is a dataset at two versions. In the second version one file is replaced and two more are added.

DATASET
|-- files
|   |-- thetao_20091023
|   |   |-- thetao_1.nc
|   |   |-- thetao_2.nc
|   |   `-- thetao_3.nc (v1)
|   `-- thetao_20100101
|       |-- thetao_3.nc (v2)
|       |-- thetao_4.nc
|       `-- thetao_5.nc
|-- v20091023
|   `-- thetao
|       |-- thetao_1.nc -> ../../files/thetao_20091023/thetao_1.nc
|       |-- thetao_2.nc -> ../../files/thetao_20091023/thetao_2.nc
|       `-- thetao_3.nc -> ../../files/thetao_20091023/thetao_3.nc
|-- v20100101
|   `-- thetao
|       |-- thetao_1.nc -> ../../files/thetao_20091023/thetao_1.nc
|       |-- thetao_2.nc -> ../../files/thetao_20091023/thetao_2.nc
|       |-- thetao_3.nc -> ../../files/thetao_20100101/thetao_3.nc
|       |-- thetao_4.nc -> ../../files/thetao_20100101/thetao_4.nc
|       `-- thetao_5.nc -> ../../files/thetao_20100101/thetao_5.nc
`-- latest -> v20100101
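
The link construction shown in this tree can be sketched in a few lines of Python. This is a minimal illustration only, not drslib's actual code: the function name link_version and its arguments are hypothetical, and it assumes a POSIX filesystem and a single variable per call.

```python
import os
from pathlib import Path


def link_version(dataset, variable, version, previous=None):
    """Populate v<version>/<variable> with relative symbolic links.

    Hypothetical helper, not part of drslib. Files carried over from
    `previous` keep pointing at the files/ directory they were first
    published in; anything in files/<variable>_<version> overrides them.
    """
    dataset = Path(dataset)
    version_dir = dataset / f"v{version}" / variable
    version_dir.mkdir(parents=True)

    targets = {}
    if previous is not None:
        # Links in the previous version are already relative and sit at the
        # same depth, so their targets can be reused unchanged.
        for link in (dataset / f"v{previous}" / variable).iterdir():
            targets[link.name] = os.readlink(link)

    # Files added or replaced in this version win over inherited ones.
    new_files = dataset / "files" / f"{variable}_{version}"
    for real in new_files.iterdir():
        targets[real.name] = os.path.join(
            "..", "..", "files", new_files.name, real.name
        )

    for name, target in targets.items():
        os.symlink(target, version_dir / name)

    # Repoint `latest` at the new version directory.
    latest = dataset / "latest"
    if latest.is_symlink():
        latest.unlink()
    os.symlink(f"v{version}", latest)
```

Calling it once for 20091023 and then for 20100101 with previous="20091023" would reproduce the links in the tree, provided the two files/thetao_* directories already hold the real NetCDF files.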

This implementation has been criticised on several grounds. The first is that we should avoid symbolic links at all costs, but I have already argued that doing so would run counter to the requirements. The second is that many remote copy tools do not follow symbolic links correctly: for instance, rsync will work but GridFTP does not at this time. Would it not therefore be better to use symbolic links only for versions prior to the most recent? That would lead to an implementation similar to the one below.

Alternative Implementation

# At Node A

DATASET
|-- v20091023
|   `-- thetao
|       |-- thetao_1.nc -> ../../v20100101/thetao/thetao_1.nc
|       |-- thetao_2.nc -> ../../v20100101/thetao/thetao_2.nc
|       `-- thetao_3.nc (v1)
|-- v20100101
|   `-- thetao
|       |-- thetao_1.nc
|       |-- thetao_2.nc
|       |-- thetao_3.nc (v2)
|       |-- thetao_4.nc
|       `-- thetao_5.nc
`-- latest -> v20100101

In this case the v20100101 directory, representing the latest version, contains the real NetCDF files for that version, while previous versions contain superseded files and symbolic links. This has the advantage that remote copy tools can copy the latest version in full DRS structure without following symbolic links.

However, consider node A has the above structure and node B has the dataset at the previous version v20091023:

# At Node B

DATASET
|-- v20091023
|   `-- thetao
|       |-- thetao_1.nc
|       |-- thetao_2.nc
|       `-- thetao_3.nc (v1)
`-- latest -> v20091023

How would node B upgrade to v20100101? There is no set of directories that contains only the files that are new in v20100101. Even if the remote copy tool is smart enough to recognise symbolic links, it is still likely to be fooled by files moving to new directories, and is therefore likely to copy thetao_1.nc and thetao_2.nc needlessly. It appears we would need to calculate that thetao_3.nc (v2), thetao_4.nc and thetao_5.nc are the only files that need copying to node B.
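
One way such a calculation might be done is by comparing checksum manifests of the latest version directory on each node. The sketch below is an assumption for illustration, not something drslib or the DRS prescribes: the functions manifest and files_to_copy are hypothetical, and exchanging the manifests between nodes is left out.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large NetCDF files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(version_dir):
    """Map each file path (relative to the version directory) to its checksum.

    Symbolic links to files are followed, so the result does not depend on
    whether a node stores a file as a real file or as a link.
    """
    version_dir = Path(version_dir)
    return {
        path.relative_to(version_dir).as_posix(): sha256_of(path)
        for path in version_dir.rglob("*")
        if path.is_file()
    }


def files_to_copy(source_manifest, dest_manifest):
    """Files that are missing at the destination or differ in content."""
    return sorted(
        name
        for name, digest in source_manifest.items()
        if dest_manifest.get(name) != digest
    )
```

For the example above, comparing node A's v20100101 with node B's v20091023 would yield thetao_3.nc, thetao_4.nc and thetao_5.nc (each under thetao/), at the cost of checksumming every file on both nodes.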

Once these files have been copied, node B must move thetao_1.nc and thetao_2.nc from v20091023 to v20100101 and replace them with symbolic links. Files that persist between versions must therefore be moved at every version change, which is arguably more error-prone than creating new symbolic links.
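
To make the move-and-relink step concrete, here is a rough sketch of what node B would have to do under the alternative layout. The function carry_forward is hypothetical, and it assumes the files that are new in the later version have already been copied into place.

```python
import os
import shutil
from pathlib import Path


def carry_forward(dataset, variable, old, new):
    """Move unchanged real files from v<old> to v<new>, leaving links behind.

    Hypothetical helper for the alternative layout. Returns the names of the
    files that were moved.
    """
    dataset = Path(dataset)
    old_dir = dataset / f"v{old}" / variable
    new_dir = dataset / f"v{new}" / variable
    moved = []
    for path in sorted(old_dir.iterdir()):
        if path.is_symlink() or (new_dir / path.name).exists():
            # Already a link, or superseded by a file in the new version:
            # the old copy stays where it is.
            continue
        shutil.move(str(path), str(new_dir / path.name))
        # Until this link is created the old version is missing the file.
        os.symlink(os.path.join("..", "..", f"v{new}", variable, path.name), path)
        moved.append(path.name)
    return moved
```

Note the window between the move and the creation of the link, during which the old version is incomplete. An interruption at that point leaves the dataset in an inconsistent state, which is the kind of error the paragraph above is concerned with.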

The process with the original implementation is somewhat simpler, although it still requires tool support. If the remote copy tool is not aware of symbolic links, every directory matching files/*_20100101 must be copied to node B and the new version directory then recreated.
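
The copy step might look like the sketch below. It is illustrative only: copy_new_file_dirs is a hypothetical name, and a local shutil.copytree stands in for whatever remote copy tool is actually used. Recreating the symbolic links in the new version directory is a separate step and is not shown.

```python
import shutil
from pathlib import Path


def copy_new_file_dirs(src_dataset, dst_dataset, version):
    """Copy every files/*_<version> directory from one dataset to another.

    Hypothetical helper. Only real files are involved, so the copy tool
    never has to interpret a symbolic link.
    """
    src_files = Path(src_dataset) / "files"
    dst_files = Path(dst_dataset) / "files"
    dst_files.mkdir(parents=True, exist_ok=True)
    copied = []
    for src_dir in sorted(src_files.glob(f"*_{version}")):
        if src_dir.is_dir():
            # copytree raises if the destination already exists, which
            # guards against overwriting a version that is already present.
            shutil.copytree(src_dir, dst_files / src_dir.name)
            copied.append(src_dir.name)
    return copied
```

Because each files/<variable>_<version> directory contains exactly the files that changed in that version, selecting what to transfer reduces to a filename pattern match rather than a comparison of file contents.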

Conclusion

I believe that there is no easy solution within the constraints of the requirements. Drslib's implementation probably represents the best compromise available. Unless the requirements are reconsidered, for instance by accepting that files must be copied multiple times when replicating, or that multiple copies of some files must be kept when storing multiple versions, we are unlikely to find a better solution.

References

[1] http://cmip-pcmdi.llnl.gov/cmip5/docs/cmip5_data_reference_syntax.pdf
