Long format (lvl 1) - datum descriptions #61

ktoddbrown · 2024-11-22T19:31:23Z

ktoddbrown
Nov 22, 2024
Maintainer

This thread is intended to record discussion and decision points when designing the long data format (lvl 1) that we use as an intermediate between the original data format (lvl 0) and the curated data product (lvl 2). The long data format currently has two main components: the id locating the datum grouping and the datum description. We will focus on the datum description in this thread. (See #60 for discussion on groupings.)

We often think of data as recorded values. However, these values have associated units, methods, uncertainties, controlled vocabularies, and such. Often the non-value portion of a datum is recorded as metadata, separately from the primary data. As you can guess, there is an almost infinite number of 'things' associated with values, so we use a long tuple format to be as flexible as possible: `of_variable`, `is_type`, `with_entry`.

This means that an organic carbon fraction of 0.10 determined from loss on ignition might be written as follows.

| `of_variable` | `is_type` | `with_entry` |
| --- | --- | --- |
| organic_carbon | unit | mass ratio of organic carbon to fine earth |
| organic_carbon | value | 0.10 |
| organic_carbon | method | calculated from loss on ignition |
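
To make the tuple idea concrete, here is a minimal Python sketch of the three rows above and one way to regroup them by variable. The helper name `collect_datum` is hypothetical and not part of any existing code; entries are kept as strings, on the assumption that `with_entry` is stored as text.

```python
from collections import defaultdict

# Long-format rows as (of_variable, is_type, with_entry), copied from the table above.
rows = [
    ("organic_carbon", "unit", "mass ratio of organic carbon to fine earth"),
    ("organic_carbon", "value", "0.10"),
    ("organic_carbon", "method", "calculated from loss on ignition"),
]


def collect_datum(long_rows):
    """Group long-format rows into {of_variable: {is_type: with_entry}}.

    Hypothetical helper for illustration only.
    """
    datum = defaultdict(dict)
    for of_variable, is_type, with_entry in long_rows:
        datum[of_variable][is_type] = with_entry
    return dict(datum)


datum = collect_datum(rows)
print(datum["organic_carbon"]["value"])
print(datum["organic_carbon"]["unit"])
```

Adding a new kind of descriptor (say, an uncertainty) would then only add a row, not a column.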

Decision point: Do we repeat the 'meta' data for each unique id? We could instead report the unit once for the study or the site rather than repeating it for each layer id. This might make the data slightly harder to work with computationally but would greatly save on memory.
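
The trade-off in this decision point can be sketched as follows. Everything here is an assumption for illustration: the ids (`study_A`, `layer_1`, ...), the extra layer values, and the `lookup` helper are made up, and the real id structure is the subject of #60.

```python
# Made-up ids and values, for illustration only.
layer_ids = ["layer_1", "layer_2", "layer_3"]
unit = "mass ratio of organic carbon to fine earth"
values = {"layer_1": "0.10", "layer_2": "0.08", "layer_3": "0.05"}

# Option A: repeat the unit for every layer id.
repeated = []
for lid in layer_ids:
    repeated.append((lid, "organic_carbon", "unit", unit))
    repeated.append((lid, "organic_carbon", "value", values[lid]))

# Option B: state the unit once at the study level; layers carry only values.
shared = [("study_A", "organic_carbon", "unit", unit)]
shared += [(lid, "organic_carbon", "value", values[lid]) for lid in layer_ids]


def lookup(rows, layer_id, parent_id, of_variable, is_type):
    """Return the entry for a layer, falling back to its parent (study) id.

    Hypothetical helper; this fallback is the extra computational step
    that Option B requires.
    """
    for target in (layer_id, parent_id):
        for row_id, var, typ, entry in rows:
            if (row_id, var, typ) == (target, of_variable, is_type):
                return entry
    return None


print(len(repeated), len(shared))
print(lookup(shared, "layer_2", "study_A", "organic_carbon", "unit"))
```

With Option A every lookup is a direct match on the layer id; with Option B the row count grows more slowly, but each reader has to know the id hierarchy to resolve inherited entries.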

Level 0 data

`of_variable` here is most closely linked to the column name. However, the units or methods often need to be extracted from the column name or other contextual data. Occasionally a unit or method is in a second column of the original primary data. We deal with this by creating an annotation table for each level 0 data set that provides this mapping or extracts the unit information.
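
As a rough sketch of how such an annotation table might drive the conversion: the column name `oc_loi`, the layout of the annotation rows, and the convention that `None` means "take the entry from the cell" are all assumptions made for this example, not the project's actual annotation format.

```python
# Hypothetical level 0 data: one dict per original row, keyed by column name.
level0_rows = [{"oc_loi": "0.10"}, {"oc_loi": "0.08"}]

# Hypothetical annotation table: (column, of_variable, is_type, with_entry).
# with_entry=None means the entry is read from the cell in that column;
# otherwise the fixed text is used for every row.
annotations = [
    ("oc_loi", "organic_carbon", "value", None),
    ("oc_loi", "organic_carbon", "unit", "mass ratio of organic carbon to fine earth"),
    ("oc_loi", "organic_carbon", "method", "calculated from loss on ignition"),
]


def to_long(rows, annotation_table):
    """Apply the annotation table to level 0 rows, yielding long-format rows."""
    long_rows = []
    for row_index, row in enumerate(rows):
        for column, of_variable, is_type, with_entry in annotation_table:
            if column not in row:
                continue  # annotation refers to a column this row lacks
            entry = row[column] if with_entry is None else with_entry
            long_rows.append((row_index, of_variable, is_type, entry))
    return long_rows


long_rows = to_long(level0_rows, annotations)
for long_row in long_rows:
    print(long_row)
```

Here the row index stands in for the id of the datum grouping; a unit or method held in a second column could be handled the same way, with an annotation row that reads its entry from that column.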

Future adaptations

This works very well for data like organic carbon, bulk density, or soil color. However, it starts to break down for common entities like citations, people, times, or places. These entities have distinct elements that do not map onto the above types. In the future, it could be useful to provide categories for these data models that restrict each one to its expected types.
