Long format (lvl 1) - datum descriptions #61
ktoddbrown started this conversation in Ideas
Replies: 0 comments
This thread is intended to record discussion and decision points when designing the datum grouping in the long data format (lvl 1), which we use as an intermediate between the original data format (lvl 0) and the curated data product (lvl 2). The long data format currently has two main components: the id locating the datum grouping and the datum description. We will focus on the datum description in this thread. (See #60 for discussion on groupings.)
We often think of data as recorded values. However, these values have associated units, methods, uncertainties, controlled vocabularies, and such. Often the non-value portion of a datum is recorded as metadata separately from the primary data. As you can guess, there is an almost infinite number of 'things' associated with values, so we use a long tuple format to try to be as flexible as possible: of_variable, is_type, with_entry. This means that an organic carbon fraction of 0.10 determined from loss on ignition might be written as follows.
of_variable | is_type | with_entry
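As a minimal sketch of the tuple format described above, the organic carbon example could be held as rows of (of_variable, is_type, with_entry). The variable name `organic_carbon_fraction`, the type labels `value`, `unit`, and `method`, and the unit entry are illustrative assumptions, not the project's actual vocabulary; only the 0.10 value and the loss-on-ignition method come from the text.

```python
from typing import NamedTuple


class DatumDescription(NamedTuple):
    """One row of the long format: which variable, what kind of entry, the entry."""
    of_variable: str
    is_type: str
    with_entry: str


# Hypothetical vocabulary: the names and the unit string are assumptions.
rows = [
    DatumDescription("organic_carbon_fraction", "value", "0.10"),
    DatumDescription("organic_carbon_fraction", "unit", "mass fraction"),
    DatumDescription("organic_carbon_fraction", "method", "loss on ignition"),
]


def describe(rows, variable):
    """Collect every is_type/with_entry pair recorded for one variable."""
    return {r.is_type: r.with_entry for r in rows if r.of_variable == variable}


organic_carbon = describe(rows, "organic_carbon_fraction")
```

Entries are kept as strings here so that values, units, and methods can share one column; a real implementation might choose differently.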
Decision point: Do we repeat the 'meta' data for each unique id? We could instead report the unit once for the study or the site rather than repeating it for each layer id. This might make the data slightly harder to work with computationally but would save considerable memory.
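To make the trade-off in this decision point concrete, here is a rough sketch of both layouts. The ids (`study_1`, `layer_1`, ...), the values for layers 2 and 3, the type labels, and the `lookup` helper are all made up for illustration; the sketch also assumes every layer belongs to a single study.

```python
# Hypothetical ids and entries, for illustration only.
VARIABLE = "organic_carbon_fraction"
STUDY_ID = "study_1"
values = {"layer_1": "0.10", "layer_2": "0.08", "layer_3": "0.05"}
shared = [("unit", "mass fraction"), ("method", "loss on ignition")]

# Option A: repeat the unit and method for every layer id.
repeated = []
for layer, value in values.items():
    repeated.append((layer, VARIABLE, "value", value))
    for is_type, entry in shared:
        repeated.append((layer, VARIABLE, is_type, entry))

# Option B: record the unit and method once, at the study level.
compact = [(STUDY_ID, VARIABLE, is_type, entry) for is_type, entry in shared]
compact += [(layer, VARIABLE, "value", value) for layer, value in values.items()]


def lookup(rows, layer, variable, is_type, parent=STUDY_ID):
    """Check the layer's own rows first, then fall back to the parent study."""
    for scope in (layer, parent):
        for row_id, var, typ, entry in rows:
            if row_id == scope and var == variable and typ == is_type:
                return entry
    return None
```

With three layers, option A stores nine rows and option B stores five. The fallback step in `lookup` is the extra computational work that option B requires.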
Level 0 data
of_variable here is most closely linked to the column name. However, often the units or methods need to be extracted from the column name or other contextual data. Occasionally a unit or method is in a second column of the original primary data. We deal with this by creating an annotation table for each level 0 data set that provides this mapping or extracts the unit information.
Future adaptations
This works very well for data like organic carbon, bulk density, or soil color. However, it starts to break down a bit for common entities like citation, people, time, or places. These entities have distinct elements that do not break down into the above types. It could be useful in the future to provide categories for these data models that restrict entries to the expected types.
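One way such a restriction might look is a per-category list of expected types plus a check that flags anything else. The category names, the type labels, and the `unexpected_types` helper below are hypothetical and only meant to illustrate the idea.

```python
# Hypothetical categories and their expected is_type labels.
EXPECTED_TYPES = {
    "measurement": {"value", "unit", "method", "uncertainty"},
    "citation": {"author", "year", "title"},
    "person": {"name", "role"},
}


def unexpected_types(category, rows):
    """Return the (of_variable, is_type, with_entry) rows whose is_type
    is not expected for the given category."""
    allowed = EXPECTED_TYPES[category]
    return [row for row in rows if row[1] not in allowed]


citation_rows = [
    ("primary_citation", "title", "Example title"),
    ("primary_citation", "unit", "percent"),  # a unit makes no sense for a citation
]
flagged = unexpected_types("citation", citation_rows)
```

Here `flagged` holds only the row with the `unit` type, since that label is not in the citation category's expected set.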