Long format (lvl 1) - datum groupings #60
Replies: 4 comments
-
It might make sense to move the current read files from returning a generic 'long' format to the specific 'stacked', 'joined', or 'threeTable' formats described above.
-
This three table format is very centered on geolocation applications (like digital soil mapping) rather than observation-centered applications (like pedotransfer function development). Maybe call it 'geolocation' instead?
-
ok, assuming I am understanding (big IF)...the issue(s) are:
Which means there are no globally unique IDs; tabular data are primarily not stand-alone, they require values from other tables to maintain identity/coherence/context. To further complicate matters, the semantics of Are there any other concepts, qualities, or observations with overloaded semantics? It seems that I'll see if I can work up a suitable example. More on this to come.
-
Also, the intractability of (2) for local machines could be an opportunity to test CyVerse, or HiPerGator, or similar.
-
This thread is intended to record discussion and decision points when designing the datum grouping in the long data format (lvl 1), which we use as an intermediate between the original data format (lvl 0) and the curated data product (lvl 2). The long data format currently has two main components: the id locating the datum grouping and the datum description. We will focus on the grouping in this thread.
Data group together and often have a hierarchical relationship with other data. In relational databases this grouping is often represented by row and table relationships. One exception is that multiple rows within a table can be assigned to groupings as well (e.g., treatment/control). This leads to our first decision point.
Decision point lvl0:1 I propose treating multi-row groupings as data rather than ids.
Not included here is the datum description, which we will discuss further in other threads.
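As a rough illustration of decision point lvl0:1, here is a minimal Python sketch of melting a lvl 0 table into long records. The table name `field`, the column names (`site_id`, `treatment`, `soc`), and the helper `to_long` are all hypothetical and not taken from the project code. The point is only that a multi-row grouping such as treatment/control ends up as a datum row rather than as part of the id.

```python
# Hypothetical lvl 0 table: two rows share a site id but differ by treatment.
lvl0_rows = [
    {"site_id": "S1", "treatment": "control", "soc": 2.1},
    {"site_id": "S1", "treatment": "warmed", "soc": 1.8},
]


def to_long(rows, id_cols, table_name):
    """Melt each row into one long record per non-id column.

    Columns listed in id_cols are copied onto every record; all other
    columns (including the treatment grouping) become variable/value data.
    """
    long_records = []
    for row_number, row in enumerate(rows, start=1):
        ids = {col: row[col] for col in id_cols}
        for variable, value in row.items():
            if variable in id_cols:
                continue
            long_records.append(
                {"table": table_name, "row": row_number, **ids,
                 "variable": variable, "value": value}
            )
    return long_records


# Only site_id is treated as an id; treatment is stored as a datum.
long_data = to_long(lvl0_rows, id_cols=["site_id"], table_name="field")
```

In this sketch the original row number is what ties a treatment value to the measurements that share its row, so nothing is lost by keeping the grouping out of the id.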
ISCN3
ISCN3 consists of four tables: `dataset`, `citation`, `profile`, and `layer`. Note that the ids are generally not unique, and the set of ids and foreign keys must be referred to together to generate a unique row identifier. Also note that `dataset_name` is cross-referenced with `dataset_name_sub` and may be crossed with `dataset_name_soc`. `dataset_name_soc` may also refer to ISCN soil organic carbon stock gap filling.
Decision point ISCN:1 We consider `dataset_name_soc` to be a soil organic carbon method and part of a unique row identifier for the `profile`.
Intermediate data Lvl 1
The intermediate data model for this project has oscillated between three structures:
1. Stacked
This unjoined stack option is, in some ways, a non-option. It preserves the original table structures most closely. This moves a lot of the data manipulation work into the curation phase and maintains any data normalization (i.e., avoidance of repeated data) done in the original study.
However, it removes some of the advantage of a more unified data model and makes the lvl1 data much more difficult to work with when creating subsequent data products.
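To make the cost of the stacked option concrete, the sketch below keeps two tables unjoined and resolves a composite key only when a downstream product needs it. This is an assumption-heavy illustration: the column names `profile_name`, `layer_name`, `lat`, and `soc`, the choice of key columns, and the helper `index_by` are hypothetical. Only `dataset_name_sub` and the table names `profile` and `layer` come from the ISCN3 description above.

```python
# Stacked option: each original table is kept as-is, keyed by table name.
stacked = {
    "profile": [
        {"dataset_name_sub": "A", "profile_name": "p1", "lat": 45.0},
        {"dataset_name_sub": "B", "profile_name": "p1", "lat": 61.5},
    ],
    "layer": [
        {"dataset_name_sub": "A", "profile_name": "p1", "layer_name": "l1", "soc": 2.1},
        {"dataset_name_sub": "B", "profile_name": "p1", "layer_name": "l1", "soc": 0.7},
    ],
}

# Assumed composite key: profile_name alone is not unique across datasets.
PROFILE_KEY = ("dataset_name_sub", "profile_name")


def index_by(rows, key_cols):
    """Index rows by a composite key, raising if the key is not unique."""
    index = {}
    for row in rows:
        key = tuple(row[col] for col in key_cols)
        if key in index:
            raise ValueError(f"key {key} is not unique")
        index[key] = row
    return index


profiles = index_by(stacked["profile"], PROFILE_KEY)

# Curation-time work: look up the latitude that belongs to each layer.
layer_lat = [
    (layer["layer_name"], layer["soc"],
     profiles[tuple(layer[col] for col in PROFILE_KEY)]["lat"])
    for layer in stacked["layer"]
]
```

Every downstream product would need to repeat this kind of key resolution, which is the extra work the stacked option defers to the curation phase.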
2. Joined
The joined stack option is probably the most idealized option. Each datum has a complete association with the ids; for example, a citation associated with multiple layers would be repeated for each of those layers. This makes it fairly easy to work with as a data table using `dplyr::filter` and `unique`, but results in a data object with very high memory demand. This rapidly becomes intractable for almost all larger survey data on most desktop computers.
3. Three table
The three table option claims that we have some underlying understanding of the datum: it comes from somewhere and has `provenance`; observations are associated with some geolocation on the `surface` of the Earth, which may have a time element; and some of these are also associated with a specific `layer` with a depth interval. This places the highest burden on the level 1 coding since all data need to be associated with these three tables.