Long format (lvl 1) - datum groupings #60

ktoddbrown · 2024-11-22T15:28:21Z

ktoddbrown
Nov 22, 2024
Maintainer

This thread is intended to record discussion and decision points for designing the datum grouping in the long data format (lvl 1), which we use as an intermediate between the original data format (lvl 0) and the curated data product (lvl 2). The long data format currently has two main components: the id locating the datum grouping and the datum description. We will focus on the grouping in this thread.

Data group together and often have a hierarchical relationship with other data. In relational databases this grouping is often represented by row and table relationships. One exception is that multiple rows within a table can also be assigned to groupings (e.g., treatment/control). This leads to our first decision point.

Decision point lvl0:1 I propose treating multi-row groupings as datum rather than ids.

Not included here is the datum description, which we will discuss further in other threads.
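To make Decision point lvl0:1 concrete, here is a minimal sketch of what treating a multi-row grouping as a datum could look like. It is written in Python only to stay self-contained; the column names (`plot`, `group`, `soc`), the values, and the helper `to_long` are all hypothetical and not taken from any of the project's read functions.

```python
# Hypothetical original (lvl 0) rows: 'group' marks a multi-row grouping.
rows = [
    {"plot": "p1", "group": "control", "soc": "1.2"},
    {"plot": "p2", "group": "treatment", "soc": "1.5"},
]

ID_COLUMNS = ("plot",)  # assumption: only the row locator is treated as an id


def to_long(row):
    """Pivot one wide row to long records; every non-id column,
    including the grouping label, becomes a datum."""
    ids = {c: row[c] for c in ID_COLUMNS}
    return [
        {**ids, "variable": name, "value": value}
        for name, value in row.items()
        if name not in ID_COLUMNS
    ]


long_rows = [rec for row in rows for rec in to_long(row)]
```

Under this reading the treatment/control label travels with the data and can be filtered like any other variable, rather than widening the id.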

ISCN3

ISCN3 consists of four tables: dataset, citation, profile, and layer. Note that the ids are generally not unique, so the set of ids and foreign keys must be referred to together to generate a unique row identifier. Also note that dataset_name is cross-referenced with dataset_name_sub and may be crossed with dataset_name_soc. dataset_name_soc may also refer to ISCN soil organic carbon stock gap filling.

Decision point ISCN:1 We consider dataset_name_soc to be a soil organic carbon method and part of a unique row identifier for the profile.

erDiagram
    dataset ||--|{ citation : has
    dataset ||--o{ profile : has
    dataset ||--o{ layer : has
    profile ||--|{ layer : has

    dataset {
        id dataset_name
    }

    citation {
       dataset_id dataset_name
    }
    profile {
       dataset_id dataset_name_sub
       id dataset_name_soc
       id site_name
       id profile_name
    }
    layer {
       dataset_id dataset_name_sub
       profile_id dataset_name_soc
       profile_id site_name
       profile_id profile_name
       id layer_name
    }
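
As a rough illustration of the composite identifier described above, the sketch below builds a row key for a layer by combining the key columns listed in the diagram. It is in Python only to keep it self-contained; the row values and the helper name `row_key` are hypothetical and not taken from ISCN3.

```python
# Key columns for each table, copied from the diagram above.
PROFILE_KEY = ("dataset_name_sub", "dataset_name_soc", "site_name", "profile_name")
LAYER_KEY = PROFILE_KEY + ("layer_name",)


def row_key(row, columns):
    """Return the tuple of id values that together identify a row."""
    return tuple(row[col] for col in columns)


# Hypothetical layer rows: layer_name alone repeats across profiles.
layers = [
    {"dataset_name_sub": "A", "dataset_name_soc": "method1", "site_name": "s1",
     "profile_name": "p1", "layer_name": "L1"},
    {"dataset_name_sub": "A", "dataset_name_soc": "method1", "site_name": "s1",
     "profile_name": "p2", "layer_name": "L1"},
]

layer_names = [row["layer_name"] for row in layers]
full_keys = [row_key(row, LAYER_KEY) for row in layers]

# layer_name is not unique on its own, but the full composite key is.
layer_name_unique = len(set(layer_names)) == len(layers)
full_key_unique = len(set(full_keys)) == len(layers)
```

The same check could be run against the real tables to confirm that the chosen column set, including dataset_name_soc, does produce unique rows.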

Intermediate data Lvl 1

The intermediate data model for this project has oscillated between three structures:

1. a single long, unjoined table with identifiers stacked, implying that a datum with a missing id applies to all specified ids.
2. a single long table with any original tables joined together into a long, dense table.
3. a set of three tables representing the datum provenance (study), surface location (site), and depth location (layer).

1. Stacked

This unjoined stack option is, in some ways, a non-option. It preserves the original table structures most closely, which moves a lot of the data manipulation work into the curation phase and maintains any data normalization (i.e., avoidance of repeated data) done in the original study.

However, it removes some of the advantages of a more unified data model and makes the lvl 1 data much more difficult to work with when creating subsequent data products.
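
A minimal sketch of the stacked option, assuming a missing id is read as "applies to all". The id columns (`dataset`, `profile`, `layer`), the rows, and the helpers `applies_to` and `data_for` are hypothetical; Python is used only so the example is self-contained.

```python
# Hypothetical stacked rows: each row is one datum plus whatever ids its
# original table carried. None marks an id the original table did not have.
stacked = [
    {"dataset": "d1", "profile": None, "layer": None,
     "variable": "citation", "value": "citation text"},
    {"dataset": "d1", "profile": "p1", "layer": None,
     "variable": "latitude", "value": "45.0"},
    {"dataset": "d1", "profile": "p1", "layer": "L1",
     "variable": "soc", "value": "1.2"},
    {"dataset": "d1", "profile": "p2", "layer": "L1",
     "variable": "soc", "value": "0.8"},
]

ID_COLUMNS = ("dataset", "profile", "layer")


def applies_to(row, target):
    """A row applies when every id it specifies matches the target;
    a missing (None) id is read as 'applies to all'."""
    return all(row[c] is None or row[c] == target[c] for c in ID_COLUMNS)


def data_for(target):
    """Collect every datum that applies to the target ids."""
    return {r["variable"]: r["value"] for r in stacked if applies_to(r, target)}


layer_data = data_for({"dataset": "d1", "profile": "p1", "layer": "L1"})
```

Nothing is repeated in storage, but every downstream query has to carry the matching rule, which is the extra curation-phase work noted above.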

2. Joined

The joined stack option is probably the most idealized option. Each datum has a complete association with the ids; for example, a citation associated with multiple layers would be repeated for each of those layers. This makes it fairly easy to work with as a data table using dplyr::filter and unique, but results in a data object with a very high memory demand. This rapidly becomes intractable for almost all larger survey data on most desktop computers.
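
The repetition can be seen in a small sketch. This is an illustration in plain Python rather than the dplyr workflow mentioned above, and the tables, column names, and values are hypothetical.

```python
# Hypothetical normalized inputs: one citation for the dataset, three layers.
citations = [{"dataset": "d1", "citation": "citation text"}]
layers = [{"dataset": "d1", "profile": "p1", "layer": f"L{i}"} for i in range(1, 4)]

# Join: every layer row is paired with every citation from its dataset,
# so the dataset-level value is copied onto each layer.
joined = [
    {**layer, "variable": "citation", "value": cit["citation"]}
    for layer in layers
    for cit in citations
    if cit["dataset"] == layer["dataset"]
]

# Filtering is then a simple row test with no further lookups.
l2_rows = [r for r in joined if r["layer"] == "L2"]

# The cost: the citation is stored once per layer instead of once.
copies = sum(1 for r in joined if r["value"] == "citation text")
```

With three layers the citation is stored three times; the same multiplication applies to every dataset-level and profile-level datum, which is where the memory demand comes from.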

3. Three table

The three-table option claims that we have some underlying understanding of the datum: it comes from somewhere and has provenance; observations are associated with some geolocation on the surface of the Earth that may have a time element; and some observations are also associated with a specific layer with a depth interval. This places the highest burden on the level 1 coding since all data need to be associated with these three tables.

erDiagram
    provenance ||--o{ site : has
    provenance ||--o{ layer : has
    site ||--|{ layer : has
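
One way to read the diagram above is as a routing rule: each datum goes to the most specific table its ids support. The sketch below assumes that reading; the `Datum` fields, the `table_for` helper, and the sample values are hypothetical, and Python is used only to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Datum:
    provenance_id: str        # where the datum came from (study)
    site_id: Optional[str]    # surface location, if any
    layer_id: Optional[str]   # depth interval, if any
    variable: str
    value: str


def table_for(d: Datum) -> str:
    """Assign a datum to the most specific table its ids support."""
    if d.layer_id is not None:
        # Assumption from the diagram: every layer belongs to a site.
        if d.site_id is None:
            raise ValueError("a layer datum must also name its site")
        return "layer"
    if d.site_id is not None:
        return "site"
    return "provenance"


data = [
    Datum("study1", None, None, "citation", "citation text"),
    Datum("study1", "s1", None, "latitude", "45.0"),
    Datum("study1", "s1", "L1", "soc", "1.2"),
]

tables = {"provenance": [], "site": [], "layer": []}
for d in data:
    tables[table_for(d)].append(d)
```

The burden mentioned above shows up in `table_for`: the level 1 code has to decide, for every datum, which of the three tables it belongs to.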

ktoddbrown · 2024-11-22T19:33:45Z

ktoddbrown
Nov 22, 2024
Maintainer Author

It might make sense to move the current read functions from returning a generic 'long' format to the specific 'stacked', 'joined', or 'threeTable' formats described above.


ktoddbrown · 2024-11-22T19:36:27Z

ktoddbrown
Nov 22, 2024
Maintainer Author

This three-table format is very centered on geolocation applications (like digital soil mapping) rather than observation-centered applications (like pedotransfer function development). Maybe call it 'geolocation' instead?


brandonnodnarb · 2025-01-08T01:26:24Z

brandonnodnarb
Jan 8, 2025
Collaborator

ok, assuming I am understanding (big IF)...the issue(s) are:

ISCN3 consists of four tables dataset, citation, profile, and layer. Note that the id's are generally not unique id's and the set of id's and foreign keys must be referred to together to generate a unique row identifier. Also note that dataset_name is cross referenced with dataset_name_sub and may be crossed with dataset_name_soc. dataset_name_soc may also refer to ISCN soil organic carbon stock gap filling. Decision point ISCN:1 We consider dataset_name_soc to be a soil organic carbon method and part of a unique row identifier for the profile.

Which means there are no globally unique IDs; tabular data are primarily not stand-alone, since they require values from other tables to maintain identity/coherence/context. To further complicate matters, the semantics of dataset_name_soc are overloaded; this will likely require additional coding or manual intervention to resolve.

Are there any other concepts, qualities, or observations with overloaded semantics?
Are more tables the answer here? I'm concerned this will end up as a turtles all the way down endeavor.

It seems that profile, layer and site are views, with dataset carrying a collection of views. Perhaps inverting this approach would address creating tables of tables of tables--i.e. list all data qualities along with their associated views and first order relations?

I'll see if I can work up a suitable example. More on this to come.


brandonnodnarb · 2025-01-08T21:19:43Z

brandonnodnarb
Jan 8, 2025
Collaborator

Also, the intractability of (2) for local machines could be an opportunity to test CyVerse, or HiPerGator, or similar.

