ingest: Use of `data` vs `results` directories #51

joverlee521 · 2024-07-10T23:38:28Z

Context

I've seen comments that the use of the data and results directories in the ingest workflow doesn't feel quite right, so it seems like this should be explicitly discussed here.

Current set up

Everything in the ingest workflow gets filed under the data directory except the final output metadata.tsv and sequences.fasta (with optional Nextclade results).

I had originally made this decision to make it easy to say all ingest outputs will be available under results (analogous to all phylogenetic outputs being in auspice). This also made it straightforward to move data from ingest to phylogenetic manually with `mv ingest/results/* phylogenetic/data/`.

However, I can see this being confusing since the intermediate files are technically "results" of the ingest workflow that feel weird to be under data.
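As a rough illustration of the manual hand-off described above, here is a minimal Python sketch of the same move. The function name `hand_off` is hypothetical and not part of any Nextstrain tooling; it only mirrors the `mv` command, including the shell glob's habit of skipping dotfiles.

```python
import shutil
from pathlib import Path


def hand_off(ingest_results: Path, phylo_data: Path) -> list[Path]:
    """Move every non-hidden entry from ingest/results into phylogenetic/data.

    Hypothetical helper that mimics `mv ingest/results/* phylogenetic/data/`.
    """
    phylo_data.mkdir(parents=True, exist_ok=True)
    moved = []
    for src in sorted(ingest_results.iterdir()):
        # A shell `*` glob does not match dotfiles, so skip them here too.
        if src.name.startswith("."):
            continue
        dest = phylo_data / src.name
        shutil.move(str(src), str(dest))
        moved.append(dest)
    return moved
```

Called as `hand_off(Path("ingest/results"), Path("phylogenetic/data"))`, this moves whatever is in the results directory, which is why keeping only final outputs there makes the hand-off simple.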

Possible solutions

Option 1

Adding an explicit intermediates directory so that the use of data makes more sense:

- data for everything directly fetched from outside sources
- intermediates for all intermediate files produced by ingest
- results remains for final output files that include metadata.tsv, sequences.fasta, and optional Nextclade output files.

Option 2

Adding an outputs directory so that the use of data and results both get shifted:

- data for everything directly fetched from outside sources
- results for all intermediate files produced by ingest
- outputs for final output files that include metadata.tsv, sequences.fasta, and optional Nextclade output files.
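To compare the current set up with Options 1 and 2 side by side, the layouts can be written down as a small lookup table. This is only a sketch: the names `LAYOUTS` and `directory_for` and the three file categories are hypothetical labels for this discussion, not anything defined by the workflow.

```python
# Hypothetical summary of where each kind of ingest file would live.
# Directory names are taken from the options described above.
LAYOUTS = {
    "current": {"fetched": "data", "intermediate": "data", "final": "results"},
    "option1": {"fetched": "data", "intermediate": "intermediates", "final": "results"},
    "option2": {"fetched": "data", "intermediate": "results", "final": "outputs"},
}


def directory_for(layout: str, kind: str) -> str:
    """Return the ingest subdirectory for a file kind under a given layout."""
    return LAYOUTS[layout][kind]


if __name__ == "__main__":
    for layout, mapping in LAYOUTS.items():
        print(layout, mapping)
```

Written this way, the difference is easy to see: Option 1 only moves intermediate files, while Option 2 moves both intermediate and final files.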


genehack · 2024-07-11T00:02:46Z

Option 3

Have the first step in phylogenetic have inputs of ../ingest/results/metadata.tsv and ../ingest/results/sequences.fasta and don't worry about what else might be in ingest/results?
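A minimal Python sketch of that idea, assuming the two file names above are the only required inputs. The helper `phylogenetic_inputs` is hypothetical; in the actual workflow this would presumably be expressed as the inputs of the first phylogenetic rule rather than as a standalone function.

```python
from pathlib import Path

# The only two files the phylogenetic workflow is assumed to need.
REQUIRED = ("metadata.tsv", "sequences.fasta")


def phylogenetic_inputs(ingest_results: Path) -> dict[str, Path]:
    """Pick the required files by name and ignore anything else in the directory."""
    inputs = {name: ingest_results / name for name in REQUIRED}
    missing = [name for name, path in inputs.items() if not path.is_file()]
    if missing:
        raise FileNotFoundError(
            f"missing in {ingest_results}: {', '.join(missing)}"
        )
    return inputs
```

For example, `phylogenetic_inputs(Path("../ingest/results"))` returns just the two paths regardless of how many intermediate files share the directory.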

joverlee521 · 2024-07-11T19:54:00Z

> Have the first step in phylogenetic have inputs of ../ingest/results/metadata.tsv and ../ingest/results/sequences.fasta and don't worry about what else might be in ingest/results?

Yup, this is possible in the workflow when the repo has the top level nextstrain-pathogen.yaml file. It makes sense for automated runs of the ingest/phylogenetic workflow, but I have heard a cluttered results directory is a pain to navigate when running workflows manually.

Maybe @jameshadfield and/or @j23414 can chime in here on what they expect the directories to be?

j23414 · 2024-07-11T20:12:55Z

Thanks for the clarification, I was mistaken in thinking that results was the intermediate directory for both ingest and phylogenetic, geh. I usually know this, but I managed to forget yesterday. Thanks for the write up here!

I mildly lean toward option 1 but even having the difference documented here in an issue is sufficient for me to avoid the mistake in the future.

> I have heard a cluttered results directory is a pain to navigate when running workflows manually.

YES, this. I've run into this when trying to explain where the final files are for the ingest workflow. It was easier having one final folder of final outputs.

jameshadfield · 2024-07-11T23:32:44Z

> Maybe @jameshadfield ... can chime in here on what they expect the directories to be?

From afar, I've found it mildly confusing that ingest/results contains the final outputs of its pipeline whereas phylogenetic/results contains the intermediate files of its workflow. And some pipelines put some intermediates in ingest/results - we've had pipelines with ingest/results/{metadata.tsv,metadata_all.tsv}! At the end of the day, as long as the README clearly indicates what files to use I'm happy with any directory name/structure.

joverlee521 · 2024-07-16T00:51:48Z

Ah, okay, so it seems like the confusion comes from a mismatch in the use of results between the ingest and phylogenetic workflows.
This makes me think that if we decide to change the structure, it should be something like Option 2.

However, it seems like clear documentation is enough and we can just use the existing directory structure.

Closing this issue as not planned. If anyone feels strongly enough to change the directory structure, please feel free to reopen for discussion.

joverlee521 mentioned this issue Jul 10, 2024

Update ingest nextstrain/yellow-fever#7

Merged

joverlee521 closed this as not planned Jul 16, 2024
