ingest: Use of data vs results directories #51

Closed
joverlee521 opened this issue Jul 10, 2024 · 5 comments

@joverlee521
Contributor

joverlee521 commented Jul 10, 2024

Context

I've seen comments that the use of the data and results directories in the ingest workflow doesn't feel quite right, so it seems worth discussing explicitly here.

Current setup

Everything in the ingest workflow gets filed under the data directory except the final output metadata.tsv and sequences.fasta (with optional Nextclade results).

I had originally made this decision to make it easy to say all ingest outputs will be available under results (analogous to all phylogenetic outputs being in auspice). This also made it straightforward to move data from ingest to phylogenetic manually with mv ingest/results/* phylogenetic/data/.
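As a rough illustration of that manual hand-off, here is a minimal sketch run in a scratch directory. The placeholder files and the intermediate file name are assumptions for demonstration, not actual ingest outputs:

```shell
# Sketch only: simulate the layout in a scratch directory so the move can be tried safely.
workdir="$(mktemp -d)"
mkdir -p "$workdir/ingest/data" "$workdir/ingest/results" "$workdir/phylogenetic/data"

# Placeholder files standing in for real ingest files.
touch "$workdir/ingest/data/intermediate.ndjson"   # hypothetical intermediate file name
touch "$workdir/ingest/results/metadata.tsv" "$workdir/ingest/results/sequences.fasta"

# Only final outputs live in ingest/results, so a plain glob moves exactly those.
mv "$workdir"/ingest/results/* "$workdir/phylogenetic/data/"

ls "$workdir/phylogenetic/data"
```

Because intermediates stay under ingest/data, nothing needs to be filtered out before the move.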

However, I can see this being confusing: the intermediate files are technically "results" of the ingest workflow, so it feels odd to file them under data.

Possible solutions

Option 1

Adding an explicit intermediates directory so that the use of data makes more sense:

  1. data for everything directly fetched from outside sources
  2. intermediates for all intermediate files produced by ingest
  3. results remains for final output files that include metadata.tsv, sequences.fasta, and optional Nextclade output files.

Option 2

Adding an outputs directory so that the uses of data and results both shift:

  1. data for everything directly fetched from outside sources
  2. results for all intermediate files produced by ingest
  3. outputs for final output files that include metadata.tsv, sequences.fasta, and optional Nextclade output files.
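To compare the two proposals side by side, the following sketch creates both layouts in a scratch directory. The option1 and option2 parent folders are hypothetical names used only to keep the two layouts apart:

```shell
# Sketch only: build both proposed ingest layouts in a scratch directory.
workdir="$(mktemp -d)"

# Option 1: data (fetched) -> intermediates (intermediate files) -> results (final outputs)
mkdir -p "$workdir/option1/ingest/data" \
         "$workdir/option1/ingest/intermediates" \
         "$workdir/option1/ingest/results"

# Option 2: data (fetched) -> results (intermediate files) -> outputs (final outputs)
mkdir -p "$workdir/option2/ingest/data" \
         "$workdir/option2/ingest/results" \
         "$workdir/option2/ingest/outputs"

ls "$workdir/option1/ingest" "$workdir/option2/ingest"
```

Note that results means "final outputs" in Option 1 but "intermediate files" in Option 2.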

@genehack
Contributor

Option 3

Have the first step in phylogenetic have inputs of ../ingest/results/metadata.tsv and ../ingest/results/sequences.fasta and don't worry about what else might be in ingest/results?
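A minimal sketch of this idea in plain shell, assuming a scratch directory with placeholder files. In the real workflow this would be a workflow rule's inputs rather than a cp loop, and the extra file is a hypothetical piece of clutter:

```shell
# Sketch only: reference the two final files by explicit path and ignore everything else.
workdir="$(mktemp -d)"
mkdir -p "$workdir/ingest/results" "$workdir/phylogenetic/data"
touch "$workdir/ingest/results/metadata.tsv" "$workdir/ingest/results/sequences.fasta"
touch "$workdir/ingest/results/extra_intermediate.tsv"   # hypothetical clutter in results

cd "$workdir/phylogenetic"
for name in metadata.tsv sequences.fasta; do
    # Explicit relative paths, as in the suggestion above; other files are never touched.
    cp "../ingest/results/$name" "data/$name"
done

ls data
```

With explicit paths, whatever else ends up in ingest/results has no effect on the phylogenetic inputs.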

@joverlee521
Contributor Author

joverlee521 commented Jul 11, 2024

> Have the first step in phylogenetic have inputs of ../ingest/results/metadata.tsv and ../ingest/results/sequences.fasta and don't worry about what else might be in ingest/results?

Yup, this is possible in the workflow when the repo has the top-level nextstrain-pathogen.yaml file. It makes sense for automated runs of the ingest/phylogenetic workflow, but I have heard a cluttered results directory is a pain to navigate when running workflows manually.

Maybe @jameshadfield and/or @j23414 can chime in here on what they expect the directories to be?

@j23414
Contributor

j23414 commented Jul 11, 2024

Thanks for the clarification. I was mistaken in thinking that results was the intermediate directory for both ingest and phylogenetic, geh. I usually know this, but I managed to forget yesterday. Thanks for the write-up here!

I mildly lean toward Option 1, but even having the difference documented here in an issue is sufficient for me to avoid the mistake in the future.

> I have heard a cluttered results directory is a pain to navigate when running workflows manually.

YES, this. I've run into this when trying to explain where the final files are for the ingest workflow. It was easier having one final folder of final outputs.

@jameshadfield
Member

> Maybe @jameshadfield ... can chime in here on what they expect the directories to be?

From afar, I've found it mildly confusing that ingest/results contains the final outputs of its pipeline whereas phylogenetic/results contains the intermediate files of its workflow. And some pipelines put some intermediates in ingest/results - we've had pipelines with ingest/results/{metadata.tsv,metadata_all.tsv}! At the end of the day, as long as the README clearly indicates what files to use I'm happy with any directory name/structure.

@joverlee521
Contributor Author

Ah, okay, so it seems like the confusion comes from a mismatch in how results is used in the ingest and phylogenetic workflows.
This makes me think that if we decide to change the structure, it should be something like Option 2.

However, it seems like clear documentation is enough and we can just keep the existing directory structure.

Closing this issue as not planned. If anyone feels strongly enough to change the directory structure, please feel free to reopen for discussion.

@joverlee521 closed this as not planned Jul 16, 2024