Ingest fixes #36
The default line endings for `csv.DictWriter` are CRLF (amazingly) <https://docs.python.org/3/library/csv.html#csv.Dialect.lineterminator>
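A minimal sketch of the behaviour this commit message describes, using only the standard library. The helper name `to_csv_text` and the sample row are hypothetical; the actual ingest script and its columns are not shown here.

```python
import csv
import io


def to_csv_text(rows, fieldnames, **writer_kwargs):
    """Serialize dict rows to CSV text, forwarding extra options to DictWriter."""
    # newline="" keeps the buffer from translating line endings on write.
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, **writer_kwargs)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data, for illustration only.
fields = ["strain", "segment"]
rows = [{"strain": "A/example/1", "segment": "ha"}]

# Default "excel" dialect: every row ends in CRLF.
default_text = to_csv_text(rows, fields)

# Explicit Unix line endings.
unix_text = to_csv_text(rows, fields, lineterminator="\n")
```

Passing `lineterminator="\n"` to the writer is one way to avoid the CRLF default; whether the commit uses exactly this option is an assumption.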
in preparation for the subsequent commit which will add another ingest source
The fixes look good to me!
One thought that came to me as I was reviewing the fauna namespace change:
If we are namespacing every independent ingest workflow, I wonder if it would make more sense for the data "source" to be at the top level.
This would change the structure from

    .
    └── ingest/
        ├── data/
        │   ├── fauna
        │   └── andersen-lab
        ├── results/
        │   ├── fauna
        │   └── andersen-lab
        └── s3/
            ├── fauna
            └── andersen-lab

to

    .
    └── ingest/
        ├── fauna/
        │   ├── data
        │   ├── results
        │   └── s3
        └── andersen-lab/
            ├── data
            ├── results
            └── s3
I really like the top-level per-source folders idea... let's do this. I have always found the

@joverlee521 do you want to take over this PR and build your NCBI work on top of it?
+1 for this. It's an approach that's worked well for me in the past when ingesting disparate sources into a single database. Each source has bespoke inputs and processing but emits conventional/standardized outputs which can be used and aggregated by downstream steps.
Parse the `output.sequences` path for the `output_dir` and the `output_fstem` that are passed to the fauna script to ensure we don't run into out of sync issues if we ever change the output.
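As a rough sketch of the idea in this commit message, the directory and file stem can both be derived from a single output path so they cannot drift apart. The helper name `split_output_path` and the example path are hypothetical; how the fauna script consumes the two values is not shown here.

```python
from pathlib import Path


def split_output_path(sequences_path):
    """Return (output_dir, output_fstem) derived from one output file path."""
    path = Path(sequences_path)
    # parent is the directory; stem is the file name without its final suffix.
    return path.parent.as_posix(), path.stem


# Hypothetical output path, for illustration only.
output_dir, output_fstem = split_output_path("fauna/data/example.fasta")
```

Because both values come from the same path, changing the declared output automatically changes what is passed to the script.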
Since we will need to namespace every data source within ingest, it makes more sense for the data source namespace to be up one level. The ingest build directory structure will look like:

```
.
└── ingest/
    ├── fauna/
    │   ├── data
    │   ├── results
    │   └── s3
    └── andersen-lab/
        ├── data
        ├── results
        └── s3
```

Based on discussion in <#36 (review)>
d17ad6b to 42c5a5e
Make it easier to override the default configs for testing by providing the configs through a default config file.
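The general pattern can be sketched as follows: defaults live in one file, and a test run overrides only the keys it needs. This is an illustration under assumptions, not the workflow's actual config: the keys `s3_dst` and `upload`, the helper `load_config`, and the use of JSON (chosen only to stay within the standard library) are all hypothetical.

```python
import json

# Hypothetical default config contents; the real file and keys are not shown.
DEFAULT_CONFIG_TEXT = '{"s3_dst": "s3://example-bucket/ingest", "upload": true}'


def load_config(default_text, overrides=None):
    """Parse the defaults, then apply any overrides on top (shallow merge)."""
    config = json.loads(default_text)
    config.update(overrides or {})
    return config


# Normal run: defaults only.
defaults = load_config(DEFAULT_CONFIG_TEXT)

# Test run: override a single key and inherit the rest.
test_config = load_config(DEFAULT_CONFIG_TEXT, {"upload": False})
```

With the defaults in a file, a test only has to state what differs from the normal run.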
Tested fauna changes locally with
which successfully uploaded the

Tested andersen-lab changes locally with `nextstrain build ingest merge_andersen_segment_metadata`, which successfully completed the ingest.

I'm planning to merge this tomorrow morning. I'll make NCBI ingest changes separately.
The first 3 commits of #35, as that PR may never be merged.