Ingest fixes #36

jameshadfield · 2024-05-16T23:22:43Z

The first 3 commits of #35, as that PR may never be merged.

The default line endings for `csv.DictWriter` are CRLF (amazingly) <https://docs.python.org/3/library/csv.html#csv.Dialect.lineterminator>
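A minimal sketch of this behaviour, using an in-memory buffer and made-up column names (`strain` and `segment` are illustrative only, not the actual metadata columns). `csv.DictWriter` falls back to the `excel` dialect, whose `lineterminator` is `\r\n`; passing `lineterminator="\n"` switches the output to LF.

```python
import csv
import io


def write_rows(rows, fieldnames, **writer_kwargs):
    """Write rows to an in-memory buffer and return the raw text."""
    # newline="" keeps io.StringIO from translating line endings,
    # so the result is exactly what the csv module emits.
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, **writer_kwargs)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical rows; the real metadata columns are not shown in this PR.
rows = [{"strain": "A/example/1", "segment": "ha"}]
fields = ["strain", "segment"]

default_text = write_rows(rows, fields)
lf_text = write_rows(rows, fields, lineterminator="\n")

print(repr(default_text))  # every row ends in "\r\n"
print(repr(lf_text))       # every row ends in "\n"
```

When writing to a real file, the `csv` documentation also recommends opening it with `newline=""` so that the platform does not add a second layer of newline translation.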

Namespace the fauna ingest files in preparation for the subsequent commit, which will add another ingest source.

joverlee521

The fixes look good to me!

One thought that came to me as I was reviewing the fauna namespace change:
If we are namespacing every independent ingest workflow, I wonder if it would make more sense for the data "source" to be at the top level.

This would change the structure from

.
└── ingest/
    ├── data/
    │   ├── fauna
    │   └── andersen-lab
    ├── results/
    │   ├── fauna
    │   └── andersen-lab
    └── s3/
        ├── fauna
        └── andersen-lab

to

.
└── ingest/
    ├── fauna/
    │   ├── data
    │   ├── results
    │   └── s3
    └── andersen-lab/
        ├── data
        ├── results
        └── s3
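To make the difference concrete, here is a small sketch of how a path would be composed under each layout. The helper names `path_by_kind` and `path_by_source` are hypothetical and only illustrate the two orderings; they are not part of the workflow.

```python
from pathlib import PurePosixPath

INGEST_ROOT = PurePosixPath("ingest")


def path_by_kind(source: str, kind: str) -> PurePosixPath:
    """Current layout: ingest/<kind>/<source>."""
    return INGEST_ROOT / kind / source


def path_by_source(source: str, kind: str) -> PurePosixPath:
    """Proposed layout: ingest/<source>/<kind>."""
    return INGEST_ROOT / source / kind


for source in ("fauna", "andersen-lab"):
    for kind in ("data", "results", "s3"):
        print(path_by_kind(source, kind), "->", path_by_source(source, kind))
```

With the source at the top level, everything belonging to one ingest workflow sits under a single prefix, so adding or removing a source touches one directory rather than three.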

ingest/rules/upload_from_fauna.smk

jameshadfield · 2024-05-17T00:15:42Z

> If we are namespacing every independent ingest workflow, I wonder if it would make more sense for the data "source" to be at the top level.

I really like the top-level per-source folders idea... let's do this. I have always found the data vs results distinction within ingest to be not quite right, so I'm not thrilled about recreating these within each source directory, but I also don't want to get sidetracked on making progress here.

@joverlee521 do you want to take over this PR and build your NCBI work on top of it?

tsibley · 2024-05-17T18:18:31Z

> If we are namespacing every independent ingest workflow, I wonder if it would make more sense for the data "source" to be at the top level.

+1 for this. It's an approach that's worked well for me in the past when ingesting disparate sources into a single database. Each source has bespoke inputs and processing but emits conventional/standardized outputs which can be used and aggregated by downstream steps.

Parse the `output.sequences` path for the `output_dir` and the `output_fstem` that are passed to the fauna script, to ensure we don't run into out-of-sync issues if we ever change the output.
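A rough sketch of the idea in plain Python, outside of Snakemake. The example path `fauna/data/sequences_ha.fasta` and the helper `split_output_path` are assumptions for illustration; the actual rule's output pattern and the fauna script's arguments may differ.

```python
from pathlib import PurePosixPath


def split_output_path(sequences_path: str) -> tuple[str, str]:
    """Derive (output_dir, output_fstem) from a single declared output path."""
    path = PurePosixPath(sequences_path)
    # parent is the directory; stem is the file name without its final suffix.
    return str(path.parent), path.stem


# Hypothetical output path standing in for `output.sequences`.
output_dir, output_fstem = split_output_path("fauna/data/sequences_ha.fasta")
print(output_dir, output_fstem)
```

Because both values are derived from the same string, renaming or relocating the declared output changes them together, which is the out-of-sync risk the commit is guarding against.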

Since we will need to namespace every data source within ingest, it makes more sense for the data source namespace to be up one level. The ingest build directory structure will look like:

```
.
└── ingest/
    ├── fauna/
    │   ├── data
    │   ├── results
    │   └── s3
    └── andersen-lab/
        ├── data
        ├── results
        └── s3
```

Based on discussion in <#36 (review)>

Make it easier to override the default configs for testing by providing the configs through a default config file.
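The layering this enables can be sketched in plain Python. This is only an illustration of defaults being overridden by values supplied at invocation time, not Snakemake's actual merging code. The `s3_dst` and `segments` keys mirror the `--config` overrides used when testing this PR, but every value shown here is a placeholder, and `apply_overrides` is a hypothetical helper.

```python
def apply_overrides(defaults: dict, overrides: dict) -> dict:
    """Return a new config in which override values win over defaults."""
    merged = dict(defaults)  # shallow copy, so the defaults stay untouched
    merged.update(overrides)
    return merged


# Placeholder defaults, standing in for the contents of a default config file.
defaults = {
    "s3_dst": "s3://example-bucket/default-prefix",
    "segments": ["pb2", "pb1", "pa", "ha", "np", "na", "mp", "ns"],
}

# Values a tester might supply on the command line.
overrides = {"s3_dst": "s3://example-bucket/trial-prefix", "segments": ["ha"]}

config = apply_overrides(defaults, overrides)
print(config["s3_dst"], config["segments"])
```

Keeping the defaults in a file means a test run only has to name the keys it wants to change, while everything else keeps its default value.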

joverlee521 · 2024-05-20T22:58:14Z

Tested fauna changes locally with

nextstrain build \
    --envdir ../env.d/seasonal-flu/ \
    ingest upload_all \
        --config "s3_dst=s3://nextstrain-data-private/files/workflows/avian-flu/trial/ingest-fixes" "segments=['ha']"

which successfully uploaded the ha files to the trial prefix.

Tested andersen-lab changes locally with

nextstrain build ingest merge_andersen_segment_metadata

which successfully completed the ingest.

I'm planning to merge this tomorrow morning. I'll make NCBI ingest changes separately.

jameshadfield added 3 commits May 16, 2024 12:43

Fix typo

6b9f1cf

use LF not CRLF for metadata

30b2640

namespace fauna ingest files

05dd9ff

jameshadfield requested a review from joverlee521 May 16, 2024 23:22

joverlee521 reviewed May 16, 2024

ingest/rules/upload_from_fauna.smk (outdated)

jameshadfield assigned joverlee521 May 20, 2024

ingest/upload_from_fauna: parse output.sequences

5e90303

joverlee521 force-pushed the james/ingest-fixes branch from d17ad6b to 42c5a5e May 20, 2024 22:07

ingest: move config values to defaults/config.yaml

daeac83

joverlee521 merged commit 14f0758 into master May 21, 2024
6 checks passed

joverlee521 deleted the james/ingest-fixes branch May 21, 2024 16:38

joverlee521 mentioned this pull request Jul 10, 2024

ingest: Use of data vs results directories nextstrain/pathogen-repo-guide#51

Closed
