Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ingest fixes #36

Merged
merged 6 commits into from
May 21, 2024
Merged

Ingest fixes #36

merged 6 commits into from
May 21, 2024

Conversation

jameshadfield
Copy link
Member

The first 3 commits of #35, as that PR may never be merged.

The default line endings for `csv.DictWriter` are CRLF (amazingly)

<https://docs.python.org/3/library/csv.html#csv.Dialect.lineterminator>
in preparation for the subsequent commit which will add another ingest
source
@jameshadfield jameshadfield requested a review from joverlee521 May 16, 2024 23:22
Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fixes look good to me!

One thought that came to me as I was reviewing the fauna namespace change:
If we are namespacing every independent ingest workflow, I wonder if it would make more sense for the data "source" to be at the top level.

This would change the structure from

.
└── ingest/
    ├── data/
    │   ├── fauna
    │   └── andersen-lab
    ├── results/
    │   ├── fauna
    │   └── andersen-lab
    └── s3/
        ├── fauna
        └── andersen-lab

to

.
└── ingest/
    ├── fauna/
    │   ├── data
    │   ├── results
    │   └── s3
    └── andersen-lab/
        ├── data
        ├── results
        └── s3

ingest/rules/upload_from_fauna.smk Outdated Show resolved Hide resolved
@jameshadfield
Copy link
Member Author

If we are namespacing every independent ingest workflow, I wonder if it would make more sense for the data "source" to be at the top level.

I really like the top-level per-source folders idea... let's do this. I have always found the data vs results distinction within ingest to be not quite right, so I'm not thrilled about recreating these within each source directory, but I also don't want to get sidetracked on making progress here.

@joverlee521 do you want to take over this PR and build your NCBI work on top of it?

@tsibley
Copy link
Member

tsibley commented May 17, 2024

If we are namespacing every independent ingest workflow, I wonder if it would make more sense for the data "source" to be at the top level.

+1 for this. It's an approach that's worked well for me in the past when ingesting disparate sources into a single database. Each source has bespoke inputs and processing but emits conventional/standardized outputs which can be used and aggregated by downstream steps.

Parse the `output.sequences` path for the `output_dir` and the
`output_fstem` that are passed to the fauna script to ensure we don't
run into out of sync issues if we ever change the output.
joverlee521 added a commit that referenced this pull request May 20, 2024
Since we will need to namespace very data source within ingest, it makes
more sense for the data source namespace to be up one level.

The ingest build directory structure will look like:

```
.
└── ingest/
    ├── fauna/
    │   ├── data
    │   ├── results
    │   └── s3
    └── andersen-lab/
        ├── data
        ├── results
        └── s3
```

Based on discussion in
<#36 (review)>
Since we will need to namespace very data source within ingest, it makes
more sense for the data source namespace to be up one level.

The ingest build directory structure will look like:

```
.
└── ingest/
    ├── fauna/
    │   ├── data
    │   ├── results
    │   └── s3
    └── andersen-lab/
        ├── data
        ├── results
        └── s3
```

Based on discussion in
<#36 (review)>
@joverlee521 joverlee521 force-pushed the james/ingest-fixes branch from d17ad6b to 42c5a5e Compare May 20, 2024 22:07
Make it easier to override the default configs for testing by providing
the configs through a default config file.
@joverlee521
Copy link
Contributor

Tested fauna changes locally with

nextstrain build \
    --envdir ../env.d/seasonal-flu/ \
    ingest upload_all \
        --config "s3_dst=s3://nextstrain-data-private/files/workflows/avian-flu/trial/ingest-fixes" "segments=['ha']"

which successfully uploaded the ha files to the trial prefix.

Tested andersen-lab changes locally with

nextstrain build ingest merge_andersen_segment_metadata

which successfully completed the ingest.


I'm planning to merge this tomorrow morning. I'll make NCBI ingest changes separately.

@joverlee521 joverlee521 merged commit 14f0758 into master May 21, 2024
6 checks passed
@joverlee521 joverlee521 deleted the james/ingest-fixes branch May 21, 2024 16:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants