Add script for downloading local reference files and images #230
Conversation
should make local files a bit easier
Thanks for doing this! My one main idea: do you think there is a way we could get the paths to the reference files from the config file we have, rather than directly declaring them here? I'm just worried about things getting out of sync in the future, but if it's too much of a headache, this way works too.
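One hedged sketch of that idea: if the config file consists of simple `params.name = "value"` assignments (an assumption on my part; the real `reference_paths.config` may use more complex Nextflow constructs), the script could scrape the paths out of it rather than declaring them a second time. The `read_ref_params` helper below is hypothetical, not part of the PR:

```python
import re

# Hypothetical helper: pull simple `params.name = "value"` assignments out of
# a Nextflow-style config so paths are declared in only one place.
# Assumes plain string assignments; interpolated or nested values would need
# more handling than this sketch provides.
PARAM_RE = re.compile(r'^\s*(?:params\.)?(\w+)\s*=\s*["\']([^"\']+)["\']')

def read_ref_params(config_text: str) -> dict:
    """Return a {param_name: value} dict for simple string assignments."""
    params = {}
    for line in config_text.splitlines():
        match = PARAM_RE.match(line)
        if match:
            params[match.group(1)] = match.group(2)
    return params

# Illustrative config content (made-up values):
sample = '''
params.assembly = "Homo_sapiens.GRCh38.104"
params.ref_fasta = "fasta/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
'''
refs = read_ref_params(sample)
print(refs["ref_fasta"])
```

This keeps the script and the workflow config from drifting apart, at the cost of tying the script to the config file's syntax.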
get_refs.py (Outdated)

```python
## download all the files and put them in the correct locations ##
print("Downloading reference files...")
for path in ref_paths[0:2]:
```
Suggested change:

```diff
-for path in ref_paths[0:2]:
+for path in ref_paths:
```
I don't think this should be here; otherwise it would only download the first 2 files.
Oh yes, that was for testing. I thought I took it out!
Okay, I have now implemented this. I thought I was being clever enough to make it work with the version of
I mostly left some textual comments here!

> Speaking of docs, those will come in a separate PR.

Checking that this will also include some additional comments within the script?
This is almost certainly a problem on my end, but just in case it's not! I was running from the `continuumio/miniconda3:4.10.3p0` container, and encountered this excitement:

```
(base) root@b815efb3cc5b:/home# python3 get_refs.py
Getting list of required reference files
Downloading reference files...
Getting homo_sapiens/ensembl-104/fasta/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
Getting homo_sapiens/ensembl-104/fasta/Homo_sapiens.GRCh38.dna.primary_assembly.fa.fai
Getting homo_sapiens/ensembl-104/annotation/Homo_sapiens.GRCh38.104.gtf.gz
Getting homo_sapiens/ensembl-104/annotation/Homo_sapiens.GRCh38.104.mitogenes.txt
Getting homo_sapiens/ensembl-104/annotation/Homo_sapiens.GRCh38.104.spliced_intron.tx2gene_3col.tsv
Getting homo_sapiens/ensembl-104/annotation/Homo_sapiens.GRCh38.104.spliced_cdna.tx2gene.tsv
Getting homo_sapiens/ensembl-104/salmon_index/Homo_sapiens.GRCh38.104.spliced_intron.txome/complete_ref_lens.bin
Getting homo_sapiens/ensembl-104/salmon_index/Homo_sapiens.GRCh38.104.spliced_intron.txome/ctable.bin
ERRO[2189] error waiting for container: invalid character 'c' looking for beginning of value
```

I'm blaming this on weird comms with the Docker engine, but wanted to share just in case this happens to someone else and it's not just a me-problem. I'm re-running now in my conda base environment, since Docker got weirdly borked after this and required a couple of restarts for the daemon to get back up and running. So far, so good.
```python
parser.add_argument("--paramfile", type=str,
                    default="localref_params.yaml",
                    help = "nextflow param file to write (default: `localref_params.yaml`)")
parser.add_argument("--revision", type=str,
```
I might suggest this be named `--version`, `--workflow_version`, `--release`, or similar. Up to you.
I am using `revision` because that is the term used by Nextflow. I don't love it, but I wanted to be consistent with the terminology used there.
Agreed for consistency! But it's a disappointing fact to learn as I start my nextflow learning adventure.
get_refs.py (Outdated)

```python
    ref_file = urllib.request.urlopen(reffile_url)
except urllib.error.URLError as e:
    print(e.reason)
    print(f"The file download failed for {reffile_url}, please check the URL for errors")
```
Suggested change:

```diff
-print(f"The file download failed for {reffile_url}, please check the URL for errors")
+print(f"The file download failed for {reffile_url}. Please check the URL for errors.")
```
get_refs.py (Outdated)

```python
parser.add_argument("--refdir", type=str,
                    default="scpca-references",
                    help = "destination directory for downloaded reference files")
parser.add_argument("--replace",
```
I'd suggest `--overwrite` here instead of `--replace`. Maybe even `--overwrite_refs`?
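A minimal sketch of the suggested flag, assuming a plain boolean switch (the name and help text here are illustrative, not the script's final interface):

```python
import argparse

# Illustrative argparse setup for the suggested `--overwrite` boolean flag.
# `store_true` makes the flag default to False and flip to True when given.
parser = argparse.ArgumentParser()
parser.add_argument("--overwrite",
                    action="store_true",
                    help="overwrite existing reference files instead of skipping them")

args = parser.parse_args(["--overwrite"])
print(args.overwrite)
```

With `store_true` there is no value to supply, so `--overwrite` reads naturally on the command line, and `args.overwrite` is simply `False` when the flag is omitted.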
```python
                    default="localref_params.yaml",
                    help = "nextflow param file to write (default: `localref_params.yaml`)")
parser.add_argument("--revision", type=str,
                    default="main",
```
Wondering if there is a reason to set this default to `development` instead, since that matches the current functionality. But I am disagreeing with myself a good deal even as I write this, because that will get messy when this is eventually merged into `main`; until then, one can just provide the value on the command line.
Yes... in theory it should work with `main` once merged, so this should only be a problem for us and the initial testers, who will get a bit more hand-holding.
get_refs.py (Outdated)

```python
ref_paths += [barcode_dir / f for f in barcode_files]

## download all the files and put them in the correct locations ##
print("Downloading reference files...")
```
Might be nice to add a blurb about "this will take a while, why not go take a coffee break?" Any kind of expectation-setting statement about runtime, ☕ or otherwise.
get_refs.py (Outdated)

```python
pfile = Path(args.paramfile)
# check if paramfile exists & move old if needed
if pfile.exists():
    print(f"A file already exists at `{pfile}`, renaming previous file to `{pfile.name}.bak`")
```
Suggested change:

```diff
-print(f"A file already exists at `{pfile}`, renaming previous file to `{pfile.name}.bak`")
+print(f"A file already exists at `{pfile}`. Renaming existing file to `{pfile.name}.bak` and writing new file to `{pfile}`.")
```
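For context, the backup-then-write behavior this message describes can be sketched like this (run in a temporary directory so it's safe to execute; the filename matches the script's default, but the logic here is illustrative, not the PR's exact code):

```python
import tempfile
from pathlib import Path

# Work in a throwaway directory so this sketch doesn't touch real files.
workdir = Path(tempfile.mkdtemp())
pfile = workdir / "localref_params.yaml"
pfile.write_text("ref_rootdir: old\n")  # pretend a previous run wrote this

# If a param file already exists, move it aside to `<name>.bak`
# before writing the new one.
if pfile.exists():
    backup = pfile.with_name(pfile.name + ".bak")
    print(f"A file already exists at `{pfile.name}`. "
          f"Renaming existing file to `{backup.name}` and writing new file to `{pfile.name}`.")
    pfile.rename(backup)

pfile.write_text("ref_rootdir: new\n")
```

Note that `Path.rename` silently overwrites an existing `.bak` on most platforms, so only one previous version is preserved.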
get_refs.py (Outdated)

```python
    urllib.request.urlretrieve(file_url, outfile)
except urllib.error.URLError as e:
    print(e.reason)
    print(f"The file download failed for {file_url}, please check the URL for errors",
```
Suggested change:

```diff
-print(f"The file download failed for {file_url}, please check the URL for errors",
+print(f"The file download failed for {file_url}. Please check the URL for errors.",
```
get_refs.py (Outdated)

```python
    container_file = urllib.request.urlopen(containerfile_url)
except urllib.error.URLError as e:
    print(e.reason)
    print(f"The file download failed for {container_url}, please check the URL for errors")
```
Suggested change:

```diff
-print(f"The file download failed for {container_url}, please check the URL for errors")
+print(f"The file download failed for {container_url}. Please check the URL for errors.")
```
Just a few minor comments, but this looks good to me. I was able to test pulling the Docker images and everything seemed to work for me.
```python
# get assembly and root location
assembly = refs.get("assembly", "NA")
root_parts = refs.get("ref_rootdir").split('://')
if root_parts[0] == 's3':
```
Can you add a comment here noting that you are grabbing the bucket name? I was a bit confused about what elements 0 and 1 were.
Oh, this isn't quite correct, I realized (or isn't as flexible as it needs to be). I will add comments as I update.
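For what it's worth, the split being discussed could be annotated roughly like this (the scheme/bucket interpretation is my reading of the intent; the variable values are made up, and the real script may need to handle more URI schemes):

```python
# Sketch of the `ref_rootdir` parsing above, with the requested comments.
# Example value; the real setting comes from the config file.
ref_rootdir = "s3://scpca-references"

root_parts = ref_rootdir.split("://")
# root_parts[0] is the URI scheme (e.g. "s3"),
# root_parts[1] is everything after it: for S3, the bucket name
# (plus any key prefix, which this simple split does not separate out).
if root_parts[0] == "s3":
    bucket = root_parts[1]
    print(f"bucket: {bucket}")
```

A path like `s3://bucket/some/prefix` would leave the prefix attached to the bucket name here, which may be part of the flexibility concern mentioned above.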
I'm not sure what that would be, to be honest... I'm somewhat surprised this worked as written above: without specifying

Seems related to docker/for-mac#5139...

There should be no reason to run this in Docker, as long as you have some version of python3 installed; if I am using something not available in an earlier version of python3, I'd like to know about it!
I added a few more comments, but I was not planning to add much more. Was there something specific you were looking for?

Mostly looking for a docstring or so at the top of the script with 1-2 sentences about what the script does and why it might be useful.

Yes, that should be there; done!
👍
For the case when compute nodes may not have access to the internet, we will need to download all reference files and docker images before the workflow begins, as described in #211. This PR adds a script to perform these downloads, preserving the file structure that would be found on S3, which allows us to change only one parameter for running from these local files.
To make things a bit easier, the script also outputs a file, `local_refs.yaml` by default, which contains the settings that need updating (`ref_rootdir` is the only required one, but I also included `assembly` for clarity). I discovered that this setting cannot be changed in a user-provided config file that specifies `params.ref_rootdir` unless that config file also contains all of the affected param variables in `reference_paths.config`. This probably has to do with the order in which config files are read and some opaque precedence rules. So the solution is either to use `--ref_rootdir` on the command line, which will properly override the setting in `reference_paths.config`, or to use `-params-file local_refs.yaml` or similar. The latter is more traceable, so that is the direction I took with this download.

Downloading STAR and Cell Ranger indexes is optional (these are large), as is downloading docker or singularity images to the local cache. I ran into some strange behavior with singularity seeming to not want to download files to one cache location if it had already downloaded to another, so I force those pulls.
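As an illustration, a params file with just those two settings is simple enough to write with the standard library alone (the values below are made up, and only the two key names come from the description above; a plain string write sidesteps any PyYAML dependency):

```python
# Illustrative sketch of the kind of params file the script writes:
# `ref_rootdir` is the required setting, `assembly` is included for clarity.
# Values here are placeholders, not real reference locations.
params = {
    "assembly": "Homo_sapiens.GRCh38.104",
    "ref_rootdir": "scpca-references",
}

# Flat key: value pairs are valid YAML, so no YAML library is needed.
yaml_text = "".join(f"{key}: {value}\n" for key, value in params.items())
print(yaml_text, end="")
```

The resulting file can then be passed to the workflow with `-params-file`, which keeps the override recorded in a file rather than buried in shell history.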
One unresolved issue is downloading any Space Ranger image that might be required; the script won't get those from the GitHub repo, but that is probably part of a larger issue, if people are going to use that functionality. I will try to address how to handle this in the docs.
For now, the script is in the root of the repository. If there are thoughts on where it should ultimately live, please let me know!
Speaking of docs, those will come in a separate PR.