Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add script for downloading local reference files and images #230

Merged
merged 21 commits into from
Nov 1, 2022

Conversation

jashapiro
Copy link
Member

For the case when compute nodes may not have access to the internet, we will need to download all reference files and docker images before the workflow begins, as described in #211. This PR adds a script to perform these downloads, preserving the file structure that would be found on S3, which allows us to change only one parameter for running from these local files.

To make things a bit easier, the script also outputs a file, local_refs.yaml by default, which contains the setting that needs updating (ref_rootdir, is the only required one, but I also included assembly for clarity). I discovered that this setting can not be changed in a user-provided config file that specifies params.ref_rootdir unless that config file also contains all of the affected param variables in reference_paths.config. This probably has to do with the order in which config files are read and some opaque precedence rules. So the solution is either to use --ref_rootdir on the command line, which will properly override the setting in reference_paths.config or to use -params-file local_refs.yaml or similar. The latter is more traceable, so that is the direction I took with this download.

Downloading STAR and Cell Ranger indexes is optional (these are large), as is downloading docker or singularity packages to the local cache. I ran into some strange behavior with singularity seeming to not wanting to download files to one cache location if it had already downloaded to another, so I force those pull requests.

One unresolved issue is downloading any Space Ranger image that might be required; it won't get those from the github repo, but that is probably part of a larger issue, if people are going to use that functionality. I will try to address the way to handle this in docs.

For now, the script is in the root of the repository. If there are thoughts on where it should ultimately live, please let me know!

Speaking of docs, those will come in a separate PR.

Copy link
Member

@allyhawkins allyhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this! My one main ideas is if you think there is a way we could get the paths to the reference files from the config file we have rather than directly declaring them here. I'm just worried about things getting out of sync in the future, but if it's too much of a headache this way works too.

get_refs.py Outdated Show resolved Hide resolved
get_refs.py Outdated Show resolved Hide resolved
get_refs.py Show resolved Hide resolved
get_refs.py Outdated

## download all the files and put them in the correct locations ##
print("Downloading reference files...")
for path in ref_paths[0:2]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for path in ref_paths[0:2]:
for path in ref_paths:

I don't think this should be here, otherwise it would only download the first 2

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes, that was for testing. I thought I took it out!

@jashapiro
Copy link
Member Author

My one main ideas is if you think there is a way we could get the paths to the reference files from the config file we have rather than directly declaring them here.

Okay, I have now implemented this. I thought I was being clever enough to make it work with the version of reference_paths.config that is in main, but there were more changes there than I expected, and the logic to handle them does not seem worth it at all. So if you want to test this, you will need to run it with --revision development.

@jashapiro jashapiro requested a review from allyhawkins October 28, 2022 18:21
Copy link
Member

@sjspielman sjspielman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mostly left some textual comments here!

Speaking of docs, those will come in a separate PR.

Checking that this also will include some additional comments within the script?

This is almost certainly a problem on my end, but just in case it's not! I was running from the continuumio/miniconda3:4.10.3p0 container, and encountered this excitement:

(base) root@b815efb3cc5b:/home# python3 get_refs.py 
Getting list of required reference files
Downloading reference files...
Getting homo_sapiens/ensembl-104/fasta/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
Getting homo_sapiens/ensembl-104/fasta/Homo_sapiens.GRCh38.dna.primary_assembly.fa.fai
Getting homo_sapiens/ensembl-104/annotation/Homo_sapiens.GRCh38.104.gtf.gz
Getting homo_sapiens/ensembl-104/annotation/Homo_sapiens.GRCh38.104.mitogenes.txt
Getting homo_sapiens/ensembl-104/annotation/Homo_sapiens.GRCh38.104.spliced_intron.tx2gene_3col.tsv
Getting homo_sapiens/ensembl-104/annotation/Homo_sapiens.GRCh38.104.spliced_cdna.tx2gene.tsv
Getting homo_sapiens/ensembl-104/salmon_index/Homo_sapiens.GRCh38.104.spliced_intron.txome/complete_ref_lens.bin
Getting homo_sapiens/ensembl-104/salmon_index/Homo_sapiens.GRCh38.104.spliced_intron.txome/ctable.bin
ERRO[2189] error waiting for container: invalid character 'c' looking for beginning of value 

I'm blaming this on weird comms with the docker engine, but wanted to share just in case this happened to someone else and it's not just a me-problem. I'm re-running now in my conda base environment since Docker got weirdly borked after this and required a couple restarts for the daemon to get back up and running. So far, so good.

parser.add_argument("--paramfile", type=str,
default="localref_params.yaml",
help = "nextflow param file to write (default: `localref_params.yaml`)")
parser.add_argument("--revision", type=str,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might suggest this be named --version, --workflow_version, --release, etc. or similar. Up to you.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am using revision because that is the term used by nextflow. I don't love it, but I wanted to be consistent with the terminology used there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed for consistency! But it's a disappointing fact to learn as I start my nextflow learning adventure.

get_refs.py Outdated
ref_file = urllib.request.urlopen(reffile_url)
except urllib.error.URLError as e:
print(e.reason)
print(f"The file download failed for {reffile_url}, please check the URL for errors")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
print(f"The file download failed for {reffile_url}, please check the URL for errors")
print(f"The file download failed for {reffile_url}. Please check the URL for errors.")

get_refs.py Outdated
parser.add_argument("--refdir", type=str,
default="scpca-references",
help = "destination directory for downloaded reference files")
parser.add_argument("--replace",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest --overwrite here instead of --replace. Maybe even overwrite_refs?

get_refs.py Show resolved Hide resolved
get_refs.py Show resolved Hide resolved
default="localref_params.yaml",
help = "nextflow param file to write (default: `localref_params.yaml`)")
parser.add_argument("--revision", type=str,
default="main",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if there is a reason to set this default to development instead, since that matches the current functionality. But, I am disagreeing with myself a good deal even as I write this, because of this that will get messy when this eventually is merged into main and until merge one just provides via command line.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes... in theory it should work with main once merged, so this should only be a problem for us and the initial testers, who will get a bit more hand-holding.

get_refs.py Outdated
ref_paths += [barcode_dir / f for f in barcode_files]

## download all the files and put them in the correct locations ##
print("Downloading reference files...")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be nice to add a blurb about "this will take a while, why not go take a coffee break?" Any kind of expectation-setting statement about runtime, ☕ or otherwise.

get_refs.py Outdated
pfile = Path(args.paramfile)
# check if paramfile exists & move old if needed
if pfile.exists():
print(f"A file already exists at `{pfile}`, renaming previous file to `{pfile.name}.bak`")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
print(f"A file already exists at `{pfile}`, renaming previous file to `{pfile.name}.bak`")
print(f"A file already exists at `{pfile}`. Renaming existing file to `{pfile.name}.bak` and writing new file to `{pfile}`.")

get_refs.py Outdated
urllib.request.urlretrieve(file_url, outfile)
except urllib.error.URLError as e:
print(e.reason)
print(f"The file download failed for {file_url}, please check the URL for errors",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
print(f"The file download failed for {file_url}, please check the URL for errors",
print(f"The file download failed for {file_url}. Please check the URL for errors.",

get_refs.py Outdated
container_file = urllib.request.urlopen(containerfile_url)
except urllib.error.URLError as e:
print(e.reason)
print(f"The file download failed for {container_url}, please check the URL for errors")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
print(f"The file download failed for {container_url}, please check the URL for errors")
print(f"The file download failed for {container_url}. Please check the URL for errors.")

Copy link
Member

@allyhawkins allyhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few minor comments, but this looks good to me. I was able to test pulling docker and everything seemed to work for me.

get_refs.py Outdated Show resolved Hide resolved
# get assembly and root location
assembly = refs.get("assembly", "NA")
root_parts = refs.get("ref_rootdir").split('://')
if root_parts[0] == 's3':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment here that you are grabbing the bucket name? I was a bit confused here on what 0, 1 were

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, this isn't quite correct, I realized (or as flexible as it needs to be). I will add comments as I update.

@jashapiro
Copy link
Member Author

This is almost certainly a problem on my end, but just in case it's not! I was running from the continuumio/miniconda3:4.10.3p0 container, and encountered this excitement:

(base) root@b815efb3cc5b:/home# python3 get_refs.py 

I'm not sure what that would be, to be honest... I'm somewhat surprised this worked as written above: without specifying development I would have expected failure based on how the input file changed.

Seems related to docker/for-mac#5139... There should be no reason to run this in docker, as long as you have some version of python3 installed; if I am using something not available in an earlier version of python3, I'd like to know about it!

Checking that this also will include some additional comments within the script?

I added a few more comments, but I was not planning to add much more. Was there something specific you were looking for?

@sjspielman
Copy link
Member

I added a few more comments, but I was not planning to add much more. Was there something specific you were looking for?

Mostly looking for a docstring or so at the top of the script with 1-2 sentences about what the script does and why it might be useful.

@jashapiro
Copy link
Member Author

I added a few more comments, but I was not planning to add much more. Was there something specific you were looking for?

Mostly looking for a docstring or so at the top of the script with 1-2 sentences about what the script does and why it might be useful.

Yes, that should be there; done!

Copy link
Member

@sjspielman sjspielman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@jashapiro jashapiro merged commit 3b8b5a3 into development Nov 1, 2022
@jashapiro jashapiro deleted the jashapiro/local-refs branch November 1, 2022 19:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants