Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check MD5 hash before downloading #267

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from
Draft

Conversation

joeflack4
Copy link
Contributor

@joeflack4 joeflack4 commented Apr 1, 2023

Overview

Proof of concept: downloading an MD5 file and checking against a local copy of the hash the last time the file was downloaded, then doing something if they are different.

Addresses

Addresses no GitHub issue yet created. However, addresses the problem of our pipeline needlessly spending time downloading files even when they have not changed.

@joeflack4 joeflack4 requested a review from matentzn April 1, 2023 23:11
@joeflack4 joeflack4 self-assigned this Apr 1, 2023
@joeflack4 joeflack4 added the enhancement New feature or request label Apr 1, 2023
@joeflack4 joeflack4 marked this pull request as draft April 1, 2023 23:12
@joeflack4 joeflack4 linked an issue Apr 1, 2023 that may be closed by this pull request
component-download-omim.owl: | $(TMPDIR)
echo couldn't find a way that worked

# failed attempts
Copy link
Contributor Author

@joeflack4 joeflack4 Apr 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

edit: got something working; see below

I had a feeling this first attempt using ShellScript wouldn't work out. I'm not great at ShellScript, much less nesting commands, much less within make. I spent a little over an hour trying many variations.

I think I could do this if I was able to refactor the way we're downloading all of our components, perhaps (a) by splitting into multiple goals / introducing additional variables and/or temp files. Even easier if (b) leveraging Python. But, especially if that involves changing ODK, I can't see those methods being approved.

Probably best to close this and not worry about it for now, unless you want to take a stab or have me try out one of the above proposals.

Copy link
Contributor Author

@joeflack4 joeflack4 Apr 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matentzn Some success!

I should have realized earlier that this was the perfect use case for ChatGPT. Although what it suggested was not different, than one variation I saw on StackOverflow, for whatever reason I copied what it had literally this time. My problem when tried this earlier was that I included [brackets] around my if statement, copying what we are doing with with a lot of our component goals. When I removed the brackets as ChatGPT (and StackOverflow) suggested, though, it worked.

That is, this failed:
if [ diff $(TMPDIR)/tmp.md5 $(TMPDIR)/omim.ttl.md5 >/dev/null 2>&1 ]; then ...

But this succeeded:
if diff $(TMPDIR)/tmp.md5 $(TMPDIR)/omim.ttl.md5 >/dev/null 2>&1; then ...

More specifically, this works:

component-download-omim.owl: | $(TMPDIR)
	if [ $(MIR) = true ] && [ $(COMP) = true ]; then wget $(OMIM_URI).md5 -O $(TMPDIR)/tmp.md5; \
	if diff $(TMPDIR)/tmp.md5 $(TMPDIR)/omim.ttl.md5 >/dev/null 2>&1; \
		then echo same; else echo different; fi; fi

This is why I don't like ShellScript for anything more than a few lines; it's so arcane. Even something as short as this trips me up.


Now that we have a working proof of concept, need to decide what to do next.

One thing we need to do is keep a local copy of these raw downloads, e.g. omim.ttl, or at least keep its MD5 the last time it ran, e.g. $(TMPDIR)/omim.ttl.md5. Thus, we would need to change it so that robot doesn't take as input the URLs of these artefacts (e.g. $(ROBOT) merge -I https://github.com/monarch-initiative/omim/releases/latest/download/omim.ttl; we'd have to download it first, if that's alright with you.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets discuss in next call! the principle sounds great, but I would like it to be explained once face2face

…l copy of the hash the last time the file was downloaded, then doing something if they are different.
@joeflack4 joeflack4 force-pushed the md5-download-checks branch from a5040b3 to b3517e4 Compare April 2, 2023 06:18
@joeflack4 joeflack4 force-pushed the main branch 2 times, most recently from 747fceb to 01b3bbc Compare May 8, 2023 02:10
@joeflack4 joeflack4 mentioned this pull request Aug 29, 2023
@joeflack4 joeflack4 requested a review from twhetzel January 31, 2024 01:12
@joeflack4
Copy link
Contributor Author

@twhetzel FYI this is something I'd like to do at some point that would help reduce build-mondo-ingest time but especially make things run faster / easier locally. Refer to my comment in OP for what this is about.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Faster builds: md5
2 participants