Cache list_objects_v2() to speed up the file cue for cloud objects #1172
Comments
For this to work, I think I will need to switch to using ETags as hashes instead of the |
Roadmap for AWS:
Unfortunately list_objects_v2() does not return version IDs, and list_object_versions() returns too much information (never just the most current objects). So it looks like this caching will not be version-aware and will have to fall back on HEAD requests if you |
For GCS, it might be good to just switch to ETags for the next release, then wait for cloudyr/googleCloudStorageR#179. |
Hmm.... I don't think we need to switch to ETags for hashes. We could just store the ETag as part of the metadata and use ETags instead of versions to corroborate objects. |
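A minimal sketch of that corroboration idea, written in Python purely for illustration (targets itself is an R package, and this is not its implementation). The function name stale_keys and both dictionaries are hypothetical: recorded stands for the ETags stored in pipeline metadata, and listed stands for the ETags returned by one batched listing of the bucket.

```python
from typing import Dict, List


def stale_keys(recorded: Dict[str, str], listed: Dict[str, str]) -> List[str]:
    """Return keys whose recorded ETag is not corroborated by the listing.

    A key counts as stale when it is missing from the listing or when the
    listed ETag differs from the one stored in metadata. Only these keys
    would need a rerun or a fallback per-object check.
    """
    return sorted(key for key, etag in recorded.items() if listed.get(key) != etag)
```

Objects that appear in the listing but not in the metadata are ignored here, since they do not correspond to any recorded target.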
I thought this through a bit more, and unfortunately this batched caching feature no longer seems feasible. As I said before, On the other hand, neither These and similar problems are impossible to reconcile unless:
(2) seems impossible, so I think we have to stick with (1). |
Tried to send a feature request on their feedback form, but it's glitchy today:
|
Note to self: if it ever becomes possible to revisit this issue, I will probably need to switch
Line 227 in 13470ef
Line 249 in 13470ef
|
Taking a step back: this is actually feasible if
|
Taking another step back:
(1) ensures behavior is clear, consistent, compatible, and version-aware. (2) ensures a target reruns if it is not the current object in the bucket. (2) also makes this issue so much easier to implement. And it lets us avoid adding a new
|
Under the default settings for cloud storage, targets checks each and every target hash with its own AWS API call, which is extremely time-consuming. This is why https://books.ropensci.org/targets/cloud-storage.html recommends tar_cue(file = FALSE) for large pipelines on the cloud. This is fine if you're not manually modifying objects in the bucket, but it is not ideal. It would be better to find a safer way to speed up targets when it checks that cloud objects are up to date.

Previously I posted #1131. Versioning might not be a problem if we assume most of the objects are in their current version most of the time. However, list_objects_v2() operates on whole prefixes, which might slow us down because it operates on more objects than we really need. And then there's pagination to contend with. This functionality is worth revisiting, but the ideas I have so far range from painful to infeasible.
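To make the prefix and pagination concern concrete, here is a rough sketch in Python (for illustration only; it is not how targets, an R package, implements anything). The function list_etags and its fetch_page argument are hypothetical. It assumes fetch_page accepts the keyword arguments Prefix and ContinuationToken and returns a dictionary shaped like an S3 ListObjectsV2 response, with the fields Contents, Key, ETag, IsTruncated, and NextContinuationToken.

```python
from typing import Callable, Dict


def list_etags(fetch_page: Callable[..., dict], prefix: str) -> Dict[str, str]:
    """Collect {key: etag} for every object under a prefix.

    Each call to fetch_page returns one page of results, so the loop
    keeps requesting pages until the response is no longer truncated.
    """
    etags: Dict[str, str] = {}
    kwargs = {"Prefix": prefix}
    while True:
        page = fetch_page(**kwargs)
        # "Contents" is absent when a page has no matching objects.
        for obj in page.get("Contents", []):
            # S3 wraps ETag values in double quotes; drop them.
            etags[obj["Key"]] = obj["ETag"].strip('"')
        if not page.get("IsTruncated"):
            return etags
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
```

With boto3, fetch_page could be something like functools.partial(client.list_objects_v2, Bucket="my-bucket"), where the bucket name is a placeholder. The cost noted above still applies: the number of requests grows with the number of objects under the prefix, not with the number of targets that actually need checking.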