This repository has been archived by the owner on Jan 5, 2021. It is now read-only.

Make Bag Validation and Extraction Asynchronous to the Rest of Batch Import #857

Open
ntallman opened this issue Mar 8, 2019 · 0 comments


ntallman commented Mar 8, 2019

Batches (zipped bags) have the potential to be exceedingly large. While there may end up being an effective limit, they could in theory be a TB or more. Processing even a 45 GB bag with the CHO MVP 5 release batch import is a lengthy process. Parts of the process could become asynchronous.

Users could stage zipped bags in the network share. When CHO detects a new bag (it would somehow have to know that the file is complete and not still being copied over), it could start validating the bag and, if valid, begin extracting it.
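As a rough illustration of the detect, validate, and extract steps described above, here is a minimal Python sketch. It is not CHO code. Everything in it is an assumption made for illustration: the function names (`is_stable`, `validate_bag`, `preprocess`), the "size and mtime unchanged across a wait" heuristic for deciding a copy has finished, the assumption that bag files sit at the root of the zip, and the use of a `manifest-sha256.txt` manifest. Note that it extracts before checking payload checksums, because the checksums can only be verified against the extracted files; the zip itself is integrity-tested first.

```python
import hashlib
import os
import time
import zipfile
from pathlib import Path


def is_stable(path, wait_seconds=5.0):
    """Heuristic (assumption): treat the file as fully copied if its
    size and mtime do not change across a short wait."""
    before = os.stat(path)
    time.sleep(wait_seconds)
    after = os.stat(path)
    return (before.st_size, before.st_mtime_ns) == (after.st_size, after.st_mtime_ns)


def validate_bag(bag_dir):
    """Minimal bag check: bagit.txt exists and every entry in
    manifest-sha256.txt matches its payload file. Returns error strings."""
    bag_dir = Path(bag_dir)
    errors = []
    if not (bag_dir / "bagit.txt").is_file():
        errors.append("missing bagit.txt")
    manifest = bag_dir / "manifest-sha256.txt"
    if not manifest.is_file():
        errors.append("missing manifest-sha256.txt")
        return errors
    for line in manifest.read_text(encoding="utf-8").splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            if line.strip():
                errors.append(f"malformed manifest line: {line!r}")
            continue
        expected, rel_path = parts
        payload = bag_dir / rel_path
        if not payload.is_file():
            errors.append(f"missing payload file: {rel_path}")
            continue
        digest = hashlib.sha256()
        with payload.open("rb") as fh:
            # Read in 1 MiB chunks so large payload files are not loaded whole.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected.lower():
            errors.append(f"checksum mismatch: {rel_path}")
    return errors


def preprocess(zip_path, extract_root, wait_seconds=5.0):
    """Return {'status': 'pending' | 'ready' | 'failed', 'errors': [...]}."""
    zip_path = Path(zip_path)
    if not is_stable(zip_path, wait_seconds):
        return {"status": "pending", "errors": []}
    if not zipfile.is_zipfile(zip_path):
        return {"status": "failed", "errors": ["not a valid zip file"]}
    target = Path(extract_root) / zip_path.stem
    with zipfile.ZipFile(zip_path) as archive:
        bad_member = archive.testzip()  # CRC check of every member
        if bad_member is not None:
            return {"status": "failed", "errors": [f"corrupt zip member: {bad_member}"]}
        archive.extractall(target)
    errors = validate_bag(target)
    return {"status": "failed" if errors else "ready", "errors": errors}
```

A background job could call `preprocess` on each zip found in the share and retry later whenever the result is `pending`. The stability heuristic is only one option; having the uploader write a marker file or rename the zip once the copy finishes would be more reliable if the staging workflow allows it.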

When users initiate a batch import from the GUI, they would be able to select completed (pre-validated and extracted) bags from a dropdown. If not this, they would need some other way to know which bags have been pre-processed. After they upload a CSV, the rest of the batch import process happens.

Any zip- and bag-level errors found during pre-processing would need to be captured at that time and reported during the import preview. If we have messaging and activity streams by the time this ticket is worked on, it would be nice if malformed-zip or invalid-bag errors were immediately reported to the submitter via email.
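One way to capture pre-processing results so they can be shown later is to persist a small status record per bag. The Python sketch below is only an illustration under stated assumptions: the JSON-file-per-bag storage, the `PreprocessResult` record, and the helper names (`save_result`, `ready_bags`, `errors_for`) are all hypothetical and not part of CHO. `ready_bags` would supply the dropdown of selectable bags, and `errors_for` the errors shown in the import preview.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class PreprocessResult:
    bag_name: str
    status: str  # "ready" or "failed"
    errors: list = field(default_factory=list)


def save_result(status_dir, result):
    """Write one JSON status file per bag, at the time pre-processing runs."""
    status_dir = Path(status_dir)
    status_dir.mkdir(parents=True, exist_ok=True)
    path = status_dir / f"{result.bag_name}.json"
    path.write_text(json.dumps(asdict(result), indent=2), encoding="utf-8")
    return path


def load_results(status_dir):
    """Load every stored status record, sorted by file name."""
    return [
        PreprocessResult(**json.loads(path.read_text(encoding="utf-8")))
        for path in sorted(Path(status_dir).glob("*.json"))
    ]


def ready_bags(status_dir):
    """Names of bags that passed pre-processing (candidates for the dropdown)."""
    return [r.bag_name for r in load_results(status_dir) if r.status == "ready"]


def errors_for(status_dir, bag_name):
    """Errors captured for one bag, for display in the import preview."""
    for result in load_results(status_dir):
        if result.bag_name == bag_name:
            return result.errors
    return []
```

In a Rails application such as CHO these records would more plausibly live in the database than in files; the point is only that errors are stored when they occur and read back at preview time. Emailing the submitter could hook into `save_result` once messaging exists.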

Related: #848, #826

Related: Parallel Checksum Generation in Ruby Bagit

@ntallman ntallman added this to the 1.x Migration Ready milestone Mar 8, 2019