Fix overwrite directories #86
src/dataregistry/registrar_util.py
Outdated
elif dataset_organization == "directory":
    # If the directory already exists, need to delete it first
    if os.path.isdir(dest):
        rmtree(dest)
If anything goes wrong with the subsequent copy, the old dataset will be gone and it won't be possible to recover the old state. It would be safer to `mv` the old dataset somewhere (e.g. to a path similar to the true destination, but perhaps starting with a `.` or with strange characters strategically placed; emacs uses `#` at the beginning and end of the filename), copy in the new data, then if that succeeds do `rmtree`.
If we're feeling really paranoid, the `rmtree` shouldn't happen until after the db row is successfully created. And for datasets which are just files we should also initially do just a `mv` to a similar name, then `rm` after the database entry is made.
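The move-aside approach suggested above could look roughly like the following. This is only a sketch: the function name `copy_with_backup` and the `.<name>.bak` backup naming are hypothetical, not taken from the PR. It returns the backup path so the caller can decide when to delete it (for example, only after the database row has been created).

```python
import os
import shutil


def copy_with_backup(src, dest):
    """Copy directory src to dest, keeping any existing dest until the copy succeeds.

    Returns the backup path (or None if dest did not exist). The caller is
    responsible for removing the backup once it is no longer needed.
    """
    backup = None
    if os.path.isdir(dest):
        # Hidden sibling of dest, e.g. /data/.mydataset.bak (naming is an assumption)
        parent, name = os.path.split(os.path.normpath(dest))
        backup = os.path.join(parent, f".{name}.bak")
        if os.path.exists(backup):
            raise FileExistsError(f"Stale backup already present: {backup}")
        shutil.move(dest, backup)
    try:
        shutil.copytree(src, dest)
    except Exception:
        # Discard any partial copy and put the old dataset back
        shutil.rmtree(dest, ignore_errors=True)
        if backup is not None:
            shutil.move(backup, dest)
        raise
    return backup
```

A caller would then do something like `backup = copy_with_backup(src, dest)`, create the database entry, and only afterwards call `shutil.rmtree(backup)` if `backup` is not `None`.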
I've changed it as suggested.
When the destination exists, it is moved (`mv`) to a backup, the data is copied, and the backup is deleted if the copy succeeds. If the `try` fails, the backup is moved back to the original location.
If it's an individual file being ingested, a checksum is also performed.
I think wrapping the entire database entry in a `try`, to check that the entry is also successful before deleting the backup, would be a bit cumbersome.
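For reference, a checksum comparison after copying a single file might be sketched as below. The function names and the choice of SHA-256 are assumptions for illustration; the comment above does not say which hash the PR actually uses.

```python
import hashlib
import shutil


def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_file_verified(src, dest):
    """Copy a single file and raise if the copy does not match the source."""
    shutil.copyfile(src, dest)
    if file_checksum(src) != file_checksum(dest):
        raise IOError(f"Checksum mismatch after copying {src} to {dest}")
```

Reading in chunks keeps memory use flat for large dataset files.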
…py is done, and the backup is deleted is success
Now the order is reversed:
I think we're better off with your rearrangement (make a tentative db entry, then update it if everything succeeds), but it would be neater if you were consistent about reporting errors.
I don't think I need to see this again, but I suggest you change `_copy_file` to just throw an exception rather than returning a bad status (and then `_handle_data` doesn't need to return a status either). I made some inline comments all saying essentially the same thing.
src/dataregistry/registrar.py
Outdated
- return dataset_organization, num_files, total_size, ds_creation_date
+ return dataset_organization, num_files, total_size, ds_creation_date, success
You probably don't need to add another return value for "success". E.g., setting all the other return values to `None` could signal failure.
Oh, I see: it's specifically for copy failure. But for another type of failure (file not found) `_handle_data` raises an exception. In the end, the result is the same: the database entry is there but marked as invalid, which is fine. I think the code will be clearer if all errors are handled the same way. Probably always throwing an exception is simplest. At the end of `_copy_data` you could just re-raise the original exception, possibly with a bit of extra explanatory text, like the print message which is in there now.
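One way to re-raise with extra explanatory text is exception chaining, sketched below. The function name `copy_data_or_raise` and the message wording are hypothetical; this is not the project's `_copy_data`, just an illustration of the pattern.

```python
import shutil


def copy_data_or_raise(src, dest):
    """Copy a directory tree; on failure, raise with context instead of returning a status."""
    try:
        shutil.copytree(src, dest)
    except Exception as err:
        # "from err" keeps the original exception attached as __cause__,
        # so the traceback still shows what actually went wrong.
        raise RuntimeError(f"Failed to copy {src} to {dest}: {err}") from err
```

With this shape the caller needs no `success` flag: a normal return means the copy worked, and any failure propagates as an exception like the file-not-found case already does.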
Agreed. I've changed the code to raise an exception in `copy_data`, and removed the extra error handling.
Overwriting directories requires deleting the previous directory to work.
In this branch