
Fix overwrite directories #86

Merged: 6 commits merged into main from u/stuart/overwrite_datasets on Dec 6, 2023

Conversation

stuartmcalpine (Collaborator)

Overwriting a directory requires deleting the previous directory first for the copy to work.

In this branch

  • Moved the data-copying code into a function in registrar_util
  • When copying a directory, the destination is deleted first if it already exists
  • Added a CI test for the copy_data function

elif dataset_organization == "directory":
    # If the directory already exists, need to delete it first
    if os.path.isdir(dest):
        rmtree(dest)
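For context, a minimal sketch of what a copy_data-style helper in registrar_util could look like under this approach; the real signature and return values are not shown in this thread, so the arguments here are illustrative.

# Sketch only: assumes shutil does the actual copies; the real helper may differ.
import os
from shutil import copyfile, copytree, rmtree

def copy_data(source, dest, dataset_organization):
    """Copy a single file or a whole directory into the registry area."""
    if dataset_organization == "directory":
        # copytree refuses to overwrite an existing tree, so delete it first
        if os.path.isdir(dest):
            rmtree(dest)
        copytree(source, dest)
    else:
        # Single-file datasets can simply be copied over the old file
        copyfile(source, dest)

This is the behavior the review comment below pushes back on: if the copy fails after the rmtree, the old dataset is already gone.
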
JoanneBogart (Collaborator)

If anything goes wrong with the subsequent copy, the old dataset will be gone and it won't be possible to recover the old state. It would be safer to mv the old dataset somewhere (e.g. to something with a path similar to the true destination, but perhaps starting with a . or with strange characters - emacs uses # at the beginning and end of the filename - strategically placed), copy in the new data, then if that succeeds do rmtree.

If we're feeling really paranoid, the rmtree shouldn't happen until after the db row is successfully created. And for datasets which are just files we should also do just a mv initially to a similar name, then rm after the database entry is made.
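A sketch of that pattern, assuming shutil for the moves and the "#" naming mentioned above; the backup naming scheme and function name are illustrative, not the actual code:

import os
from shutil import copytree, move, rmtree

def overwrite_directory(source, dest):
    # Move the old dataset aside under a "hidden" name rather than deleting it
    backup = os.path.join(os.path.dirname(dest), "#" + os.path.basename(dest) + "#")
    had_old = os.path.isdir(dest)
    if had_old:
        move(dest, backup)
    # If this copy raises, the backup is still on disk and the old state is recoverable
    copytree(source, dest)
    # Only once the copy has succeeded is the old dataset discarded
    if had_old:
        rmtree(backup)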

stuartmcalpine (Collaborator, Author)

I've changed it as suggested.

When the destination exists, mv it to a backup, copy the data, then delete the backup if the copy succeeds. If the try block fails, mv the backup back to the original location.

If it's an individual file being ingested, a checksum comparison is also performed.

I think wrapping the entire database entry in a try block, so the backup is only deleted once the entry also succeeds, would be a bit cumbersome.
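As a rough illustration of the single-file check, something along these lines; the checksum algorithm actually used in the registry code is not stated here, so sha256 and the function names are assumptions:

import hashlib

def _sha256(path, chunk_size=1024 * 1024):
    # Stream the file in chunks so large datasets do not need to fit in memory
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_file_copy(source, dest):
    # Raise if the copied file does not match the original
    if _sha256(source) != _sha256(dest):
        raise RuntimeError(f"Checksum mismatch after copying {source} to {dest}")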

Commit: …py is done, and the backup is deleted if successful

stuartmcalpine (Collaborator, Author)

Now the order is reversed (see the sketch below):

  • First, the entry in the database is created (at this stage is_valid=False for the dataset)
  • If that succeeds, the data is then copied
  • If the copy succeeds, is_valid is set to True and the dataset entry is updated with the file info (nfiles, file_size, etc.)
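An illustrative outline of that flow, with a plain dict standing in for the real database table and the copy_data helper sketched earlier; num_files and total_size mirror the return values in the diff below, everything else is invented for illustration:

import os

def register_dataset(registry, name, source, dest, dataset_organization):
    # 1) Create the database entry first, flagged as not yet valid
    registry[name] = {"relative_path": dest, "is_valid": False}

    # 2) The entry exists, so now copy the data into place
    copy_data(source, dest, dataset_organization)

    # 3) The copy succeeded: mark the entry valid and record the file info
    if dataset_organization == "directory":
        files = [os.path.join(r, f) for r, _, fnames in os.walk(dest) for f in fnames]
    else:
        files = [dest]
    registry[name].update(
        is_valid=True,
        num_files=len(files),
        total_size=sum(os.path.getsize(f) for f in files),
    )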

JoanneBogart (Collaborator) left a comment:

I think we're better off with your rearrangement (make a tentative db entry, then update it if everything succeeds), but it would be neater if you were consistent about reporting errors.
I don't think I need to see this again, but I suggest you change _copy_file to just throw an exception rather than returning a bad status (and then _handle_data doesn't need to return a status either). I made some inline comments all saying essentially the same thing.


- return dataset_organization, num_files, total_size, ds_creation_date
+ return dataset_organization, num_files, total_size, ds_creation_date, success
JoanneBogart (Collaborator)

You probably don't need to add another return value for "success". E.g., setting all the other return values to None could signal failure.
Oh, I see: it's specifically for copy failure. But for another type of failure (file not found) _handle_data raises an exception. In the end, the result is the same: the database entry is there but marked as invalid, which is fine. I think the code will be clearer if all errors are handled the same way. Probably always throwing an exception is simplest. At the end of _copy_data you could just re-raise the original exception, possibly with a bit of extra explanatory text, like the print message which is in there now.
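A sketch of that style, where the copy helper carries no success flag and simply re-raises with a short message; the name _copy_data comes from the comment above, the rest is illustrative:

from shutil import copyfile, copytree

def _copy_data(dataset_organization, source, dest):
    try:
        if dataset_organization == "directory":
            copytree(source, dest)
        else:
            copyfile(source, dest)
    except Exception as e:
        # Keep the explanatory print, but let callers see the original error
        print(f"Something went wrong copying {source} to {dest}: {e}")
        raise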

stuartmcalpine (Collaborator, Author)

Agreed. I've changed the code to raise an exception in copy_data and removed the extra error handling.

Resolved review threads:
  • src/dataregistry/registrar.py
  • src/dataregistry/registrar.py (outdated)
  • src/dataregistry/registrar.py (outdated)
  • src/dataregistry/registrar_util.py (outdated)
stuartmcalpine merged commit 96b1a50 into main on Dec 6, 2023 (8 checks passed).
stuartmcalpine deleted the u/stuart/overwrite_datasets branch on December 6, 2023 at 08:16.