
Fix overwrite directories #86

Merged: 6 commits merged into main from u/stuart/overwrite_datasets on Dec 6, 2023

Conversation

stuartmcalpine (Collaborator)

Overwriting a directory requires deleting the previous directory first for the copy to work.

In this branch

  • Moved the data-copying code into a function in registrar_util
  • When copying a directory, the destination is deleted first if it already exists
  • Added a CI test for the copy_data function

elif dataset_organization == "directory":
    # If the directory already exists, need to delete it first
    if os.path.isdir(dest):
        rmtree(dest)
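For context, a minimal sketch of what a copy_data-style helper in registrar_util could look like under this approach; the real signature and return values are not shown in this thread, so the arguments here are illustrative.

# Sketch only: assumes shutil does the actual copies; the real helper may differ.
import os
from shutil import copyfile, copytree, rmtree

def copy_data(source, dest, dataset_organization):
    """Copy a single file or a whole directory into the registry area."""
    if dataset_organization == "directory":
        # copytree refuses to overwrite an existing tree, so delete it first
        if os.path.isdir(dest):
            rmtree(dest)
        copytree(source, dest)
    else:
        # Single-file datasets can simply be copied over the old file
        copyfile(source, dest)

This is the behavior the review comment below pushes back on: if the copy fails after the rmtree, the old dataset is already gone.
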
JoanneBogart (Collaborator)

If anything goes wrong with the subsequent copy, the old dataset will be gone and it won't be possible to recover the old state. It would be safer to mv the old dataset somewhere (e.g. to something with a path similar to the true destination, but perhaps starting with a . or with strange characters - emacs uses # at the beginning and end of the filename - strategically placed), copy in the new data, then if that succeeds do rmtree.

If we're feeling really paranoid, the rmtree shouldn't happen until after the db row is successfully created. And for datasets which are just files we should also do just a mv initially to a similar name, then rm after the database entry is made.
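A sketch of that pattern, assuming shutil for the moves and the "#" naming mentioned above; the backup naming scheme and function name are illustrative, not the actual code:

import os
from shutil import copytree, move, rmtree

def overwrite_directory(source, dest):
    # Move the old dataset aside under a "hidden" name rather than deleting it
    backup = os.path.join(os.path.dirname(dest), "#" + os.path.basename(dest) + "#")
    had_old = os.path.isdir(dest)
    if had_old:
        move(dest, backup)
    # If this copy raises, the backup is still on disk and the old state is recoverable
    copytree(source, dest)
    # Only once the copy has succeeded is the old dataset discarded
    if had_old:
        rmtree(backup)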

stuartmcalpine (Collaborator, Author)

I've changed it as suggested.

When the destination exists, mv it to a backup, copy the data, then delete the backup if the copy succeeds. If the try block fails, mv the backup back to the original location.

If it's an individual file being ingested, a checksum comparison is also performed.

I think wrapping the entire database entry in a try block, so the backup is only deleted once the entry also succeeds, would be a bit cumbersome.
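As a rough illustration of the single-file check, something along these lines; the checksum algorithm actually used in the registry code is not stated here, so sha256 and the function names are assumptions:

import hashlib

def _sha256(path, chunk_size=1024 * 1024):
    # Stream the file in chunks so large datasets do not need to fit in memory
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_file_copy(source, dest):
    # Raise if the copied file does not match the original
    if _sha256(source) != _sha256(dest):
        raise RuntimeError(f"Checksum mismatch after copying {source} to {dest}")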

Commit: …py is done, and the backup is deleted if successful

stuartmcalpine (Collaborator, Author)

Now the order is reversed (see the sketch below):

  • First, the entry in the database is created (at this stage is_valid=False for the dataset)
  • If that succeeds, the data is then copied
  • If the copy succeeds, is_valid is set to True and the dataset entry is updated with the file info (nfiles, file_size, etc.)
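An illustrative outline of that flow, with a plain dict standing in for the real database table and the copy_data helper sketched earlier; num_files and total_size mirror the return values in the diff below, everything else is invented for illustration:

import os

def register_dataset(registry, name, source, dest, dataset_organization):
    # 1) Create the database entry first, flagged as not yet valid
    registry[name] = {"relative_path": dest, "is_valid": False}

    # 2) The entry exists, so now copy the data into place
    copy_data(source, dest, dataset_organization)

    # 3) The copy succeeded: mark the entry valid and record the file info
    if dataset_organization == "directory":
        files = [os.path.join(r, f) for r, _, fnames in os.walk(dest) for f in fnames]
    else:
        files = [dest]
    registry[name].update(
        is_valid=True,
        num_files=len(files),
        total_size=sum(os.path.getsize(f) for f in files),
    )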

JoanneBogart (Collaborator) left a comment:

I think we're better off with your rearrangement (make a tentative db entry, then update it if everything succeeds), but it would be neater if you were consistent about reporting errors.
I don't think I need to see this again, but I suggest you change _copy_file to just throw an exception rather than returning a bad status (and then _handle_data doesn't need to return a status either). I made some inline comments all saying essentially the same thing.


- return dataset_organization, num_files, total_size, ds_creation_date
+ return dataset_organization, num_files, total_size, ds_creation_date, success
JoanneBogart (Collaborator)

You probably don't need to add another return value for "success". E.g., setting all the other return values to None could signal failure.
Oh, I see: it's specifically for copy failure. But for another type of failure (file not found) _handle_data raises an exception. In the end, the result is the same: the database entry is there but marked as invalid, which is fine. I think the code will be clearer if all errors are handled the same way. Probably always throwing an exception is simplest. At the end of _copy_data you could just re-raise the original exception, possibly with a bit of extra explanatory text, like the print message which is in there now.
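A sketch of that style, where the copy helper carries no success flag and simply re-raises with a short message; the name _copy_data comes from the comment above, the rest is illustrative:

from shutil import copyfile, copytree

def _copy_data(dataset_organization, source, dest):
    try:
        if dataset_organization == "directory":
            copytree(source, dest)
        else:
            copyfile(source, dest)
    except Exception as e:
        # Keep the explanatory print, but let callers see the original error
        print(f"Something went wrong copying {source} to {dest}: {e}")
        raise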

stuartmcalpine (Collaborator, Author)

Agreed. I've changed the code to raise an exception in copy_data and removed the extra error handling.

Resolved review threads:
  • src/dataregistry/registrar.py
  • src/dataregistry/registrar.py (outdated)
  • src/dataregistry/registrar.py (outdated)
  • src/dataregistry/registrar_util.py (outdated)
stuartmcalpine merged commit 96b1a50 into main on Dec 6, 2023 (8 checks passed).
stuartmcalpine deleted the u/stuart/overwrite_datasets branch on December 6, 2023 at 08:16.