Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

updating data sync models. adding results model #16

Merged
merged 5 commits into from
Nov 12, 2024

Conversation

rpmcginty
Copy link
Collaborator

@rpmcginty rpmcginty commented Nov 8, 2024

What's in this Change?

Primarily meant to address the following bug - https://alleninstitute.atlassian.net/browse/OCSDV-354

  • caching of botocore config
  • introducing DataSyncResult and BatchDataSyncResult
    • summarizes data that is transferred
    • allows to conditionally update
  • for batch jobs, we add the ability for partial success of transfers (by default this is disabled)

Primarily meant to address the following bug - https://alleninstitute.atlassian.net/browse/OCSDV-354

This also introduces a new approach to gathering results during data sync. Previously the data sync operations returned nothing and the lambda handlers would simply return the requests in the response. For a few reasons, this is not great and so I have introduced a more informative and cohesive result object that summarizes data sync operations based on bytes transferred, files transferred and for the batch case, the number of succeeded and failed tasks.

Connected PRs

Testing

  • unit tests
  • e2e tests in merscope analysis dev

Comment on lines 156 to 165
@dataclass
class DataSyncResult(SchemaModel):
bytes_transferred: Optional[int] = None
files_transferred: Optional[int] = None

def add_bytes_transferred(self, bytes_transferred: int) -> None:
self.bytes_transferred = (self.bytes_transferred or 0) + bytes_transferred

def add_files_transferred(self, files_transferred: int) -> None:
self.files_transferred = (self.files_transferred or 0) + files_transferred
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a meaningful difference between having the default be None vs having bytes_transferred and files_transferred default to 0?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the main difference is that it helps distinguish a result that explicitly tracks the bytes and files transferred vs a result where those things are not tracked. otherwise we might confuse ourselves with a result that says zero bytes are transferred and think something is wrong

Copy link
Collaborator

@njmei njmei Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would prefer for a data sync where "results are not tracked" to not return a DataSyncResult at all (and instead return None or something).

@rpmcginty rpmcginty requested a review from njmei November 9, 2024 00:53
@rpmcginty rpmcginty merged commit dffe636 into main Nov 12, 2024
4 checks passed
@rpmcginty rpmcginty deleted the bugfix/data-sync-memory-leak branch November 12, 2024 20:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants