
Adding the ability to specify a checkpointer for models **before** aggregation #128

Merged: 5 commits from dbe/adding_checkpointer_post_train into main on May 8, 2024

Conversation

emersodb (Collaborator) commented:

Note: There are a scary number of files changed because this affects the functionality of BasicClient in a way that requires slight modifications of many of the inheriting classes. Most of the changes should be quite minor in those downstream classes.

PR Type

Feature

Clickup Ticket(s):

  1. https://app.clickup.com/t/86880z10x
  2. https://app.clickup.com/t/860r0e3a0

Refactoring the client-side checkpointing functionality to allow for pre- and post-aggregation (server-side) checkpointing. That is, the user can specify a checkpointer for models after the weights have been aggregated by the server (still supported, but previously the only option), or prior to aggregation (i.e. right after local training). If either checkpointer is not specified, checkpointing at that point is simply skipped. Note that this allows us to mimic "fine-tuning" of models: the model's final training is exclusively local.

For example, with FedProx, the model checkpointed would actually be a "personal" model for each client (potentially after several preceding rounds of aggregation).
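As a rough illustration of the idea, the sketch below shows one way such a module could hold two optional checkpointers. The class and argument names (e.g. pre_aggregation, post_aggregation) are assumptions for illustration, not the exact API introduced by this PR.

# Illustrative sketch only: names and structure are assumptions, not this PR's actual API.
from typing import Optional

import torch
import torch.nn as nn


class SimpleTorchCheckpointer:
    # Minimal stand-in for a TorchCheckpointer-style object.
    def __init__(self, checkpoint_path: str) -> None:
        self.checkpoint_path = checkpoint_path

    def save(self, model: nn.Module) -> None:
        torch.save(model.state_dict(), self.checkpoint_path)


class ClientCheckpointModuleSketch:
    # Holds an optional checkpointer for each of the two checkpointing times.
    def __init__(
        self,
        pre_aggregation: Optional[SimpleTorchCheckpointer] = None,
        post_aggregation: Optional[SimpleTorchCheckpointer] = None,
    ) -> None:
        self.pre_aggregation = pre_aggregation
        self.post_aggregation = post_aggregation

    def maybe_checkpoint_pre_aggregation(self, model: nn.Module) -> None:
        # Right after local training, before weights are sent to the server.
        if self.pre_aggregation is not None:
            self.pre_aggregation.save(model)

    def maybe_checkpoint_post_aggregation(self, model: nn.Module) -> None:
        # After aggregated weights have been received back from the server.
        if self.post_aggregation is not None:
            self.post_aggregation.save(model)

Leaving either hook as None skips checkpointing at that stage, which is how the "fine-tuning" behaviour falls out: configure only the pre-aggregation checkpointer and the saved weights are exactly those produced by the final round of local training.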

Also allowing for more generic checkpointing functionality. That is, given a loss value and a dictionary of metrics, users can define an arbitrary scoring function on those objects to decide when a checkpoint should be saved. The best-loss checkpointer is a specific instantiation of this more general type of checkpointer.
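As a concrete example, a score function of this kind might look roughly like the sketch below. The (loss, metrics) -> float signature and the metric key are assumptions for illustration; the constructor call in the diff further down passes such a function as checkpoint_score_function with maximize=False.

# Illustrative score functions over a loss value and a metrics dictionary.
from typing import Dict

from flwr.common.typing import Scalar


def loss_score_function(loss: float, _metrics: Dict[str, Scalar]) -> float:
    # Reproduces "best loss" checkpointing: the score is just the loss, and with
    # maximize=False the checkpointer keeps the model whenever the score decreases.
    return loss


def loss_minus_accuracy_score(loss: float, metrics: Dict[str, Scalar]) -> float:
    # A hypothetical blended score that also rewards a validation metric.
    accuracy = float(metrics.get("val - accuracy", 0.0))
    return loss - accuracy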

Tests Added

Added new tests covering the new ClientSideCheckpointModule functionality, as well as the new functionality associated with the TorchCheckpointer child classes.

The first review comment is attached to this diff context:

checkpoint_dir, checkpoint_name, checkpoint_score_function=loss_score_function, maximize=False
)

def maybe_checkpoint(self, model: nn.Module, loss: float, metrics: Dict[str, Scalar]) -> None:
emersodb (Collaborator, Author) commented:

Note, I'm overriding this method to replace the logging with something more specific. If anyone has any better ideas on how to do this, let me know.
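For context, the pattern being described is roughly the self-contained sketch below; the class names, attributes, and the parent's behaviour are assumptions rather than this PR's actual code.

# Sketch only: a child class overrides maybe_checkpoint purely to swap the generic
# log message for a loss-specific one, then reuses the parent's comparison/saving logic.
import logging
from typing import Callable, Dict, Optional

import torch
import torch.nn as nn
from flwr.common.typing import Scalar

log = logging.getLogger(__name__)


class FunctionCheckpointerSketch:
    def __init__(
        self,
        checkpoint_path: str,
        checkpoint_score_function: Callable[[float, Dict[str, Scalar]], float],
        maximize: bool = False,
    ) -> None:
        self.checkpoint_path = checkpoint_path
        self.checkpoint_score_function = checkpoint_score_function
        self.maximize = maximize
        self.best_score: Optional[float] = None

    def _compare_and_save(self, model: nn.Module, score: float) -> None:
        # Save whenever the score improves in the configured direction.
        improved = self.best_score is None or (
            score > self.best_score if self.maximize else score < self.best_score
        )
        if improved:
            self.best_score = score
            torch.save(model.state_dict(), self.checkpoint_path)

    def maybe_checkpoint(self, model: nn.Module, loss: float, metrics: Dict[str, Scalar]) -> None:
        score = self.checkpoint_score_function(loss, metrics)
        # Generic message: the parent only knows about abstract "scores".
        log.info(f"Candidate score: {score}, best score so far: {self.best_score}")
        self._compare_and_save(model, score)


class BestLossCheckpointerSketch(FunctionCheckpointerSketch):
    def maybe_checkpoint(self, model: nn.Module, loss: float, metrics: Dict[str, Scalar]) -> None:
        # Overridden purely to replace the generic log line with a loss-specific one;
        # the comparison and saving logic is reused unchanged.
        score = self.checkpoint_score_function(loss, metrics)
        log.info(f"Candidate checkpoint loss: {loss}, best loss so far: {self.best_score}")
        self._compare_and_save(model, score)

Only the log message changes in the child class; the score comparison and saving behaviour stay in the parent.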

A second review comment is attached to this diff context:

@@ -631,8 +657,6 @@ def validate(self) -> Tuple[float, Dict[str, Scalar]]:
metrics = self.val_metric_manager.compute()
self._handle_logging(loss_dict, metrics, is_validation=True)

# Checkpoint based on loss which is output of user defined compute_loss method
self._maybe_checkpoint(loss_dict["checkpoint"])
emersodb (Collaborator, Author) commented:

Note that I moved this into the evaluate function rather than the validate function, as we may not want to checkpoint every time we validate.
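The resulting call ordering looks roughly like the sketch below. The real evaluate on a Flower client also receives parameters and a config, which this sketch elides, and the helper names are assumptions.

# Sketch of the ordering: checkpointing moves out of validate() and into evaluate(),
# so running validation on its own no longer writes a checkpoint.
from typing import Dict, Tuple

from flwr.common.typing import Scalar


class ClientSketch:
    def _run_validation_pass(self) -> Tuple[float, Dict[str, Scalar]]:
        # Placeholder for the actual loop over the validation loader.
        return 0.0, {}

    def _maybe_checkpoint(self, loss: float, metrics: Dict[str, Scalar]) -> None:
        # Placeholder for the client-side checkpointing hook.
        pass

    def validate(self) -> Tuple[float, Dict[str, Scalar]]:
        # Pure validation: compute loss and metrics with no checkpointing side effects.
        return self._run_validation_pass()

    def evaluate(self) -> Tuple[float, Dict[str, Scalar]]:
        # Called at evaluation time by the FL framework; checkpointing now happens here.
        loss, metrics = self.validate()
        self._maybe_checkpoint(loss, metrics)
        return loss, metrics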

@fatemetkl (Collaborator) left a comment:

LGTM! Checkpoint score functions and pre-aggregation vs. post-aggregation checkpointing are nice additions towards flexibility.

Base automatically changed from dbe/implement_fed_rep to main on May 8, 2024 at 17:54
emersodb merged commit 95a60c8 into main on May 8, 2024
6 checks passed
emersodb deleted the dbe/adding_checkpointer_post_train branch on May 8, 2024 at 19:29