Distributed Shampoo is a preconditioned stochastic gradient optimizer in the adaptive gradient (Adagrad) family of methods [1, 2]. By leveraging neural network-specific structure, it converges faster, achieving comparable model quality/accuracy in fewer iterations or epochs (at the cost of additional FLOPs and memory), or higher model quality in the same number of iterations or epochs. Our implementation offers specialized support for serial, Distributed Data Parallel (DDP), Fully Sharded Data Parallel (FSDP), per-parameter Fully Sharded Data Parallel (FSDPv2, to be released in PyTorch 2.6), and Hybrid Sharding Data Parallel (HSDP) training.
Distributed Shampoo currently only supports dense parameters.
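Since only dense parameters are supported, parameters that receive sparse gradients (e.g., embedding tables created with `nn.Embedding(..., sparse=True)`) need to be handled by a separate optimizer. The following is a minimal, hypothetical sketch of one way to split the parameters; `instantiate_model` is a placeholder as in the examples below, and the choice of `SparseAdam` for the sparse-gradient parameters is only illustrative:

```python
from torch import nn
from torch.optim import SparseAdam

from distributed_shampoo import AdamGraftingConfig, DistributedShampoo

model = instantiate_model()  # placeholder, as in the examples below

# Route parameters with sparse gradients (sparse embedding tables) to a separate
# optimizer; Distributed Shampoo only handles the remaining (dense) parameters.
sparse_params = [
    p
    for m in model.modules()
    if isinstance(m, nn.Embedding) and m.sparse
    for p in m.parameters()
]
dense_params = [p for p in model.parameters() if all(p is not q for q in sparse_params)]

dense_optimizer = DistributedShampoo(
    dense_params,
    lr=0.001,
    betas=(0.9, 0.999),
    epsilon=1e-12,
    grafting_config=AdamGraftingConfig(beta2=0.999, epsilon=1e-08),
)
sparse_optimizer = SparseAdam(sparse_params, lr=0.001)  # illustrative choice
```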
The key to tuning this optimizer is to balance accuracy, performance, and memory. This is discussed in the Step-by-Step Guide below.
Developers:
- Hao-Jun Michael Shi (Meta Platforms, Inc.)
- Tsung-Hsien Lee
- Anna Cai (Meta Platforms, Inc.)
- Runa Eschenhagen (University of Cambridge)
- Shintaro Iwasaki (Meta Platforms, Inc.)
- Ke Sang (Meta Platforms, Inc.)
- Wang Zhou (Meta Platforms, Inc.)
with contributions and support from:
Ganesh Ajjanagadde (Meta), Rohan Anil (Google), Adnan Aziz (Meta), Pavan Balaji (Meta), Shuo Chang (Meta), Weiwei Chu (Meta), Assaf Eisenman (Meta), Will Feng (Meta), Zhuobo Feng (Meta), Jose Gallego-Posada (Mila / Meta Platforms, Inc.), Avirup Ghosh (Meta), Yizi Gu (Meta), Vineet Gupta (Google), Yuchen Hao (Meta), Brian Hirsh (Meta), Yusuo Hu (Meta), Yuxi Hu (Meta), Minhui Huang (Meta), Guna Lakshminarayanan (Meta), Michael Lazos (Meta), Zhijing Li (Meta), Ming Liang (Meta), Wanchao Liang (Meta), Ying Liu (Meta), Wenguang Mao (Meta), Dheevatsa Mudigere (NVIDIA), Maxim Naumov (Meta), Jongsoo Park (Meta), Mike Rabbat (Meta), Kaushik Rangadurai (Meta), Dennis van der Staay (Meta), Fei Tian (Meta), Rohan Varma (Meta), Sanjay Vishwakarma (Meta), Xunnan (Shawn) Xu (Meta), Jiyan Yang (Meta), Chunxing Yin (Meta), Iris Zhang (Meta), Chuanhao Zhuge (Meta), and Will Zou (Meta).
Key distinctives of this implementation include:
- Homogeneous multi-node multi-GPU support in PyTorch.
- Learning rate grafting [3]. Our version of grafting only grafts the second moment/diagonal preconditioner. Momentum/first moment updates are performed separately from grafting. The following grafting methods are supported:
- SGD
- Adagrad
- RMSProp
- Adam
- Supports both normal and AdamW (decoupled) weight decay.
- Incorporates exponential moving averaging (with or without bias correction) to estimate the first moment (akin to Adam).
- Incorporates momentum and Nesterov acceleration.
- Offers multiple approaches for computing the root inverse, including:
- Using symmetric eigendecomposition (used by default).
- Coupled inverse Newton iteration [4].
- Higher-order coupled iterations with a relative epsilon based on an estimate of the largest eigenvalue.
- Choice of precision for preconditioner accumulation and root inverse computation.
- Ability to cache split parameters.
- Merging of small dimensions.
- [EXPERIMENTAL] Option to (approximately) correct the eigenvalues/run Adam in the eigenbasis of Shampoo's preconditioner (SOAP) [2,6,7].
We have tested this implementation with the following versions:
- PyTorch >= 2.5;
- Python >= 3.10;
- CUDA 11.3-11.4; 12.2+;
Note: We have observed known instabilities with the `torch.linalg.eigh` operator on CUDA 11.6-12.1, specifically for low-rank matrices, which may appear when using a small `start_preconditioning_step`. Please avoid these versions of CUDA if possible. See: pytorch/pytorch#94772.
Given a learning rate schedule for your previous base optimizer, we can replace the optimizer with Shampoo and "graft" from the learning rate schedule of the base method. Alternatively, you can consider replacing Adam(W) by eigenvalue-corrected Shampoo (SOAP).
A few notes on hyperparameters:
- Notice that Shampoo contains some new hyperparameters (`max_preconditioner_dim` and `precondition_frequency`) that are important for performance. We describe how to tune these below in the section on Hyperparameter Tuning.
- Here, `betas` refer to the hyperparameters used for the exponential moving average of the gradients and Shampoo preconditioners, while `grafting_beta2` corresponds to the `beta2` used specifically for exponential moving averaging of the grafted method. This is similar for `epsilon` and `grafting_epsilon`. As a first choice, we recommend setting `betas` equal to the previous `betas`, additionally setting `grafting_beta2` equal to `betas[1]`, and setting `epsilon = 1e-12` and `grafting_epsilon` equal to the previous `epsilon`.
- We also distinguish between `beta1` and `momentum`. `beta1` corresponds to the EMA of the gradients (or gradient filtering), while `momentum` corresponds to the SGD momentum formula applied to the search direction.
- We allow for decoupled and coupled weight decay. Setting `use_decoupled_weight_decay=True` enables AdamW-style weight decay, while `use_decoupled_weight_decay=False` corresponds to the standard L2-regularization-style weight decay.
- When setting `preconditioner_config` to an instance of `EigenvalueCorrectedShampooPreconditionerConfig` (see Example 5), there is typically no need to use learning rate grafting from Adam (`grafting_config=None`), and, when available, Adam's optimal `lr`, `betas`, and `weight_decay` should be a good starting point for further tuning. However, the case of `beta2=1.0`, i.e. an AdaGrad-like accumulation, has not been explored yet. Also, in settings where Shampoo would usually graft its learning rate from SGD, grafting might still be beneficial.
Example 1: SGD with Momentum
If we previously used the optimizer:
import torch
from torch.optim import SGD
model = instantiate_model()
optimizer = SGD(
model.parameters(),
lr=0.01,
momentum=0.9,
weight_decay=1e-05,
)
we would instead use:
import torch
from distributed_shampoo import DistributedShampoo, SGDGraftingConfig
model = instantiate_model()
optimizer = DistributedShampoo(
model.parameters(),
lr=0.001,
betas=(0., 0.999),
epsilon=1e-12,
momentum=0.9,
weight_decay=1e-05,
max_preconditioner_dim=8192,
precondition_frequency=100,
grafting_config=SGDGraftingConfig(),
)
Example 2: Adam
If we previously used the optimizer:
import torch
from torch.optim import Adam
model = instantiate_model()
optimizer = Adam(
model.parameters(),
lr=0.001,
betas=(0.9, 0.999),
eps=1e-08,
weight_decay=1e-05,
)
we would instead use:
import torch
from distributed_shampoo import AdamGraftingConfig, DistributedShampoo
model = instantiate_model()
optimizer = DistributedShampoo(
model.parameters(),
lr=0.001,
betas=(0.9, 0.999),
epsilon=1e-12,
weight_decay=1e-05,
max_preconditioner_dim=8192,
precondition_frequency=100,
use_decoupled_weight_decay=False,
grafting_config=AdamGraftingConfig(
beta2=0.999,
epsilon=1e-08,
),
)
Example 3: Adagrad
If we previously used the optimizer:
import torch
from torch.optim import Adagrad
model = instantiate_model()
optimizer = Adagrad(
model.parameters(),
lr=0.01,
eps=1e-10,
weight_decay=1e-05,
)
we would instead use:
import torch
from distributed_shampoo import AdaGradGraftingConfig, DistributedShampoo
model = instantiate_model()
optimizer = DistributedShampoo(
model.parameters(),
lr=0.01,
betas=(0., 1.0),
epsilon=1e-12,
weight_decay=1e-05,
max_preconditioner_dim=8192,
precondition_frequency=100,
use_decoupled_weight_decay=False,
grafting_config=AdaGradGraftingConfig(
epsilon=1e-10,
),
)
Example 4: AdamW
If we previously used the optimizer:
import torch
from torch.optim import AdamW
model = instantiate_model()
optimizer = AdamW(
model.parameters(),
lr=0.001,
betas=(0.9, 0.999),
eps=1e-08,
weight_decay=1e-05,
)
we would instead use:
import torch
from distributed_shampoo import AdamGraftingConfig, DistributedShampoo
model = instantiate_model()
optimizer = DistributedShampoo(
model.parameters(),
lr=0.001,
betas=(0.9, 0.999),
epsilon=1e-12,
weight_decay=1e-05,
max_preconditioner_dim=8192,
precondition_frequency=100,
use_decoupled_weight_decay=True,
grafting_config=AdamGraftingConfig(
beta2=0.999,
epsilon=1e-08,
),
)
Example 5: Eigenvalue-Corrected Shampoo (SOAP)
If we previously used the optimizer:
import torch
from torch.optim import AdamW
model = instantiate_model()
optimizer = AdamW(
model.parameters(),
lr=0.001,
betas=(0.9, 0.999),
eps=1e-08,
weight_decay=1e-05,
)
we would instead use:
import torch
from distributed_shampoo import (
DistributedShampoo,
DefaultEigenvalueCorrectedShampooConfig,
)
model = instantiate_model()
optimizer = DistributedShampoo(
model.parameters(),
lr=0.001,
betas=(0.9, 0.999),
epsilon=1e-12,
weight_decay=1e-05,
max_preconditioner_dim=8192,
precondition_frequency=100,
use_decoupled_weight_decay=True,
# This can also be set to `DefaultSOAPConfig` which uses QR decompositions, hence is
# less expensive and might thereby allow for a smaller `precondition_frequency`.
preconditioner_config=DefaultEigenvalueCorrectedShampooConfig,
)
Our implementation offers specialized compatibility and performance optimizations for different distributed training paradigms, including Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (including FSDP and per-parameter FSDP, a.k.a. FSDPv2) training. Note that Distributed Shampoo will work out of the box for DDP training, but not for FSDP training.
In order to support fast DDP training, our implementation offers ZeRO-1 support, which distributes the computation and memory (via `DTensor`) in order to lower both Shampoo's memory requirements and its per-iteration wall-clock time at the cost of additional (`AllGather`) communication. Our DDP Shampoo implementation can either: (1) communicate the updated parameters; or (2) communicate the parameter updates.
We support:
- Quantized (or low-precision) communications using BF16, FP16, or FP32 communications.
- Specification of the number of trainers within each process group to distribute compute and memory. This trades off the amount of communication and compute each trainer is responsible for.
- Option to communicate updated parameters.
To use DDP Shampoo, simply configure the `distributed_config` as `DDPShampooConfig`:
import os
import torch
import torch.distributed as dist
from distributed_shampoo import (
AdamGraftingConfig,
CommunicationDType,
DDPShampooConfig,
DistributedShampoo,
)
from torch import nn
LOCAL_RANK = int(os.environ["LOCAL_RANK"])
WORLD_RANK = int(os.environ["RANK"])
WORLD_SIZE = int(os.environ["WORLD_SIZE"])
dist.init_process_group(
backend=args.backend,
init_method="env://",
rank=WORLD_RANK,
world_size=WORLD_SIZE,
)
device = torch.device("cuda:{}".format(LOCAL_RANK))
torch.cuda.set_device(LOCAL_RANK)
model = instantiate_model().to(device)
model = nn.parallel.DistributedDataParallel(
model, device_ids=[LOCAL_RANK], output_device=LOCAL_RANK
)
optimizer = DistributedShampoo(
model.parameters(),
lr=0.001,
betas=(0.9, 0.999),
epsilon=1e-12,
weight_decay=1e-05,
max_preconditioner_dim=8192,
precondition_frequency=100,
use_decoupled_weight_decay=True,
grafting_config=AdamGraftingConfig(
beta2=0.999,
epsilon=1e-12,
),
distributed_config=DDPShampooConfig(
communication_dtype=CommunicationDType.FP32,
num_trainers_per_group=8,
communicate_params=False,
),
)
Please see `ddp_cifar10_example.py` as an example.
FSDP training creates flattened parameters by flattening and concatenating all parameters within each FSDP module. By default, this removes the per-parameter tensor shape information that Shampoo aims to exploit. Therefore, in order to support FSDP training, we have to use additional FSDP metadata to recover valid tensor blocks of the original parameters.
Note that we only support PyTorch FSDP with the `use_orig_params=True` option.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from distributed_shampoo import (
AdamGraftingConfig,
compile_fsdp_parameter_metadata,
DistributedShampoo,
FSDPShampooConfig,
)
LOCAL_RANK = int(os.environ["LOCAL_RANK"])
WORLD_RANK = int(os.environ["RANK"])
WORLD_SIZE = int(os.environ["WORLD_SIZE"])
dist.init_process_group(
backend=args.backend,
init_method="env://",
rank=WORLD_RANK,
world_size=WORLD_SIZE,
)
device = torch.device("cuda:{}".format(LOCAL_RANK))
model = instantiate_model().to(device)
model = FSDP(model, use_orig_params=True)
optimizer = DistributedShampoo(
model.parameters(),
lr=0.001,
betas=(0.9, 0.999),
epsilon=1e-12,
weight_decay=1e-05,
max_preconditioner_dim=8192,
precondition_frequency=100,
use_decoupled_weight_decay=True,
grafting_config=AdamGraftingConfig(
beta2=0.999,
epsilon=1e-12,
),
distributed_config=FSDPShampooConfig(
param_to_metadata=compile_fsdp_parameter_metadata(model),
),
)
Please see `fsdp_cifar10_example.py` as an example.
Per-parameter-sharding FSDP, also known as FSDPv2, is the new fully sharded data parallelism implementation, which uses `DTensor`-based dim-0 per-parameter sharding for a simpler sharding representation compared to FSDP1's flat-parameter sharding, while preserving similar throughput performance. In short, FSDPv2 chunks each parameter on dim-0 across the data parallel workers (using `torch.chunk(dim=0)`). To support Shampoo with FSDPv2, we implement a new distributor that creates Shampoo preconditioner tensor blocks based on the rank-local tensors of the dim-0-sharded `DTensor` parameters. One simplification FSDPv2 brings to Shampoo is that tensor blocks are local to each rank, so we do not need the tensor block recovery algorithm implemented for FSDPv1 (where parameters are flattened and then sharded).
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard
from distributed_shampoo import (
AdamGraftingConfig,
DistributedShampoo,
FullyShardShampooConfig,
)
LOCAL_RANK = int(os.environ["LOCAL_RANK"])
WORLD_RANK = int(os.environ["RANK"])
WORLD_SIZE = int(os.environ["WORLD_SIZE"])
dist.init_process_group(
backend=args.backend,
init_method="env://",
rank=WORLD_RANK,
world_size=WORLD_SIZE,
)
device = torch.device("cuda:{}".format(LOCAL_RANK))
model = instantiate_model().to(device)
model = fully_shard(model)
optimizer = DistributedShampoo(
model.parameters(),
lr=0.001,
betas=(0.9, 0.999),
epsilon=1e-12,
weight_decay=1e-05,
max_preconditioner_dim=8192,
precondition_frequency=100,
use_decoupled_weight_decay=True,
grafting_config=AdamGraftingConfig(
beta2=0.999,
epsilon=1e-12,
),
distributed_config=FullyShardShampooConfig(),
)
Please see `fully_shard_cifar10_example.py` as an example.
Note that we only support PyTorch HSDP with `sharding_strategy=ShardingStrategy.HYBRID_SHARD` and the `use_orig_params=True` option.
import os
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy
from distributed_shampoo import (
AdamGraftingConfig,
compile_fsdp_parameter_metadata,
DistributedShampoo,
HSDPShampooConfig,
)
LOCAL_RANK = int(os.environ["LOCAL_RANK"])
WORLD_RANK = int(os.environ["RANK"])
WORLD_SIZE = int(os.environ["WORLD_SIZE"])
dist.init_process_group(
backend=args.backend,
init_method="env://",
rank=WORLD_RANK,
world_size=WORLD_SIZE,
)
device = torch.device("cuda:{}".format(LOCAL_RANK))
# Instantiate device mesh for HSDP Shampoo.
# Assuming 8 GPUs, the device mesh will be initialized as a 2 x 4 mesh:
#   [[0, 1, 2, 3], [4, 5, 6, 7]]
# With HYBRID_SHARD, the model is replicated across the first mesh dimension
# (2 replicas) and sharded across the second mesh dimension (4 GPUs per replica).
device_mesh = init_device_mesh("cuda", (2, 4))
model = instantiate_model().to(device)
model = FSDP(model, device_mesh=device_mesh, sharding_strategy=ShardingStrategy.HYBRID_SHARD, use_orig_params=True)
optimizer = DistributedShampoo(
model.parameters(),
lr=0.001,
betas=(0.9, 0.999),
epsilon=1e-12,
weight_decay=1e-05,
max_preconditioner_dim=8192,
precondition_frequency=100,
use_decoupled_weight_decay=True,
grafting_config=AdamGraftingConfig(
beta2=0.999,
epsilon=1e-12,
),
distributed_config=HSDPShampooConfig(
param_to_metadata=compile_fsdp_parameter_metadata(model),
device_mesh=device_mesh,
),
)
Please see `hsdp_cifar10_example.py` as an example.
To checkpoint Distributed Shampoo, we have to use the `torch.distributed.checkpoint` solution with `DTensor`. Note that we do not currently support the standard PyTorch checkpointing solution because it cannot handle storing process groups or `DTensor` by default. We have therefore disabled `state_dict` and `load_state_dict` and instead rely on `distributed_state_dict` and `load_distributed_state_dict`.
Distributed checkpointing requires a fully-qualified name (FQN) mapping for each parameter, unlike the identifiers used in `torch.optim.Optimizer`. The easiest way to handle this requirement is to use the model's `named_parameters()` function and pass this as the `key_to_param` argument of `distributed_state_dict` and `load_distributed_state_dict`.
Given a `CHECKPOINT_DIR`, to store the checkpoint:
import torch.distributed.checkpoint as dist_checkpoint
state_dict = {
"model": model.state_dict(),
"optim": optimizer.distributed_state_dict(key_to_param=model.named_parameters()),
}
dist_checkpoint.save_state_dict(
state_dict=state_dict,
storage_writer=dist_checkpoint.FileSystemWriter(CHECKPOINT_DIR),
)
To load the checkpoint:
dist_checkpoint.load_state_dict(
state_dict=state_dict,
storage_reader=dist_checkpoint.FileSystemReader(CHECKPOINT_DIR),
)
model.load_state_dict(state_dict["model"])
optimizer.load_distributed_state_dict(state_dict["optim"], key_to_param=model.named_parameters())
You can also refer to `ddp_cifar10_example.py` as an example.
We want to tune Shampoo to balance model quality, memory, and efficiency/performance by applying approximations to a "pure" version of Shampoo.
This requires adjusting the hyperparameters `max_preconditioner_dim`, `precondition_frequency`, and `start_preconditioning_step`. The general approach is to start with a version of Shampoo that is as close to "pure" as possible, then incorporate approximations to ensure fast performance. A pure version of Shampoo would set `max_preconditioner_dim = 8192` and `precondition_frequency = 1`.
With the inclusion of learning rate grafting, we can extract a good learning rate schedule from your existing scheduler. Other techniques for preventing divergence (e.g., gradient clipping) may also be removed.
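For concreteness, a near-"pure" starting configuration might look like the following sketch. This is only a baseline to relax later, assuming SGD grafting and the hyperparameter values from Example 1; `instantiate_model` is a placeholder as in the earlier examples:

```python
from distributed_shampoo import DistributedShampoo, SGDGraftingConfig

model = instantiate_model()  # placeholder, as in the earlier examples

optimizer = DistributedShampoo(
    model.parameters(),
    lr=0.01,
    betas=(0., 0.999),
    momentum=0.9,
    weight_decay=1e-05,
    max_preconditioner_dim=8192,    # large block size ("pure" Shampoo)
    precondition_frequency=1,       # precondition at every step
    start_preconditioning_step=-1,  # use Shampoo from the very first step
    grafting_config=SGDGraftingConfig(),
)
```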
- Start with a reasonable `max_preconditioner_dim` (i.e., 8192) and reduce the block size as necessary for memory and performance.
  - The maximum effective value of this hyperparameter is the maximum over all layers of the product of each layer's dimensions. For example, if we have a model with three layers where the first layer is 5x5x3x6, the second layer is 3x3x3x8, and the third layer is 216x5, then the products of the first, second, and third layers' dimensions are 5x5x3x6=450, 3x3x3x8=216, and 216x5=1080, respectively. In this example, 1080 is the maximum effective value of this hyperparameter, and any value greater than 1080 will perform the same as 1080.
  - The higher this value is, the better the model quality we expect.
  - There is a sweet spot in terms of performance: if the value is too small, the algorithm will slow down due to kernel latency. On the other hand, using too large a value leads to slow matrix computations (i.e., matrix root inverses), which scale as $O(n^3)$ if $n$ is the dimension of the matrix, as well as poor load balancing. In our experience, using a `max_preconditioner_dim` between 1024 and 8192 is ideal for performance.
  - Memory varies depending on the order of the tensor. For vectors, increasing `max_preconditioner_dim` leads to increased memory costs, but for 3rd-order tensors (or higher), increasing `max_preconditioner_dim` leads to decreased memory costs. Blocked matrices yield a fixed memory cost regardless of `max_preconditioner_dim`.
  - For efficiency purposes, it is best to set this value as a multiple of 2.
  - The following is an example of setting `max_preconditioner_dim = 4096` with SGD grafting:

        optimizer = DistributedShampoo(
            model.parameters(),
            lr=0.01,
            betas=(0., 0.999),
            momentum=0.9,
            weight_decay=0.01,
            max_preconditioner_dim=4096,
            grafting_config=SGDGraftingConfig(),
        )
- Use the smallest `precondition_frequency` (i.e., 1) and increase the precondition frequency.
  - This hyperparameter determines how frequently the preconditioner is computed. The smaller the value, the slower Shampoo becomes, but with faster convergence. The goal is to find a value that balances convergence and speed.
  - It is normal to eventually set this hyperparameter on the order of hundreds or thousands. This is based primarily on the size of the network and the effective ratio between the cost of a single forward-backward pass + standard optimizer step and the cost of computing a series of matrix root inverses.
  - In practice, we have found that an upper bound for `precondition_frequency` is on the order of thousands. This approach will offer diminishing performance gains if the bottleneck is due to preconditioning, which is performed at every iteration.
  - The following is an example of setting `precondition_frequency = 100`:

        optimizer = DistributedShampoo(
            model.parameters(),
            lr=0.01,
            betas=(0., 0.999),
            momentum=0.9,
            weight_decay=0.01,
            precondition_frequency=100,
            grafting_config=SGDGraftingConfig(),
        )
- Set `start_preconditioning_step` to be consistent with the precondition frequency.
  - This hyperparameter determines when to start using Shampoo. Prior to this step, the optimizer will use the grafted method. This value should generally be set larger than or equal to `precondition_frequency`, except when the precondition frequency is 1. By default, `start_preconditioning_step` is set equal to `precondition_frequency`.
  - If `precondition_frequency = 1`, then set `start_preconditioning_step = -1` in order to use Shampoo from the start.
  - The following is an example of setting `start_preconditioning_step = 300`:

        optimizer = DistributedShampoo(
            model.parameters(),
            lr=0.01,
            betas=(0., 0.999),
            momentum=0.9,
            weight_decay=0.01,
            start_preconditioning_step=300,
            grafting_config=SGDGraftingConfig(),
        )
- To tune for better model quality, one can tune (a combined configuration sketch follows this list):
  - Learning Rate (`lr`): One can change the learning rate schedule, and potentially use a larger learning rate.
  - Nesterov Momentum (`momentum`, `use_nesterov`): In some cases, we have found that using Nesterov momentum substantially improves model quality. To use this, we recommend setting `momentum` to 0.5 or 0.9 and setting `use_nesterov` to True. The learning rate needs to be re-tuned with respect to this hyperparameter.
  - Epsilon Regularization (`epsilon`): One should typically search for a value in $\{10^{-12}, 10^{-11}, \ldots, 10^{-2}, 10^{-1}\}$.
  - Exponential Moving Average Parameters (`betas`): One can tune the `betas = (beta1, beta2)` parameters as is typical for Adam(W).
  - Inverse Root Override and Multiplier (`inv_root_override`, `exponent_multiplier`): In general, we have found that using `inv_root_override = 2` or `exponent_multiplier = 1.82` works well in practice, particularly for models dominated by fully-connected layers, such as ranking and recommendation models.
  - Preconditioner Data Type (`preconditioner_dtype`): For certain models, it is necessary to use higher precision to accumulate the Shampoo factor matrices and compute their eigendecompositions in order to obtain sufficient numerical accuracy. In those cases, one can specify this as `torch.float64`. (Note that this will use more memory.)
  - MTML Task Weights: Task weights may need to be re-tuned, as Distributed Shampoo will better exploit certain imbalances between different task losses.
- If enabling DDP Shampoo, you can tune for performance (see the sketch after this list):
  - Process Group Size (`num_trainers_per_group`): For large-scale distributed jobs, this hyperparameter allows us to trade off computational and communication costs. Assuming the number of GPUs per node is 8, one should search for a value in $\{8, 16, 32, 64\}$. This hyperparameter has no impact on model quality.
  - Quantized Communications (`communication_dtype`): One can enable quantized communications by setting the `communication_dtype`. We have found that using `CommunicationDType.FP16` works well in practice (with `communicate_params = False`).
  - Communicate Updated Parameters (`communicate_params`): If one does not enable quantized communications, one can possibly obtain better performance by communicating the updated parameters by setting this to `True`.
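To make the two items above concrete, the following hedged sketch combines the model-quality knobs (Nesterov momentum, epsilon, higher-precision preconditioners) with the DDP performance settings suggested above. The specific values are illustrative starting points rather than recommendations for any particular model, and `model` is assumed to be wrapped in DDP as in the earlier example:

```python
import torch
from distributed_shampoo import (
    AdamGraftingConfig,
    CommunicationDType,
    DDPShampooConfig,
    DistributedShampoo,
)

optimizer = DistributedShampoo(
    model.parameters(),
    lr=0.001,                            # re-tune after enabling Nesterov momentum
    betas=(0.9, 0.999),
    epsilon=1e-12,                       # search over {1e-12, ..., 1e-1} if quality is sensitive to it
    momentum=0.9,                        # SGD-style momentum on the search direction
    use_nesterov=True,                   # Nesterov acceleration
    weight_decay=1e-05,
    max_preconditioner_dim=8192,
    precondition_frequency=100,
    preconditioner_dtype=torch.float64,  # higher-precision factor matrices (uses more memory)
    use_decoupled_weight_decay=True,
    grafting_config=AdamGraftingConfig(beta2=0.999, epsilon=1e-08),
    distributed_config=DDPShampooConfig(
        communication_dtype=CommunicationDType.FP16,  # quantized communications
        num_trainers_per_group=8,
        communicate_params=False,
    ),
)
```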
When gradients are `NaN/Inf`, most optimizers still proceed smoothly and modify model weights with those `NaN/Inf` values, but Shampoo reacts with error messages like "Encountered nan values ...".
When encountering those errors, here are some things you could try:
- Decrease the learning rate.
- Adjust the learning rate scheduler.
- Increase `start_preconditioning_step`.
- Consider applying gradient clipping (a sketch of the last two suggestions follows this list).
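For instance, a hedged sketch combining the last two suggestions, assuming a standard training loop where `model`, `dataloader`, and `compute_loss` stand in for your model, data, and loss computation:

```python
import torch
from distributed_shampoo import DistributedShampoo, SGDGraftingConfig

optimizer = DistributedShampoo(
    model.parameters(),
    lr=0.01,
    betas=(0., 0.999),
    momentum=0.9,
    start_preconditioning_step=1000,  # delay Shampoo preconditioning further into training
    grafting_config=SGDGraftingConfig(),
)

for batch in dataloader:
    optimizer.zero_grad()
    loss = compute_loss(model, batch)  # placeholder for your loss computation
    loss.backward()
    # Clip gradients before the optimizer step to tame NaN/Inf-prone spikes.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```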
If you use PyTorch Distributed Shampoo in your work, please use the following BibTeX entry.
@misc{shi2023pytorchshampoo,
title={A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale},
author={Hao-Jun Michael Shi and Tsung-Hsien Lee and Shintaro Iwasaki and Jose Gallego-Posada and Zhijing Li and Kaushik Rangadurai and Dheevatsa Mudigere and Michael Rabbat},
howpublished={\url{https://github.com/facebookresearch/optimizers/tree/main/distributed_shampoo}},
year ={2023},
eprint={2309.06497},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
1. Shampoo: Preconditioned Stochastic Tensor Optimization. Vineet Gupta, Tomer Koren, and Yoram Singer. International Conference on Machine Learning, 2018.
2. Scalable Second-Order Optimization for Deep Learning. Rohan Anil, Vineet Gupta, Tomer Koren, Kevin Regan, and Yoram Singer. Tech Report, 2021.
3. Learning Rate Grafting: Transferability of Optimizer Tuning. Naman Agarwal, Rohan Anil, Elad Hazan, Tomer Koren, and Cyril Zhang. Tech Report, 2021.
4. Functions of Matrices: Theory and Computation. Nicholas J. Higham. SIAM, 2008.
5. A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale. Hao-Jun Michael Shi, Tsung-Hsien Lee, Shintaro Iwasaki, Jose Gallego-Posada, Zhijing Li, Kaushik Rangadurai, Dheevatsa Mudigere, and Michael Rabbat. Tech Report, 2023.
6. Fast Approximate Natural Gradient Descent in a Kronecker-factored Eigenbasis. Thomas George, César Laurent, Xavier Bouthillier, Nicolas Ballas, and Pascal Vincent. NeurIPS, 2018.
7. SOAP: Improving and Stabilizing Shampoo using Adam. Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade. Tech Report, 2024.