Advanced Usage
This page details the configuration options available in the KFACPreconditioner.
Warning: Always check the docstring for the most up-to-date information.
The KFACPreconditioner must be configured with a model that is a torch.nn.Module. The preconditioner will recursively scan the modules of the model for any module types supported by K-FAC. The KNOWN_MODULES are enumerated here. Any module not registered by K-FAC will be completely ignored, and your optimizer will optimize it normally.
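For example, a minimal setup might look like the sketch below (the model is a placeholder; only the registration behavior is illustrated).
import torch

from kfac.preconditioner import KFACPreconditioner

# Both Linear layers are known module types and will be registered by
# K-FAC; the ReLU is not a known module and is simply ignored.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
)
preconditioner = KFACPreconditioner(model)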
Often, you may want to exclude certain modules from preconditioning. A common example is embedding layers in language models, which can be too large and cause out-of-memory problems. Modules can be skipped by passing a list of regex patterns which will be matched against each module's name and class name.
skip_layers = [
    # Skip torch.nn.Embedding layers
    "Embedding",
    # Skip modules named "encoder_head"
    "encoder_head",
]
KFACPreconditioner(model, skip_layers=skip_layers)
K-FAC supports two preconditioning methods referred to as the inverse and eigen decomposition methods. We suggest the default eigen decomposition method (see our SC 20 paper), but this can be changed.
from kfac.preconditioner import KFACPreconditioner
from kfac.enums import ComputeMethod
preconditioner = KFACPreconditioner(model, compute_method=ComputeMethod.INVERSE)
The KFACPreconditioner takes a number of hyperparameters. To use K-FAC efficiently, the most important parameters are factor_update_steps and inv_update_steps. These parameters control the number of calls to KFACPreconditioner.step() between updates of the factors and of the inverses/eigen decompositions, respectively. In steps that are not factor update steps, no intermediate data is accumulated. In steps that are not inverse update steps, the gradients are still preconditioned, but using the older inverses/eigen decompositions from previous steps.
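For context, a typical training loop calls the preconditioner between the backward pass and the optimizer step. This is a sketch; optimizer, criterion, and loader are placeholders.
for data, target in loader:
    optimizer.zero_grad()
    loss = criterion(model(data), target)
    loss.backward()
    # Precondition the gradients in-place. Factors and inverses/eigen
    # decompositions are only recomputed every factor_update_steps and
    # inv_update_steps calls, respectively.
    preconditioner.step()
    optimizer.step()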
The lr parameter should always be set to your current learning rate. A few other parameters can also impact training: damping, factor_decay, and kl_clip. I suggest reading the papers to learn more about these.
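As a sketch, a configuration using these hyperparameters might look like the following (the values are illustrative, not recommendations; check the docstring for the actual defaults).
preconditioner = KFACPreconditioner(
    model,
    factor_update_steps=10,   # accumulate/update factors every 10 steps
    inv_update_steps=100,     # recompute inverses/eigen decompositions every 100 steps
    lr=0.1,                   # keep in sync with the optimizer's learning rate
    damping=0.003,
    factor_decay=0.95,
    kl_clip=0.001,
)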
There are a number of configuration options for the distribution strategy. The distribution strategy is primarily controlled by the gradient_worker_fraction parameter, which controls the fraction of workers assigned as gradient workers for each layer (the remaining workers are gradient receivers). The value is in the range 1/world_size <= gradient_worker_fraction <= 1. Larger values reduce communication frequency at the cost of caching more data locally, while smaller values reduce memory usage at the cost of more frequent communication. The kfac.enums.DistributedStrategy enum provides aliases for the common values: COMM_OPT, MEM_OPT, and HYBRID_OPT.
COMM_OPT (gradient_worker_fraction=1) is the default communication method and the design introduced in our SC 20 paper. COMM_OPT is designed to reduce communication frequency in non-K-FAC update steps and increase maximum worker utilization.
MEM_OPT (gradient_worker_fraction=1/world_size) is based on the communication strategy of Osawa et al. (2019) and is designed to reduce memory usage at the cost of increased communication frequency.
HYBRID_OPT (gradient_worker_fraction=0.5) combines features of COMM_OPT and MEM_OPT such that some fraction of workers simultaneously compute the preconditioned gradient for a layer and broadcast the result to the subset of remaining workers that are not responsible for computing it. This results in memory usage that is greater than MEM_OPT but less than COMM_OPT.
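For illustration, selecting a MEM_OPT-style placement might look like the sketch below. The keyword follows the parameter name used on this page; the exact argument name and accepted values may differ, so check the docstring.
import torch.distributed as dist

# 1.0 corresponds to COMM_OPT, 1/world_size to MEM_OPT, and 0.5 to HYBRID_OPT.
world_size = dist.get_world_size()
preconditioner = KFACPreconditioner(
    model,
    gradient_worker_fraction=1 / world_size,  # MEM_OPT-style placement
)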
K-FAC will try to optimize the placement of inverse/eigen decomposition computations across workers. The optimization criterion is determined by the assignment_strategy parameter. The default, kfac.enums.AssignmentStrategy.COMPUTE, optimizes the placement to minimize the estimated makespan of the computation. kfac.enums.AssignmentStrategy.MEMORY optimizes the placement to spread the memory consumption across workers as equally as possible.
The colocate_factors flag (defaults to True) controls whether the A and G inverses/eigen decompositions for a single layer should be computed on the same worker. Colocation is typically recommended unless the number of registered K-FAC layers is more than twice the number of workers/GPUs. This feature must be enabled when the gradient worker fraction corresponds to the MEM_OPT strategy.
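As an illustration, disabling colocation for a model with many registered layers might look like the following sketch.
# Allow the A and G computations for a layer to land on different workers.
# Only advisable when registered layers greatly outnumber workers/GPUs.
preconditioner = KFACPreconditioner(model, colocate_factors=False)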
K-FAC will use communication buckets for factor all-reduces to optimize communication. The bucket size is controlled with the allreduce_bucket_cap_mb parameter (defaults to 25.0).
The compute_eigenvalue_outer_product flag (defaults to True) speeds up preconditioning at the cost of using more memory.
The symmetry_aware flag (defaults to False) takes advantage of the symmetric nature of the factors and inverses to communicate only the upper triangle. This can be faster when the overhead of flattening and unflattening the matrices is smaller than the communication time saved.
By default, K-FAC stores the factors in the data type that training is performed in and stores the inverses/eigen decompositions in float32. This is because inverse/eigen decomposition computations are not stable in float16 and must be performed in float32. If you want to override the data types, use the factor_dtype and inv_dtype parameters. For more numerically stable models, setting these values to float16 (if your hardware supports it) can save memory.
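For example, a sketch that stores the factors in half precision while keeping the inverses/eigen decompositions in float32 (only reduce precision if your model and hardware tolerate it):
import torch

preconditioner = KFACPreconditioner(
    model,
    factor_dtype=torch.float16,  # store factors in half precision to save memory
    inv_dtype=torch.float32,     # keep inverses/eigen decompositions in float32 for stability
)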