Advanced Usage
This page details the configuration options available in the KFACPreconditioner.
Warning: Always check the docstring for the most up-to-date information.
The KFACPreconditioner must be configured with a model that is a torch.nn.Module. The preconditioner will recursively scan the modules of the model for any module types supported by K-FAC. The KNOWN_MODULES are enumerated here. Any module not registered by K-FAC will be completely ignored, and your optimizer will optimize it normally.
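For example, a minimal setup might look like the sketch below (the model is a placeholder; only the registration behavior is illustrated).
import torch

from kfac.preconditioner import KFACPreconditioner

# Both Linear layers are known module types and will be registered by
# K-FAC; the ReLU is not a known module and is simply ignored.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
)
preconditioner = KFACPreconditioner(model)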
Often, you may want to exclude certain modules from preconditioning. A common example is embedding layers in language models, which can be too large and cause out-of-memory problems. Modules can be skipped by passing a list of regex patterns which will be matched against each module's name and class name.
skip_layers = [
    # Skip torch.nn.Embedding layers
    "Embedding",
    # Skip modules named "encoder_head"
    "encoder_head",
]
KFACPreconditioner(model, skip_layers=skip_layers)
K-FAC supports two preconditioning methods referred to as the inverse and eigen decomposition methods. We suggest the default eigen decomposition method (see our SC 20 paper), but this can be changed.
from kfac.preconditioner import KFACPreconditioner
from kfac.enums import ComputeMethod
preconditioner = KFACPreconditioner(model, compute_method=ComputeMethod.INVERSE)
The KFACPreconditioner takes a number of hyperparameters. To use K-FAC efficiently, the most important parameters are factor_update_steps and inv_update_steps. These parameters control the number of calls to KFACPreconditioner.step() between updates of the factors and of the inverses/eigen decompositions, respectively. In steps that are not factor update steps, no intermediate data is accumulated. In steps that are not inverse update steps, the gradients are still preconditioned, but using the older inverses/eigen decompositions from previous steps.
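For context, a typical training loop calls the preconditioner between the backward pass and the optimizer step. This is a sketch; optimizer, criterion, and loader are placeholders.
for data, target in loader:
    optimizer.zero_grad()
    loss = criterion(model(data), target)
    loss.backward()
    # Precondition the gradients in-place. Factors and inverses/eigen
    # decompositions are only recomputed every factor_update_steps and
    # inv_update_steps calls, respectively.
    preconditioner.step()
    optimizer.step()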
The lr parameter should always be set to your current learning rate. A few other parameters can also impact training: damping, factor_decay, and kl_clip. I suggest reading the papers to learn more about these.
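As a sketch, a configuration using these hyperparameters might look like the following (the values are illustrative, not recommendations; check the docstring for the actual defaults).
preconditioner = KFACPreconditioner(
    model,
    factor_update_steps=10,   # accumulate/update factors every 10 steps
    inv_update_steps=100,     # recompute inverses/eigen decompositions every 100 steps
    lr=0.1,                   # keep in sync with the optimizer's learning rate
    damping=0.003,
    factor_decay=0.95,
    kl_clip=0.001,
)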
There are a number of configuration options for the distribution strategy. The distribution strategy is primarily controlled by the gradient_worker_fraction parameter, which controls the fraction of workers assigned as gradient workers for each layer (the remaining workers are gradient receivers). The value is in the range 1/world_size <= gradient_worker_fraction <= 1. Larger values reduce communication frequency at the cost of caching more data locally, while smaller values reduce memory usage at the cost of more frequent communication. The kfac.enums.DistributedStrategy enum provides aliases for the common values: COMM_OPT, MEM_OPT, and HYBRID_OPT.
COMM_OPT (gradient_worker_fraction=1) is the default communication method and the design introduced in our SC 20 paper. COMM_OPT is designed to reduce communication frequency in non-K-FAC update steps and increase maximum worker utilization.
MEM_OPT (gradient_worker_fraction=1/world_size) is based on the communication strategy of Osawa et al. (2019) and is designed to reduce memory usage at the cost of increased communication frequency.
HYBRID_OPT (gradient_worker_fraction=0.5) combines features of COMM_OPT and MEM_OPT such that some fraction of workers simultaneously compute the preconditioned gradient for a layer and broadcast the result to the subset of remaining workers that are not responsible for computing it. This results in memory usage that is greater than MEM_OPT but less than COMM_OPT.
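For illustration, selecting a MEM_OPT-style placement might look like the sketch below. The keyword follows the parameter name used on this page; the exact argument name and accepted values may differ, so check the docstring.
import torch.distributed as dist

# 1.0 corresponds to COMM_OPT, 1/world_size to MEM_OPT, and 0.5 to HYBRID_OPT.
world_size = dist.get_world_size()
preconditioner = KFACPreconditioner(
    model,
    gradient_worker_fraction=1 / world_size,  # MEM_OPT-style placement
)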
K-FAC will try to optimize the placement of inverse/eigen decomposition computations across workers. The optimization criterion is determined by the assignment_strategy parameter. The default, kfac.enums.AssignmentStrategy.COMPUTE, optimizes the placement to minimize the estimated makespan of the computation. kfac.enums.AssignmentStrategy.MEMORY optimizes the placement to spread the memory consumption across workers as equally as possible.
The colocate_factors flag (defaults to True) controls whether the A and G inverses/eigen decompositions for a single layer should be computed on the same worker. Colocation is typically recommended unless the number of registered K-FAC layers is more than twice the number of workers/GPUs. This feature must be enabled when the gradient worker fraction corresponds to the MEM_OPT strategy.
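As an illustration, disabling colocation for a model with many registered layers might look like the following sketch.
# Allow the A and G computations for a layer to land on different workers.
# Only advisable when registered layers greatly outnumber workers/GPUs.
preconditioner = KFACPreconditioner(model, colocate_factors=False)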
K-FAC will use communication buckets for factor all-reduces to optimize communication. The bucket size is controlled with the allreduce_bucket_cap_mb parameter (defaults to 25.0).
The compute_eigenvalue_outer_product flag (defaults to True) speeds up preconditioning at the cost of using more memory.
The symmetry_aware flag (defaults to False) takes advantage of the symmetric nature of the factors and inverses to communicate only the upper triangle. This can be faster when the overhead of flattening and unflattening the matrices is smaller than the communication time saved.
By default, K-FAC stores the factors in the data type that training is performed in and stores the inverses/eigen decompositions in float32. This is because inverse/eigen decomposition computations are not stable in float16 and must be performed in float32. If you want to override the data types, use the factor_dtype and inv_dtype parameters. For more numerically stable models, setting these values to float16 (if your hardware supports it) can save memory.
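For example, a sketch that stores the factors in half precision while keeping the inverses/eigen decompositions in float32 (only reduce precision if your model and hardware tolerate it):
import torch

preconditioner = KFACPreconditioner(
    model,
    factor_dtype=torch.float16,  # store factors in half precision to save memory
    inv_dtype=torch.float32,     # keep inverses/eigen decompositions in float32 for stability
)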