Skip to content

OptimKit: A blissfully ignorant Julia package for gradient optimization

License

Notifications You must be signed in to change notification settings

Jutho/OptimKit.jl

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

70 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

OptimKit.jl

A blisfully ignorant Julia package for gradient optimization.


A package for gradient optimization that tries to know or assume as little as possible about your optimization problem. So far gradient descent, conjugate gradient and the L-BFGS method have been implemented. One starts by defining the algorithm that one wants to use by creating an instance algorithm of either

  • GradientDescent(; params...)
  • ConjugateGradient(; flavor = ..., params...)
  • LBFGS(m::Int; params...)

All of them take a number of parameters, namely

  • maxiter: number of iterations (defaults to typemax(Int), so essentially unbounded)
  • gradtol: convergence criterion, stop when the 2-norm of the gradient is smaller than gradtol (default value = gradtol = 1e-8)
  • linesearch: which linesearch algorithm to be used, currently there is only one choice, namely HagerZhangLineSearch(;...) (see below).
  • verbosity: Verbosity level, the amount of information that will be printed, either <=0 (default value) for no information, 1 for a single STDOUT output at the end of the algorithm, or >=2 for a one-line summary after every iteration step.

Furthermore, LBFGS takes a single positional argument m::Int, the number of previous steps to take into account in the construction of the approximate (inverse) Hessian. It also takes a keyword argument acceptfirst, which determines whether the first guess for alpha in the line search can be accepted. The default value is true, which typically leads to less function evaluations (otherwise at least two function evaluations per iteration are required), despite a more erratic convergence of the gradient norm.

ConjugateGradient also has one additional keyword argument, flavor, which can be any of the following:

  • HagerZhang(; η::Real = 0.4, θ::Real = 1.0): default
  • HestenesStiefel()
  • PolakRibiere()
  • DaiYuan()

The linesearch argument currently takes the value

HagerZhangLineSearch(; δ::Real = .1, # parameter for first Wolfe condition
                        σ::Real = .9, # paremeter for second Wolfe condition
                        ϵ::Real = 1e-6, # parameter for approximate Wolfe condition, accept fluctation of ϵ on the function value
                        θ::Real = 1/2, # used in bisection
                        γ::Real = 2/3, # determines when a bisection step is performed
                        ρ::Real = 5., # expansion parameter for initial bracket interval
                        verbosity::Int = 0)

The linesearch has an independent verbosity flag to control the output of information being printed to STDOUT, but by default its value is equal to verbosity-2 of the optimization algorithm. So ConjugateGradient(; verbosity = 3) is equivalent to having verbosity=1 in the linesearch algorithm.

This optimization algorithm can then be applied by calling

x, fx, gx, numfg, normgradhistory = optimize(fg, x₀, algorithm; kwargs...)

Here, the optimization problem (objective function) is specified as a function fval, gval = fg(x) that returns both the function value and its gradient at a given point x. The function value fval is assumed to be a real number of some type T<:Real. Both x and the gradient gval can be of any type, including tuples and named tuples. As a user, you should then also specify the following functions via keyword arguments

  • Pη = precondition(x, η): apply a preconditioner to the current gradient or tangent vector η at the position x
  • x, f, g = finalize!(x, f, g, numiter): after every step (i.e. upon completion of the linesearch), allows to modify the position and corresponding function value or gradient, or to do other things like printing out statistics. Note that this step happens before computing new directions in Conjugate Gradient and LBFGS, so if f and g are modified, this is at the user's own risk (e.g. Wolfe conditions might no longer be satisfied, ...).
  • x, ξ = retract(x₀, η, α): take a step in direction η (same type as gradients) starting from point x₀ and with step length α, returns the new x(α) = Rₓ₀(α * η) and the local tangent to this path at that position, i.e. ξ = D Rₓ₀(α * η)[η] (informally, ξ = dx(α) / dα).
  • s = inner(x, ξ1, ξ2): compute the inner product between two gradients or similar objects at position x. The x dependence is useful for optimization on manifolds, where this function represents the metric; in particular it should be symmetric inner(x, ξ1, ξ2) == inner(x, ξ2, ξ1) and real-valued.
  • η = scale!(η, β): compute the equivalent of η*β, possibly in place, but we always use the return value. This is mostly used as scale!(g, -1) to compute the negative gradient as part of the step direction.
  • η = add!(η, ξ, β): compute the equivalent of η + ξ*β, possibly overwriting η in place, but we always use the return value
  • ξ = transport!(ξ, x, η, α, x′): transport tangent vector ξ along the retraction of x in the direction η (same type as a gradient) with step length α, can be in place but the return value is used. Transport also receives x′ = retract(x, η, α)[1] as final argument, which has been computed before and can contain useful data that does not need to be recomputed

Note that the gradient g of the objective function should satisfy d f(x(α)) / d α = inner(x(α), ξ(α), g(x(α))). There is a utility function optimtest to facilitate testing this compatibility relation for your given choice of fg, retract and inner.

The GradientDescent algorithm only requires the first three, ConjugateGradient and LBFGS require all five functions. Default values are provided to make the optimization algorithms work with standard optimization problems where x is a vector or Array, i.e. they are given by

_retract(x, η, α) = (x + α * η, η)
_inner(x, ξ1, ξ2) = ξ1 === ξ2 ? LinearAlgebra.norm(ξ1)^2 : LinearAlgebra.dot(ξ1, ξ2)
_transport!(ξ, x, η, α) = ξ
_add!(η, ξ, β) = LinearAlgebra.axpy!(β, ξ, η)
_scale!(η, β) = LinearAlgebra.rmul!(η, β)

Finally, there is one keyword argument isometrictransport::Bool to indicate whether the transport of vectors preserves their inner product, i.e. whether

inner(x, ξ1, ξ2) == inner(retract(x, η, α), transport!(ξ1, x, η, α), transport!(ξ2, x, η, α))

The default value is false, unless the default transport (_transport!) and inner product (_inner) are used. However, convergence of conjugate gradient and LBFGS is more robust (or theoretically proven) in the case of isometric transport. Note that isometric transport might not be the same as retraction transport, and thus, in particular ξ != transport(η, x, η, α, x′). However, when isometric transport is provided, we complement it with an isometric rotation such that ξ = D Rₓ₀(α * η)[η] and transport(η, x, η, α) are parallel in the case of LBFGS. This is the so-called locking condition of Huang, Gallivan and Absil, and the approach is described in section 4.1.