Nonlinear optimization

An intuitive explanation of Lagrangian duality in optimization:

Please check here for a nice geometric explanation, and here for a mathematical explanation.

Nonlinear optimization

A nonlinear optimization problem can be written as

$$\min_{x} F(x)$$

where $F(x)$ is a nonlinear function w.r.t. $x$. Since $F(x)$ can be extremely complicated, we may not be able to explicitly derive an analytical solution for $x$. Instead of an analytical solution, we usually apply iterative methods to optimize the objective, even though they may fall into a local optimum.

This post covers four iterative methods: Gradient Descent, the Newton method, the Gauss-Newton method, and Levenberg–Marquardt (LM).

Gradient Descent

Given a step size hyperparameter $\lambda$, the update rule is

$$x_{k+1} = x_k - \lambda J(x_k)$$

where $J(x_k) = \nabla F(x_k)$ is the gradient of the objective at $x_k$.

This optimization strategy (or rather its stochastic variant, SGD) has been widely adopted for updating various neural networks.
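As a minimal sketch of this update rule (the quadratic objective, step size, and iteration count below are illustrative assumptions, not part of the original post):

```python
import numpy as np

def gradient_descent(grad, x0, step=0.1, iters=100):
    """Minimize F by repeatedly stepping against its gradient J(x)."""
    x = x0
    for _ in range(iters):
        x = x - step * grad(x)
    return x

# Toy example: F(x) = 0.5 * ||A x - b||^2, so J(x) = A^T (A x - b)
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_star = gradient_descent(lambda x: A.T @ (A @ x - b), x0=np.zeros(2))
print(x_star)  # approaches the least-squares solution of A x = b
```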

Newton method

The Newton method tries to find the update step $\Delta x$ that solves

$$\min_{\Delta x} F(x + \Delta x)$$

It applies the second-order Taylor expansion

$$F(x + \Delta x) \approx F(x) + J(x)^T \Delta x + \frac{1}{2} \Delta x^T H(x) \Delta x$$

Note that the Jacobian $J(x)$ and Hessian $H(x)$ are the first and second derivatives of $F(x)$ w.r.t. $x$.

To minimize the above approximation, we calculate its gradient w.r.t. $\Delta x$ and set it equal to 0. We have

$$H(x) \Delta x = -J(x)$$

However, the second derivative $H(x)$ is usually expensive to calculate.
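A sketch of a single Newton step under these definitions (the quadratic test function and its derivatives are assumptions for illustration):

```python
import numpy as np

def newton_step(J, H, x):
    """Solve H(x) dx = -J(x) and apply the update."""
    dx = np.linalg.solve(H(x), -J(x))
    return x + dx

# Toy example: F(x) = 0.5 * ||A x - b||^2, J(x) = A^T (A x - b), H(x) = A^T A
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
J = lambda x: A.T @ (A @ x - b)
H = lambda x: A.T @ A
print(newton_step(J, H, np.zeros(2)))  # a quadratic is minimized in one step
```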

Gauss-Newton

Gauss-Newton solves the problem from a least-squares perspective. The objective $F(x)$, which is a scalar, usually comes from

$$F(x) = \frac{1}{2} \| f(x) \|^2$$

where $f(x)$ is a vector of residuals. This method applies a first-order Taylor expansion to $f(x)$. We have

$$f(x + \Delta x) \approx f(x) + J(x) \Delta x$$

So

$$F(x + \Delta x) \approx \frac{1}{2} \| f(x) + J(x) \Delta x \|^2 = \frac{1}{2} \left( \| f(x) \|^2 + 2 f(x)^T J(x) \Delta x + \Delta x^T J(x)^T J(x) \Delta x \right)$$

Calculating the gradient w.r.t. $\Delta x$ and setting it to 0, we have

$$J(x)^T J(x) \Delta x = -J(x)^T f(x)$$

Note that here the Jacobian $J(x)$ is the first derivative of $f(x)$ w.r.t. $x$, and $J(x)^T J(x)$ serves as an approximation of the Hessian. This is different from the Newton method, where $J(x)$ and $H(x)$ are derivatives of $F(x)$.

However, the Gauss-Newton method still has problems. First, $J(x)^T J(x)$ needs to be invertible (positive definite), but this is not guaranteed. Second, if $\Delta x$ is big, the above Taylor expansion is a bad approximation. The LM method was invented to solve these two problems.
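A minimal Gauss-Newton sketch that iterates the normal equation above (the exponential-curve residual and the synthetic data are made-up assumptions for illustration):

```python
import numpy as np

def gauss_newton(f, Jf, x0, iters=20):
    """Repeatedly solve J^T J dx = -J^T f and update the parameters."""
    x = x0
    for _ in range(iters):
        r, J = f(x), Jf(x)
        dx = np.linalg.solve(J.T @ J, -J.T @ r)
        x = x + dx
    return x

# Toy problem: fit y = exp(a*t + b); residuals f_i(x) = exp(a*t_i + b) - y_i
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 20)
y = np.exp(0.8 * t + 0.2) + 0.01 * rng.standard_normal(20)
f = lambda x: np.exp(x[0] * t + x[1]) - y
Jf = lambda x: np.stack([t * np.exp(x[0] * t + x[1]),       # d f_i / d a
                         np.exp(x[0] * t + x[1])], axis=1)  # d f_i / d b
print(gauss_newton(f, Jf, np.array([0.0, 0.0])))  # should roughly recover (0.8, 0.2)
```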

Levenberg–Marquardt (LM)

The main motivation of LM is to control the size of the update step $\Delta x$. It optimizes the objective below

$$\min_{\Delta x} \frac{1}{2} \| f(x) + J(x) \Delta x \|^2 \quad \text{s.t.} \quad \| \Delta x \|^2 \le \mu$$

where $\mu$ is the radius of the trust region (the region in which we are confident the linear approximation holds), and it is updated at each optimization step.

The update rule for $\mu$:

For each optimization step, we calculate the gain ratio

$$\rho = \frac{F(x) - F(x + \Delta x)}{L(0) - L(\Delta x)}, \qquad L(\Delta x) = \frac{1}{2} \| f(x) + J(x) \Delta x \|^2$$

i.e. the ratio between the actual decrease of the objective and the decrease predicted by the linear approximation.

If $\rho$ is close to 1, the local linear approximation is good. If $\rho$ is larger than 1, the actual decrease is more than the approximation predicts, which means $F(x)$ is decreasing faster than expected, and we should set a bigger trust region radius $\mu$ to speed up the decrease. If $\rho$ is much smaller than 1, the actual decrease is less than the approximation predicts, which means the optimization is entering a flat (or highly nonlinear) area, and we should set a smaller radius $\mu$.
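The exact thresholds and scaling factors for adjusting $\mu$ are implementation choices that the text above leaves open; one commonly used scheme, shown here only as a concrete example, is:

$$
\mu \leftarrow
\begin{cases}
2\mu, & \rho > \tfrac{3}{4} \\
\tfrac{1}{2}\mu, & \rho < \tfrac{1}{4} \\
\mu, & \text{otherwise.}
\end{cases}
$$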

The above constrained objective can be converted to its dual form by applying a Lagrange multiplier $\lambda \ge 0$:

$$\min_{\Delta x} \frac{1}{2} \| f(x) + J(x) \Delta x \|^2 + \frac{\lambda}{2} \| \Delta x \|^2$$

The optimal $\Delta x$ satisfies

$$\left( J(x)^T J(x) + \lambda I \right) \Delta x = -J(x)^T f(x)$$

If $\lambda$ is big, which means the constraint is active ($\lambda > 0$) and the unconstrained step would go beyond the trust region, the update becomes $\Delta x \approx -\frac{1}{\lambda} J(x)^T f(x)$ (i.e. a Gradient Descent update with step size $1/\lambda$, since $J(x)^T f(x)$ is the gradient of $F(x)$). If $\lambda$ is small, then the update is like Gauss-Newton.
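Putting the pieces together, here is a simplified LM sketch that solves the damped normal equation and adjusts the damping from the gain ratio (the identity scaling, the 0.75 threshold, and the halving/doubling factors are illustrative assumptions, not the original implementation):

```python
import numpy as np

def levenberg_marquardt(f, Jf, x0, lam=1e-2, iters=50):
    """Solve (J^T J + lam*I) dx = -J^T f and adapt lam from the gain ratio."""
    x = x0
    cost = 0.5 * np.sum(f(x) ** 2)
    for _ in range(iters):
        r, J = f(x), Jf(x)
        dx = np.linalg.solve(J.T @ J + lam * np.eye(len(x)), -J.T @ r)
        new_cost = 0.5 * np.sum(f(x + dx) ** 2)
        predicted = cost - 0.5 * np.sum((r + J @ dx) ** 2)  # decrease of the linear model
        rho = (cost - new_cost) / max(predicted, 1e-12)     # actual / predicted decrease
        if rho > 0:                  # the step reduced the cost: accept it
            x, cost = x + dx, new_cost
            if rho > 0.75:           # good agreement: trust the model more (less damping)
                lam *= 0.5
        else:                        # the step failed: damp more (smaller trust region)
            lam *= 2.0
    return x
```

Called with the same `f` and `Jf` as in the Gauss-Newton example above, it behaves like Gauss-Newton when `lam` is small and like a short gradient step when `lam` is large.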

An SGD variant: Adam

The LM method adaptively determines the update step. In gradient descent, there is also a famous adaptive method, called Adam, which has been widely used in neural network optimization. Here is the Adam update rule, where $g_t = \nabla_\theta F(\theta_{t-1})$ is the gradient at step $t$ (check here for other introductions):

$$\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= m_t / (1 - \beta_1^t) \\
\hat{v}_t &= v_t / (1 - \beta_2^t) \\
\theta_t &= \theta_{t-1} - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t
\end{aligned}$$

Usually, $\beta_1 = 0.9$ and $\beta_2 = 0.999$; $\alpha$ is the learning rate.

The first two equations are momentum updates (i.e. exponential moving averages of the gradient and its element-wise square). The third and fourth equations are bias corrections: because $m_t$ and $v_t$ are initialized to zero, they are underestimated at the beginning (i.e. biased toward 0), and since $\beta_1^t$ and $\beta_2^t$ gradually decrease to 0 as the optimization proceeds, the correction fades away over time. The last equation is the parameter update rule: the learning rate is adjusted by $\sqrt{\hat{v}_t}$. Obviously, at the beginning $\hat{v}_t$ is small ($v_t$ has just started to accumulate squared gradients), so the effective step is relatively large; after many iterations $\hat{v}_t$ can become big, and the effective learning rate is adjusted toward 0.
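A minimal NumPy sketch of these update equations for a single parameter vector (an illustrative implementation with the usual default hyperparameters, not a framework optimizer):

```python
import numpy as np

def adam(grad, x0, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, iters=1000):
    """Adam: bias-corrected moving averages of the gradient and its square."""
    x = x0
    m = np.zeros_like(x0)  # first moment (momentum)
    v = np.zeros_like(x0)  # second moment (squared gradients)
    for t in range(1, iters + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # bias correction for the first moment
        v_hat = v / (1 - beta2 ** t)  # bias correction for the second moment
        x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x
```

The small constant $\epsilon$ only guards against division by zero when $\hat{v}_t$ is tiny.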
