Nonlinear optimization
For an intuitive explanation of Lagrangian duality in optimization, please check here for a nice geometric explanation, and here for the mathematical explanation.
Nonlinear optimization
A nonlinear optimization problem can be written as

$$\min_{x} F(x)$$

where $F(x)$ is a nonlinear function w.r.t. $x$. Since $F(x)$ can be extremely complicated, we may not be able to derive the analytical solution of $\frac{\mathrm{d}F}{\mathrm{d}x} = 0$ explicitly. Instead, we usually apply iterative methods to optimize the objective, even though they may fall into a local optimum.
There are four common iterative methods: Gradient Descent, the Newton method, the Gauss-Newton method, and Levenberg–Marquardt (LM).
Gradient Descent
Given a step size hyperparameter $\lambda$, the update rule is

$$x_{k+1} = x_k - \lambda \nabla F(x_k)$$
This optimization strategy (or rather its variant, SGD) has been widely adopted in neural network training.
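As a minimal illustration, here is a sketch of this update rule on a made-up quadratic toy objective; the objective `grad_F`, the step size, and the iteration count are arbitrary choices for the example, not anything prescribed above.

```python
import numpy as np

def grad_F(x):
    # Gradient of the toy objective F(x) = (x0 - 1)^2 + 2 * (x1 + 3)^2
    return np.array([2.0 * (x[0] - 1.0), 4.0 * (x[1] + 3.0)])

x = np.zeros(2)                  # initial guess
lr = 0.1                         # step size hyperparameter (lambda above)
for _ in range(100):
    x = x - lr * grad_F(x)       # x_{k+1} = x_k - lambda * grad F(x_k)

print(x)                         # converges toward the minimizer (1, -3)
```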
Newton method
The Newton method tries to solve the objective

$$\min_{\Delta x} F(x + \Delta x)$$
It applies the second-order Taylor expansion

$$F(x + \Delta x) \approx F(x) + J(x)^{T}\Delta x + \frac{1}{2}\Delta x^{T} H(x)\,\Delta x$$

Note that the Jacobian $J(x)$ and the Hessian $H(x)$ are the first and second derivatives of $F(x)$ w.r.t. $x$.
To minimize the above approximation, we calculate its gradient w.r.t. $\Delta x$ and set it equal to 0. We have

$$H(x)\,\Delta x = -J(x)$$
However, the second derivative (the Hessian matrix $H(x)$) is usually expensive to calculate.
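For concreteness, here is a sketch of one way to implement the Newton update $H(x)\Delta x = -J(x)$ on the same kind of toy quadratic; the objective and its derivatives are again made up for illustration, and the linear system is solved directly instead of forming $H(x)^{-1}$.

```python
import numpy as np

def jacobian(x):
    # First derivative J(x) of the toy objective F(x) = (x0 - 1)^2 + 2 * (x1 + 3)^2
    return np.array([2.0 * (x[0] - 1.0), 4.0 * (x[1] + 3.0)])

def hessian(x):
    # Second derivative H(x); constant because the toy objective is quadratic
    return np.array([[2.0, 0.0],
                     [0.0, 4.0]])

x = np.zeros(2)
for _ in range(5):
    dx = np.linalg.solve(hessian(x), -jacobian(x))   # solve H(x) dx = -J(x)
    x = x + dx

print(x)   # a quadratic objective is minimized in a single Newton step: (1, -3)
```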
Gauss-Newton
Gauss-Newton solves the problem from a least-squares perspective. The scalar objective $F(x)$ usually comes from $F(x) = \frac{1}{2}\lVert f(x) \rVert^{2}$, where $f(x)$ is a vector of residuals. This method applies a first-order Taylor expansion to $f(x)$. We have

$$f(x + \Delta x) \approx f(x) + J(x)\,\Delta x$$
So

$$F(x + \Delta x) \approx \frac{1}{2}\lVert f(x) + J(x)\,\Delta x \rVert^{2} = \frac{1}{2}\left( \lVert f(x) \rVert^{2} + 2 f(x)^{T} J(x)\,\Delta x + \Delta x^{T} J(x)^{T} J(x)\,\Delta x \right)$$
Calculating the gradient w.r.t. $\Delta x$ and setting it to 0, we have

$$J(x)^{T} J(x)\,\Delta x = -J(x)^{T} f(x)$$

Note that here the Jacobian $J(x)$ is the first derivative of $f(x)$ w.r.t. $x$, and $J(x)^{T} J(x)$ plays the role of the Hessian $H(x)$. This is different from the Newton method.
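To make the normal equations concrete, here is a sketch of a plain Gauss-Newton loop on a small made-up least-squares problem (the Rosenbrock function written as a residual vector); the residuals, Jacobian, starting point, and iteration count are all illustrative choices.

```python
import numpy as np

def f(x):
    # Residual vector; F(x) = 0.5 * ||f(x)||^2 = 50 * (x1 - x0^2)^2 + 0.5 * (1 - x0)^2
    return np.array([10.0 * (x[1] - x[0] ** 2), 1.0 - x[0]])

def jac(x):
    # Jacobian of the residual vector f(x) w.r.t. x
    return np.array([[-20.0 * x[0], 10.0],
                     [-1.0,          0.0]])

x = np.array([-1.2, 1.0])                        # classic Rosenbrock starting point
for _ in range(10):
    J, r = jac(x), f(x)
    dx = np.linalg.solve(J.T @ J, -J.T @ r)      # (J^T J) dx = -J^T f(x)
    x = x + dx

print(x)                                         # converges to the minimizer (1, 1)
```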
However, the Gauss-Newton method still has problems. First, $J(x)^{T} J(x)$ needs to be invertible, but this is not guaranteed (it is only positive semi-definite in general). Second, if $\Delta x$ is big, the above Taylor expansion is a bad approximation. To solve these two problems, the LM method was invented.
Levenberg–Marquardt (LM)
The main motivation of LM is to control the size of the update step $\Delta x$. It optimizes the objective below

$$\min_{\Delta x} \frac{1}{2}\lVert f(x) + J(x)\,\Delta x \rVert^{2} \quad \text{s.t.} \quad \lVert \Delta x \rVert^{2} \le \mu$$
where $\mu$ is the radius of the confidence (trust) region, which will be updated at each optimization step.
The update rule for the confidence region works as follows. For each optimization step, we calculate the gain ratio

$$\rho = \frac{F(x) - F(x + \Delta x)}{L(0) - L(\Delta x)}$$

where $L(\Delta x) = \frac{1}{2}\lVert f(x) + J(x)\,\Delta x \rVert^{2}$ is the linearized model, so the numerator is the actual decrease of the objective and the denominator is the decrease predicted by the approximation.
If $\rho$ is close to 1, the local linear approximation holds and the step is good. If $\rho$ is larger than 1, the actual decrease is more than the approximation predicts, which means the objective is decreasing faster than expected, and we should set a bigger confidence region $\mu$ to speed up the decrease. If $\rho$ is much smaller than 1, the actual decrease is less than predicted, which means we are entering an area where the approximation is poor (e.g., a flat region), and we should set a smaller confidence region $\mu$.
The above constrained objective can be converted to an unconstrained (dual) form by applying a Lagrange multiplier $\lambda$:

$$\min_{\Delta x} \frac{1}{2}\lVert f(x) + J(x)\,\Delta x \rVert^{2} + \frac{\lambda}{2}\lVert \Delta x \rVert^{2}$$
The optimal $\Delta x$ satisfies

$$\left( J(x)^{T} J(x) + \lambda I \right)\Delta x = -J(x)^{T} f(x)$$
If $\lambda$ is big, which means the constraint is active ($\lambda > 0$, i.e. the unconstrained step would exceed the confidence region), then $\lambda I$ dominates and $\Delta x \approx -\frac{1}{\lambda} J(x)^{T} f(x)$, i.e. a Gradient Descent update with step size $\frac{1}{\lambda}$. If $\lambda$ is small, then the update is close to the Gauss-Newton step.
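Below is a rough sketch of how these pieces could fit together in code: a damped normal-equation solve, the gain ratio $\rho$, and a simple rule for growing or shrinking the damping $\lambda$. The acceptance threshold and the factors 0.5 / 2.0 are arbitrary illustrative choices; real implementations (e.g. in Ceres or SciPy) use more careful schedules.

```python
import numpy as np

def levenberg_marquardt(f, jac, x, lam=1e-3, iters=50):
    """Minimize F(x) = 0.5 * ||f(x)||^2 with a simple damping schedule."""
    for _ in range(iters):
        r, J = f(x), jac(x)
        # Damped normal equations: (J^T J + lam * I) dx = -J^T f(x)
        dx = np.linalg.solve(J.T @ J + lam * np.eye(x.size), -J.T @ r)

        # Gain ratio rho: actual decrease vs. decrease predicted by the linear model
        actual = 0.5 * r @ r - 0.5 * f(x + dx) @ f(x + dx)
        predicted = -(J.T @ r) @ dx - 0.5 * dx @ (J.T @ J) @ dx
        rho = actual / predicted if predicted > 0 else 0.0

        if rho > 0:          # the step decreases the objective: accept it
            x = x + dx
            lam *= 0.5       # trust the approximation more -> closer to Gauss-Newton
        else:                # bad step: reject it and increase the damping
            lam *= 2.0       # closer to a small gradient-descent step
    return x

# Reusing the Rosenbrock residuals f and jac from the Gauss-Newton sketch above:
# levenberg_marquardt(f, jac, np.array([-1.2, 1.0]))  ->  approximately (1, 1)
```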
An SGD variant: Adam
The LM method adaptively determines the update step. In gradient descent, there is also a famous adaptive method, called Adam, which has been widely used in neural network optimization. Here is the Adam update rule (check here for an introduction to other optimizers):

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t} \\
\hat{v}_t &= \frac{v_t}{1 - \beta_2^t} \\
\theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t
\end{aligned}
$$

where $g_t$ is the gradient of the objective w.r.t. the parameters $\theta_t$ at step $t$.
Usually, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, and $\eta$ is the learning rate.
The first two equations are momentum updates (i.e. exponential moving averages of the gradient and the squared gradient). The third and fourth equations are bias corrections, because $m_t$ and $v_t$ are underestimated at the beginning (they are initialized to 0); $\beta_1^t$ and $\beta_2^t$ gradually decrease to 0 as the optimization proceeds, so the correction vanishes. The last equation is the parameter update rule: the learning rate is effectively adjusted by $\frac{1}{\sqrt{\hat{v}_t} + \epsilon}$. At the beginning $\hat{v}_t$ is small ($v_t$ is used to accumulate the squared gradient), so the effective step is relatively large. But after many iterations $\hat{v}_t$ can become big, and the effective learning rate is scaled down toward 0.
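For reference, the update rule above transcribes almost line by line into NumPy. The sketch below uses the common default hyperparameters; the toy gradient and the learning rate in the usage line are made up for the example.

```python
import numpy as np

def adam(grad, theta, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    m = np.zeros_like(theta)          # first moment: moving average of the gradient
    v = np.zeros_like(theta)          # second moment: moving average of the squared gradient
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the first moment
        v_hat = v / (1 - beta2 ** t)  # bias correction for the second moment
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy usage: minimize F(theta) = ||theta - 3||^2, whose gradient is 2 * (theta - 3)
print(adam(lambda th: 2.0 * (th - 3.0), np.zeros(3), lr=0.1))   # approaches (3, 3, 3)
```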