This post introduces the article at https://distill.pub/2017/momentum/.
The article describes some properties and shortcomings of gradient descent.
Its main focus is how momentum can be used to overcome those shortcomings.
Table of Contents
GD and Momentum
More analysis of GD
More analysis of Momentum
Other
Let \(f:\mathbb R^n \longrightarrow \mathbb R\) be a differentiable function.
Given a random point \(w^0 \in \mathbb R^n.\)
Given any \(\alpha\in (0,\infty),\) which we call the step size.
We iterate \(w^{k}\) by \[ \begin{aligned} w^{k+1} = w^k - \alpha \nabla f(w^k). \end{aligned} \]
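As a small illustration (not part of the original article), here is a minimal Python sketch of this update rule; `grad_f` is a hypothetical user-supplied function computing \(\nabla f.\)

```python
import numpy as np

def gradient_descent(grad_f, w0, alpha, num_steps):
    """Iterate w^{k+1} = w^k - alpha * grad_f(w^k)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(num_steps):
        w = w - alpha * grad_f(w)
    return w

# Example: f(w) = ||w||^2 / 2, so grad_f(w) = w and the minimizer is 0.
w_final = gradient_descent(lambda w: w, w0=[1.0, -2.0], alpha=0.1, num_steps=100)
```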
Note that for each fixed \(w\in \mathbb R^n,\) \[ \begin{aligned} f(w+h) = f(w)+ \nabla f(w)^{\mathtt T}\cdot h + o(\left\lVert h \right\rVert) \end{aligned} \] for \(\left\lVert h \right\rVert \rightarrow 0.\)
We choose \(h=-\alpha \nabla f(w).\)
Since \(\color{blue}{\nabla f(w)^{\mathtt{T}}\cdot h = -\alpha \left\lVert \nabla f(w) \right\rVert^2 < 0}\) whenever \(\nabla f(w)\neq 0,\) we have \[
\begin{aligned}
f\bigl( \underbrace{w-\alpha \nabla f(w)}_{w^{k+1}} \bigr) = f(w+h) < f(\underbrace{w}_{w^k})
\end{aligned}
\]
for \(\alpha>0\) small enough.
Question
What is the range of \(\alpha\)?
GD often gets stuck in local minima.
GD can be very slow when the function has pathological curvature, e.g. a long, shallow, narrow valley.
How can we fix this?
Momentum was proposed by Polyak (1964).
The idea of GD is the method of steepest descent. But we do not need every step to be the steepest.
We may consider how a ball descends to its lowest point.
Given a random point \(w^0 \in \mathbb R^n.\)
Set \(z^0=\mathbf{0}\in \mathbb R^n.\)
Given any \(\alpha,\beta\in (0,\infty).\)
We give gradient descent a short-term memory:
We iterate \(z^k,w^k\) by \[ \begin{aligned} z^{k+1} &= \beta z^k + \nabla f(w^k), \cr w^{k+1} &= w^k - \alpha z^{k+1}. \end{aligned} \tag{1}\]
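Again as a sketch only (not from the article), the momentum iteration in Python; as before, `grad_f` is a hypothetical gradient function.

```python
import numpy as np

def momentum(grad_f, w0, alpha, beta, num_steps):
    """Iterate z^{k+1} = beta*z^k + grad_f(w^k), w^{k+1} = w^k - alpha*z^{k+1}."""
    w = np.asarray(w0, dtype=float)
    z = np.zeros_like(w)          # z^0 = 0
    for _ in range(num_steps):
        z = beta * z + grad_f(w)  # short-term memory of past gradients
        w = w - alpha * z
    return w

# Example: same quadratic f(w) = ||w||^2 / 2 as before.
w_final = momentum(lambda w: w, w0=[1.0, -2.0], alpha=0.1, beta=0.9, num_steps=100)
```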
Note that \[ \begin{aligned} \nabla f(w^k)^{\mathtt{T}} \cdot (-1)\bigl( \underbrace{\beta z^k + \nabla f(w^k)}_{z^{k+1}} \bigr) < 0 \end{aligned} \] for \(\beta\) small enough.
This shows that \[
\begin{aligned}
f\bigl( \underbrace{w^k-\alpha z^{k+1}}_{w^{k+1}} \bigr) < f(w^k)
\end{aligned}
\]
for \(\alpha,\beta>0\) small enough.
Question
What is the range of \(\alpha,\beta\)?
Looking at this example, the following questions naturally arise:
Question
How slowly does GD converge?
To measure “how slowly”, we need to introduce the convergence speed.
Question
In GD, what’s the range of \(\alpha\)?
Question
In Momentum, what’s the range of \(\alpha,\beta\)?
Question
How much does Momentum accelerate?
Question
Momentum has the effect of increasing the range of allowed step sizes. Why? How much is the increase?
A sequence \(\lbrace x_k \rbrace_k\) that converges to \(x^*\) (in a norm \(\left\lVert \cdot \right\rVert\)) is said to have convergence rate \(\mu\) (order \(1\)) if \[ \begin{aligned} \lim_{n\rightarrow\infty} \frac{\left\lVert x_{n+1} - x^* \right\rVert}{\left\lVert x_{n} -x^* \right\rVert} = \mu. \end{aligned} \]
For example:
If \(x_k = (1/3)^kc_0,\) then the convergence rate of \(\lbrace x_k \rbrace_k\) is \(1/3.\)
If \(x_k = (1/2)^kc_1 + (-1/3)^k c_2,\) then the convergence rate of \(\lbrace x_k \rbrace\) is \(1/2.\)
Roughly speaking, if \(\lbrace x_k \rbrace,\lbrace y_k \rbrace\) have convergence rates \(\mu_1,\mu_2\) respectively and \(0<\mu_1<\mu_2<1,\) then \(\lbrace x_k \rbrace_k\) converges faster than \(\lbrace y_k \rbrace_k.\)
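A quick numerical check of the second example (assuming \(c_1 = c_2 = 1\)): the ratios of successive errors approach \(1/2.\)

```python
# Ratios |x_{k+1}| / |x_k| for x_k = (1/2)^k + (-1/3)^k; here x* = 0.
x = [(0.5) ** k + (-1.0 / 3.0) ** k for k in range(30)]
ratios = [abs(x[k + 1]) / abs(x[k]) for k in range(29)]
print(ratios[-5:])  # the last few ratios are close to 0.5
```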
Table of Contents
GD and Momentum
More analysis of GD
More analysis of Momentum
Other
From now on, we consider the convex quadratic:
Assume \(A\) is symmetric and positive definite (hence invertible). Let \[ \begin{aligned} f(w) = \frac{1}{2} w^{\mathtt{T}} A w - b^{\mathtt{T}}w, \quad w \in \mathbb R^n. \end{aligned} \]
\(\nabla f(w) = Aw-b.\)
\(w^* := \arg \min\limits_{w\in \mathbb R^n} \,f(w)= A^{-1}b .\)
\(A\) has an eigenvalue decomposition: There exist \(\lambda_1,\cdots,\lambda_n\in (0,\infty)\) and orthonormal \(q_1,\cdots,q_n\in \mathbb R^n\) such that \[ \begin{aligned} A q_i = \lambda_i q_i,\quad \forall i =1\sim n. \end{aligned} \] Or, \[ \begin{aligned} A = Q \mathtt{diag}(\lambda_1,\lambda_2,\cdots,\lambda_n) Q^{\mathtt{T}}, \quad Q = \begin{bmatrix} & & \\ q_1 & \cdots & q_n \\ & & \\\end{bmatrix}. \end{aligned} \]
WLOG, we may assume \(\lambda_1 \leq \cdots \leq \lambda_n.\)
If we change a basis \(x=Q^{\mathtt{T}}(w-w^*),\) then the iterations break apart: For each \(i = 1,\cdots, n,\) \[ \begin{aligned} x_i^{k+1} &= x_i^k - \alpha \lambda_i x_i^k \cr &= (1 - \alpha \lambda_i) x_i^k = \cdots =(1-\alpha \lambda_i)^{k+1} x_i^0, \quad k\in \mathbb N. \end{aligned} \]
Recall that \(x_i^{k} = (1- \alpha\lambda_i)^k x_i^0.\) So we have \[ \begin{aligned} w^k-w^* = Qx^k &=\sum_{i=1}^n (1- \alpha\lambda_i)^k x_i^0 q_i. \end{aligned} \]
Hence, \[ \begin{aligned} f(w^k) - f(w^*) &= \frac{1}{2}\sum_{i=1}^n (1-\alpha \lambda_i)^{2k} (x_i^0)^2 \lambda_i. \end{aligned} \]
The decomposition above gives immediate guidance on how to set the step size \(\alpha.\)
For \(\lbrace w^k \rbrace_k\) and \(\lbrace f(w^k) \rbrace_k\) to converge, a workable step size \(\alpha\) must satisfy, in each \(i\)-th component, \[ \begin{aligned} \left\lvert 1-\alpha \lambda_i\right\rvert < 1. \end{aligned} \]
This inequality is equivalent to \[ \begin{aligned} \color{red}{{0 < \alpha \lambda_i < 2.}} \end{aligned} \]
The convergence rate of \(w^k\) to \(w^*\) is \[ \begin{aligned} \text{rate}(\alpha) &= \max_{i=1,\cdots,n} \left\lvert 1-\alpha \lambda_i \right\rvert \cr &= \max \bigl\lbrace \left\lvert 1-\alpha \lambda_1 \right\rvert , \left\lvert 1-\alpha \lambda_n \right\rvert \bigr\rbrace. \end{aligned} \]
Note that \[ \begin{aligned} \text{rate}(\alpha) < 1 &\iff \left\lvert 1-\alpha \lambda_1 \right\rvert < 1, \, \left\lvert 1-\alpha \lambda_n \right\rvert < 1 \cr &\iff 0 < \alpha < 2 / \lambda_n \end{aligned} \] and \[ \begin{aligned} \arg\min_{\alpha}\, \text{rate}(\alpha) &= \frac{2}{\lambda_1+\lambda_n}, \cr \min_{\alpha} \text{rate} (\alpha) &= {\color{red}{\frac{\lambda_n/\lambda_1-1}{\lambda_n/\lambda_1+1}}}. \end{aligned} \]
Notice the ratio \(\kappa := \lambda_n/\lambda_1\) determines the convergence rate of the problem.
This is the condition number of \(A.\)
The larger \(\kappa\) is, the slower the convergence.
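As a sanity check (my own sketch, with a made-up spectrum), one can evaluate \(\text{rate}(\alpha)\) on a grid and confirm that the minimum is attained near \(2/(\lambda_1+\lambda_n)\) with value \((\kappa-1)/(\kappa+1).\)

```python
import numpy as np

lambdas = np.array([1.0, 3.0, 10.0])   # made-up eigenvalues, lambda_1 <= ... <= lambda_n
lam1, lamn = lambdas[0], lambdas[-1]
kappa = lamn / lam1

rate = lambda a: np.max(np.abs(1.0 - a * lambdas))    # rate(alpha)

alphas = np.linspace(1e-3, 2.0 / lamn, 10_000)        # step sizes with rate < 1
best = alphas[np.argmin([rate(a) for a in alphas])]

print(best, 2.0 / (lam1 + lamn))                      # both close to 2/(lambda_1 + lambda_n)
print(rate(best), (kappa - 1.0) / (kappa + 1.0))      # both close to (kappa-1)/(kappa+1)
```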
Table of Contents
GD and Momentum
More analysis of GD
More analysis of Momentum
Other
Recall the momentum:
Given a random point \(w^0 \in \mathbb R^n.\)
Set \(z^0=\mathbf{0}\in \mathbb R^n.\)
Given any \(\alpha,\beta\in (0,\infty).\)
Similarly, with the change of basis \[ \begin{aligned} x^k = Q^{\mathtt{T}}(w^k-w^*), \quad y^k = Q^{\mathtt{T}} z^k, \end{aligned} \] the iterations break apart again. Hence, (1) becomes \[ \begin{aligned} y_i^{k+1} &= \beta y_i^k + \lambda_i x_i^k, \cr x_i^{k+1} &= x_i^k - \alpha y_i^{k+1} \end{aligned} \tag{2}\] in each \(i\)-th component.
For each \(i,\) we rewrite (2) as \[ \begin{aligned} \begin{bmatrix} y_i^k \\ x_i^k \\\end{bmatrix} = R_i^k \begin{bmatrix} y_i^0 \\ x_i^0 \\\end{bmatrix}, \quad R_i = \begin{bmatrix} \beta & \lambda_i \\ -\alpha \beta & 1-\alpha \lambda_i \\\end{bmatrix}. \end{aligned} \]
For convenience, we omit \(i\) here.
Let \(\sigma_1,\sigma_2\) be the two eigenvalues of \(R= R_i = \begin{bmatrix} \beta & \lambda \\ -\alpha \beta & 1-\alpha \lambda \\\end{bmatrix}.\)
By Williams (1992), \[ \begin{aligned} R^k = \begin{cases} \sigma_1^k R(1) - \sigma_2^k R(2) , & \sigma_1 \neq \sigma_2 ; \cr \sigma_1^k \bigl( k R / \sigma_1 - (k-1)I_2 \bigr) , & \sigma_1 = \sigma_2, \end{cases} \qquad R(1) = \frac{R - \sigma_2 I_2}{\sigma_1 - \sigma_2}, \quad R(2) = \frac{R - \sigma_1 I_2}{\sigma_1 - \sigma_2}. \end{aligned} \] Thus, the convergence rate of \(\lbrace x_i^k \rbrace_k\) is \(\max\lbrace \left\lvert \sigma_1 \right\rvert , \left\lvert \sigma_2 \right\rvert \rbrace.\)
For the values of \(\sigma_1,\sigma_2\):
A short calculation gives \[ \begin{aligned} \sigma_1 &= \frac{1}{2} \bigl( 1-\alpha \lambda + \beta + \sqrt{( -\alpha \lambda + \beta + 1 )^2 - 4\beta } \bigr), \cr \sigma_2 &= \frac{1}{2} \bigl( 1-\alpha \lambda + \beta - \sqrt{( -\alpha \lambda + \beta + 1 )^2 - 4\beta } \bigr). \end{aligned} \]
When \(( -\alpha \lambda + \beta + 1 )^2 - 4\beta < 0,\) we have \(\sigma_1,\sigma_2\in \mathbb C\backslash \mathbb R\) and \[ \begin{aligned} \left\lvert \sigma_1 \right\rvert = \left\lvert \sigma_2 \right\rvert = \frac{1}{2}\sqrt{ (1-\alpha\lambda+\beta)^2 + \left\lvert ( -\alpha \lambda + \beta + 1 )^2 - 4\beta \right\rvert } = \sqrt \beta. \end{aligned} \]
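A quick numerical check (illustration only, with made-up parameters): in this complex regime, both eigenvalues of \(R\) have modulus \(\sqrt\beta.\)

```python
import numpy as np

alpha, beta, lam = 0.1, 0.9, 1.0                # (1 - alpha*lam + beta)^2 - 4*beta < 0 here
R = np.array([[beta, lam],
              [-alpha * beta, 1.0 - alpha * lam]])
sigma = np.linalg.eigvals(R)                    # complex conjugate pair
print(np.abs(sigma), np.sqrt(beta))             # both moduli equal sqrt(beta)
```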
The range of \(\alpha\) gets bigger when we add momentum: \[ \begin{aligned} 0<\alpha \lambda_i < 2 \stackrel{\text{with momentum}}{\Longrightarrow} 0<\alpha\lambda_i<2+2\beta. \end{aligned} \]
What is the best choice of \(\alpha\) and \(\beta\)?
Recall that in GD without momentum:
The convergence rate of \(w^k_i\) to \(w^*_i\) (\(i\)-th component) is \(\left\lvert 1-\alpha \lambda_i \right\rvert;\)
The convergence rate of \(w^k\) to \(w^*\) is \[ \begin{aligned} \text{rate}(\alpha) = \max \bigl\lbrace \left\lvert 1-\alpha \lambda_1 \right\rvert , \left\lvert 1-\alpha \lambda_n \right\rvert \bigr\rbrace; \end{aligned} \]
\[ \begin{aligned} \arg \min_{\alpha} \, \text{rate}(\alpha) &= \frac{2}{\lambda_1+\lambda_n}, \cr \min_{\alpha} \text{rate} (\alpha) &= \frac{\lambda_n/\lambda_1-1}{\lambda_n/\lambda_1+1}. \end{aligned} \]
With Momentum?
We first analyze the \(i\)-th component with \(\alpha\) fixed.
Recall that \[ \begin{aligned} y_i^{k+1} &= \beta y_i^k + \lambda_i x_i^k, \cr x_i^{k+1} &= x_i^k - \alpha y_i^{k+1}, \end{aligned} \] in each \(i\)-th component.
Or, \[ \begin{aligned} \begin{bmatrix} y_i^k \\ x_i^k \\\end{bmatrix} = R_i^k \begin{bmatrix} y_i^0 \\ x_i^0 \\\end{bmatrix}, \qquad R_i = \begin{bmatrix} \beta & \lambda_i \\ -\alpha \beta & 1-\alpha \lambda_i \\\end{bmatrix}. \end{aligned} \]
The convergence rate of \(\lbrace x_i^k \rbrace_k\) is \(\max\lbrace \left\lvert \sigma_1(i) \right\rvert , \left\lvert \sigma_2(i) \right\rvert \rbrace,\) where \(\sigma_1(i),\sigma_2(i)\) are two eigenvalues of \(R_i.\)
Here we can take a physical point of view: the momentum iteration behaves like a damped harmonic oscillator, and the best choice of \(\beta\) corresponds to critical damping.
The critical case happens when \(\sigma_1(i)=\sigma_2(i),\) i.e. when \(\beta = (1-\sqrt{\alpha\lambda_i})^2.\)
The minimum convergence rate (over \(\beta\)) is then \(\sqrt\beta = \color{red}{1-\sqrt{\alpha \lambda_i}}.\)
The minimum convergence rate of \(w^k_i\) to \(w^*_i\) (\(i\)-th component) varies as follows: \[ \begin{aligned} 1-\alpha\lambda_i \stackrel{\text{with momentum}}{\Longrightarrow} 1-\sqrt{\alpha \lambda_i}. \end{aligned} \] But this only applies to the error in the \(i\)-th component with \(\alpha\) fixed.
To get a global convergence rate, we must optimize over both \(\alpha,\beta.\)
The convergence rate of \(w^k\) to \(w^*\) is \[ \begin{aligned} \text{rate}(\alpha,\beta) = \max_{i=1,\cdots,n} \Bigl\lbrace \max\bigl\lbrace \big\lvert \sigma_1(i) \big\rvert , \left\lvert \sigma_2(i) \right\rvert \bigr\rbrace \Bigr\rbrace. \end{aligned} \]
The solutions of \(\arg\min\limits_{\alpha,\beta}\text{ rate}(\alpha,\beta)\) are \[ \begin{aligned} \alpha = \biggl( \frac{2}{\sqrt{\lambda_1}+ \sqrt{\lambda_n}} \biggr)^2, \quad \beta = \biggl( \frac{\sqrt{\lambda_n}-\sqrt{\lambda_1}}{\sqrt{\lambda_n}+ \sqrt{\lambda_1}} \biggr)^2. \end{aligned} \] Plugging this into \(\text{rate}(\alpha,\beta)\) gives \[ \begin{aligned} {\color{red}{\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}.}} \end{aligned} \]
So we get how much momentum accelerates the convergence rate: \[ \begin{aligned} \frac{\kappa-1}{\kappa +1}\stackrel{\text{with momentum}}{\Longrightarrow} \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}. \end{aligned} \]
Let’s look at a small numerical example (sketched below) to see that momentum really does have an accelerating effect:
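The following sketch is my own illustration, not from the article: it compares GD and momentum on a made-up ill-conditioned diagonal quadratic, each with the optimal parameters derived above.

```python
import numpy as np

# Ill-conditioned diagonal quadratic: f(w) = 0.5 * w^T A w with A = diag(1, 100).
lambdas = np.array([1.0, 100.0])
grad = lambda w: lambdas * w
w_star = np.zeros(2)
lam1, lamn = lambdas[0], lambdas[-1]

# GD with its optimal step size 2 / (lambda_1 + lambda_n).
w = np.array([1.0, 1.0])
alpha_gd = 2.0 / (lam1 + lamn)
for _ in range(100):
    w = w - alpha_gd * grad(w)
err_gd = np.linalg.norm(w - w_star)

# Momentum with the optimal (alpha, beta) derived above.
w, z = np.array([1.0, 1.0]), np.zeros(2)
alpha = (2.0 / (np.sqrt(lam1) + np.sqrt(lamn))) ** 2
beta = ((np.sqrt(lamn) - np.sqrt(lam1)) / (np.sqrt(lamn) + np.sqrt(lam1))) ** 2
for _ in range(100):
    z = beta * z + grad(w)
    w = w - alpha * z
err_mom = np.linalg.norm(w - w_star)

print(err_gd, err_mom)   # the momentum error is dramatically smaller
```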
Question
How slowly does GD converge?
Question
In GD, what’s the range of \(\alpha\)?
Question
In Momentum, what’s the range of \(\alpha,\beta\)?
Question
How much does Momentum accelerate?
Question
Momentum has the effect of increasing the range of allowed step sizes. Why? How much is the increase?
Table of Contents
GD and Momentum
More analysis of GD
More analysis of Momentum
Other
Question
Can the momentum method be modified to obtain even better results?
Question
Is there some descent method whose iterates \(\lbrace w^m \rbrace_m\) converge faster than momentum?
By giving a lower bound, Nesterov (2013) states that momentum is optimal in a certain very narrow and technical sense.
In GD, \[ \begin{aligned} w^1 &= w^0 - \alpha \nabla f(w^0), \cr w^2 &= w^1 - \alpha \nabla f(w^1) \cr &= w^0 - \alpha \nabla f(w^0) - \alpha \nabla f(w^1), \cr &\quad\vdots \end{aligned} \] so we can write GD as \[ \begin{aligned} w^{m+1} = w^0 - \alpha \sum_{i=0}^m \nabla f(w^i). \end{aligned} \]
Now we consider descent methods in the class of Linear First Order Methods: \[ \begin{aligned} w^{m+1} = w^0 - \sum_{i=0}^m \Gamma_i^m \nabla f(w^i) \end{aligned} \] for some diagonal \(n\times n\) matrices \(\Gamma_i^m.\)
For GD, \(\Gamma_i^m = \alpha I_n.\)
For momentum, \(\Gamma_i^m = \alpha \Bigl( \frac{1-\beta^{m+1-i}}{1-\beta} \Bigr) I_n.\)
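As a sanity check (my own sketch on a made-up diagonal quadratic), unrolling the momentum recursion reproduces exactly these \(\Gamma_i^m\) coefficients.

```python
import numpy as np

lambdas = np.array([1.0, 4.0])            # made-up diagonal quadratic, grad f(w) = lambdas * w
grad = lambda w: lambdas * w
alpha, beta, m = 0.1, 0.5, 10

# Run the momentum recursion and record the gradients along the way.
w, z = np.array([1.0, -2.0]), np.zeros(2)
grads = []
for _ in range(m + 1):                    # produces w^1, ..., w^{m+1}
    grads.append(grad(w))
    z = beta * z + grads[-1]
    w = w - alpha * z

# Unrolled form with Gamma_i^m = alpha * (1 - beta^{m+1-i}) / (1 - beta) * I.
w0 = np.array([1.0, -2.0])
w_unrolled = w0 - sum(alpha * (1 - beta ** (m + 1 - i)) / (1 - beta) * grads[i]
                      for i in range(m + 1))

print(np.allclose(w, w_unrolled))         # True
```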
This class of methods covers most of the popular algorithms for training neural networks, including ADAM and AdaGrad.
We will show a single function all these methods ultimately fail on.
Fix some \(\kappa\in (1,\infty).\) We consider the Convex Rosenbrock function: \[ \begin{aligned} f=f_n&:\mathbb R^n \longrightarrow \mathbb R\cr f(w) &= \frac{1}{2}(w_1-1)^2 + \frac{1}{2}\sum_{i=1}^{n-1}(w_i-w_{i+1})^2 + \frac{2}{\kappa-1} \left\lVert w \right\rVert_2^2. \end{aligned} \]
The problem \(f\) is convex quadratic.
Nesterov called it “the worst function in the world”.
Let \(\kappa_n\) be the condition number of the problem \(f.\) Then \(\lim_{n\rightarrow\infty}\kappa_n = \kappa.\)
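The following sketch uses my own derivation of the Hessian from the formula above (a tridiagonal matrix); it builds \(A = \nabla^2 f\) and shows the condition number approaching \(\kappa\) as \(n\) grows.

```python
import numpy as np

def convex_rosenbrock_hessian(n, kappa):
    """Hessian of the Convex Rosenbrock function above (my own derivation)."""
    A = np.zeros((n, n))
    for i in range(n):
        A[i, i] = 2.0 + 4.0 / (kappa - 1.0)
        if i > 0:
            A[i, i - 1] = A[i - 1, i] = -1.0
    A[n - 1, n - 1] = 1.0 + 4.0 / (kappa - 1.0)   # last coordinate has no (w_n - w_{n+1}) term
    return A

kappa = 10.0
for n in (5, 20, 100):
    A = convex_rosenbrock_hessian(n, kappa)
    print(n, np.linalg.cond(A))   # condition number approaches kappa = 10 as n grows
```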
The optimal solution to the problem \(f\) is \[ \begin{aligned} w^*=(w_1^*, \cdots , w_n^*), \qquad w_i^* = \biggl( \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} \biggr)^i. \end{aligned} \]
We observe the behavior of any linear first-order method on \(f,\) starting from \(w^0 = \mathbf{0}.\) Since \(\nabla f\) couples only neighboring coordinates and \(w^0=\mathbf 0,\) each iteration can make at most one additional coordinate nonzero; hence \(w^m_i = 0\) for all \(i > m.\)
Thus, \[ \begin{aligned} \left\lVert w^m-w^* \right\rVert_{\infty} &\geq \max_{i>m} \lbrace \left\lvert w_i^* \right\rvert \rbrace \cr &= \biggl( \frac{\sqrt{\kappa}-1}{\sqrt \kappa +1} \biggr)^{m+1} \cr &= \biggl( \frac{\sqrt{\kappa}-1}{\sqrt \kappa +1} \biggr)^{m} \left\lVert w^0-w^* \right\rVert_{\infty}. \end{aligned} \] This shows that the convergence rate of \(\lbrace w^m \rbrace_m\) (under the sup norm) is at least \(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}.\)
Recall that \(\lim_{n\rightarrow\infty}\kappa_n = \kappa,\) where \(\kappa_n\) is the condition number of the problem \(f.\)
Hence, among linear first-order methods, momentum already achieves the best possible asymptotic convergence rate.
The Colorization Problem is of the form \[ \begin{aligned} \text{minimize} \quad\frac{1}{2}\sum_{w_i\in D}(w_i-1)^2 + \frac{1}{2} \sum_{w_i\sim w_j}(w_i- w_j)^2, \end{aligned} \] where \(D\) is the set of nodes we want to color with \(1.\)
The Convex Rosenbrock function is a special case of the above.