Gradient Descent and Momentum

ChoCho

NCU Math

2023-05-10

MODIFIED

  • 2023-05-11

    • Fixed some mathematical errors and added some details.
  • 2023-05-26

    • Fixed the formatting of the questions and added illustrations.

Abstract

  • I would like to introduce the article at https://distill.pub/2017/momentum/.

  • This article describes some of the properties and disadvantages of gradient descent.

  • The main part of this article introduces how momentum can be used to overcome these disadvantages.

GD and Momentum

Table of Contents

  • GD and Momentum

  • More analysis of GD

  • More analysis of Momentum

  • Other

Recall GD

  • Let \(f:\mathbb R^n \longrightarrow \mathbb R\) be a differentiable function.

  • Given a random point \(w^0 \in \mathbb R^n.\)

  • Given any \(\alpha\in (0,\infty),\) which we call the step size.

  • We iterate \(w^{k}\) by \[ \begin{aligned} w^{k+1} = w^k - \alpha \nabla f(w^k). \end{aligned} \]


    • Note that for each fixed \(w\in \mathbb R^n,\) \[ \begin{aligned} f(w+h) = f(w)+ \nabla f(w)^{\mathtt T}\cdot h + o(\left\lVert h \right\rVert) \end{aligned} \] for \(\left\lVert h \right\rVert \rightarrow 0.\)

      • We choose \(h=-\alpha \nabla f(w).\)

      • Since \(\color{blue}{\nabla f(w)^{\mathtt{T}}\cdot h = -\alpha \left\lVert \nabla f(w) \right\rVert^2 < 0}\) (assuming \(\nabla f(w)\neq \mathbf 0\)), we have \[ \begin{aligned} f\bigl( \underbrace{w-\alpha \nabla f(w)}_{w^{k+1}} \bigr) = f(w+h) < f(\underbrace{w}_{w^k}) \end{aligned} \]
        for \(\alpha>0\) small enough.

        Question

        What is the range of \(\alpha\)?
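A minimal Python sketch of the update rule above; the test function, starting point, step size, and iteration count are illustrative choices.

```python
import numpy as np

def gradient_descent(grad_f, w0, alpha=0.1, num_iters=100):
    """Iterate w^{k+1} = w^k - alpha * grad_f(w^k)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(num_iters):
        w = w - alpha * grad_f(w)
    return w

# Example: f(w) = ||w||^2 / 2, whose gradient is w and whose minimizer is 0.
w_approx = gradient_descent(lambda w: w, w0=np.array([3.0, -2.0]), alpha=0.1)
print(w_approx)  # close to [0, 0]
```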

Disadvantages of GD

  • GD often gets stuck in local minima.

    • We do not address this issue here.
  • GD can be very slow when the function looks like the case illustrated below.

https://paperswithcode.com/method/sgd-with-momentum

  • How can we fix this?

    • We may need something like momentum.

Momentum

  • Momentum was proposed by Polyak (1964).

  • The idea of GD is steepest descent, but we do not need every step to be the steepest.

  • We may consider how a ball descends to its lowest point.

https://www.youtube.com/watch?v=oiXWerMTcNE

Momentum in math

https://youtu.be/xki61j7z-30
  • Given a random point \(w^0 \in \mathbb R^n.\)

  • Set \(z^0=\mathbf{0}\in \mathbb R^n.\)

  • Given any \(\alpha,\beta\in (0,\infty).\)

  • We give gradient descent a short-term memory:

    We iterate \(z^k,w^k\) by \[ \begin{aligned} z^{k+1} &= \beta z^k + \nabla f(w^k), \cr w^{k+1} &= w^k - \alpha z^{k+1}. \end{aligned} \]

  • Note that \[ \begin{aligned} \nabla f(w^k)^{\mathtt{T}} \cdot (-1)\bigl( \underbrace{\beta z^k + \nabla f(w^k)}_{z^{k+1}} \bigr) < 0 \end{aligned} \] for \(\beta\) small enough (assuming \(\nabla f(w^k)\neq \mathbf 0\)).

  • This shows that \[ \begin{aligned} f\bigl( \underbrace{w^k-\alpha z^{k+1}}_{w^{k+1}} \bigr) < f(w^k) \end{aligned} \]
    for \(\alpha,\beta>0\) small enough.

    Question

    What is the range of \(\alpha,\beta\)?
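A minimal sketch of the momentum iteration above, under the same illustrative assumptions as the GD sketch (the test function, starting point, \(\alpha,\) and \(\beta\) are made-up choices):

```python
import numpy as np

def momentum(grad_f, w0, alpha=0.1, beta=0.9, num_iters=100):
    """Iterate z^{k+1} = beta*z^k + grad_f(w^k),  w^{k+1} = w^k - alpha*z^{k+1}."""
    w = np.asarray(w0, dtype=float)
    z = np.zeros_like(w)            # z^0 = 0
    for _ in range(num_iters):
        z = beta * z + grad_f(w)    # short-term memory of past gradients
        w = w - alpha * z
    return w

w_approx = momentum(lambda w: w, w0=np.array([3.0, -2.0]), alpha=0.1, beta=0.9)
print(w_approx)  # close to [0, 0]
```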

Momentum’s improvements

  • Momentum can help GD escape local minima. We do not discuss this here.

https://gbhat.com/machine_learning/gradient_descent_with_momentum.html

Momentum’s improvements

  • Momentum can speed up GD in some cases.

https://paperswithcode.com/method/sgd-with-momentum

  • But this picture is too simplistic and leaves many things unexplained. We need more mathematical analysis.

(Interactive example: gradient descent with momentum, step-size α = 0.02 and momentum β = 0.99.)

  • By looking at this example, we naturally have the following questions:

    Question

    How slowly does GD converge?

    To measure “how slowly”, we need to introduce the convergence rate.

    Question

    In GD, what’s the range of \(\alpha\)?

    Question

    In Momentum, what’s the range of \(\alpha,\beta\)?

    Question

    How much does Momentum accelerate?

    Question

    Momentum has the effect of increasing the range of allowed step sizes. Why? How much is the increase?

Convergence rate

  • A sequence \(\lbrace x_k \rbrace_k\) that converges to \(x^*\) (in a norm \(\left\lVert \cdot \right\rVert\)) is said to have convergence rate \(\mu\) (order \(1\)) if \[ \begin{aligned} \lim_{n\rightarrow\infty} \frac{\left\lVert x_{n+1} - x^* \right\rVert}{\left\lVert x_{n} -x^* \right\rVert} = \mu. \end{aligned} \]

  • For example:

    • If \(x_k = (1/3)^kc_0\) with \(c_0\neq 0,\) then the convergence rate of \(\lbrace x_k \rbrace_k\) is \(1/3.\)

    • If \(x_k = (1/2)^kc_1 + (-1/3)^k c_2\) with \(c_1\neq 0,\) then the convergence rate of \(\lbrace x_k \rbrace_k\) is \(1/2.\)

  • Roughly speaking, if \(\lbrace x_k \rbrace,\lbrace y_k \rbrace\) have convergence rates \(\mu_1,\mu_2\) respectively and \(0<\mu_1<\mu_2<1,\) then \(\lbrace x_k \rbrace_k\) converges faster than \(\lbrace y_k \rbrace_k.\)
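A quick numerical illustration of the second example above, assuming for concreteness \(c_1 = c_2 = 1\): the ratio of successive errors approaches the slower factor \(1/2.\)

```python
import numpy as np

k = np.arange(60)
x = (1 / 2) ** k + (-1 / 3) ** k     # x_k with c_1 = c_2 = 1, limit x* = 0
ratios = np.abs(x[1:] / x[:-1])      # |x_{n+1} - x*| / |x_n - x*|
print(ratios[-1])                    # approximately 0.5
```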

More analysis of GD

Table of Contents

  • GD and Momentum

  • More analysis of GD

  • More analysis of Momentum

  • Other

Settings

  • From now on, we consider the convex quadratic:

    Assume \(A\in\mathbb R^{n\times n}\) is symmetric and positive definite (hence invertible), and \(b\in\mathbb R^n.\) Let \[ \begin{aligned} f(w) = \frac{1}{2} w^{\mathtt{T}} A w - b^{\mathtt{T}}w, \quad w \in \mathbb R^n. \end{aligned} \]

  • \(\nabla f(w) = Aw-b.\)

  • \(w^* := \arg \min\limits_{w\in \mathbb R^n} \,f(w)= A^{-1}b .\)
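A small sanity check of this setting (the random \(A\) and \(b\) below are illustrative): the gradient vanishes at \(w^* = A^{-1}b.\)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)              # symmetric positive definite
b = rng.standard_normal(n)

grad_f = lambda w: A @ w - b             # gradient of (1/2) w^T A w - b^T w
w_star = np.linalg.solve(A, b)           # w* = A^{-1} b
print(np.linalg.norm(grad_f(w_star)))    # essentially 0
```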

GD

  • We can observe GD via the normal form of the ellipse, i.e., in the eigenbasis of \(A.\)
  • \(A\) has an eigenvalue decomposition: There exist \(\lambda_1,\cdots,\lambda_n\in (0,\infty)\) and orthonormal \(q_1,\cdots,q_n\in \mathbb R^n\) such that \[ \begin{aligned} A q_i = \lambda_i q_i,\quad \forall i =1,\cdots,n. \end{aligned} \] Or, \[ \begin{aligned} A = Q \,\mathtt{diag}(\lambda_1,\lambda_2,\cdots,\lambda_n)\, Q^{\mathtt{T}}, \quad Q = \begin{bmatrix} q_1 & \cdots & q_n \end{bmatrix}. \end{aligned} \]

  • WLOG, we may assume \(\lambda_1 \leq \cdots \leq \lambda_n.\)

  • If we change basis via \(x=Q^{\mathtt{T}}(w-w^*),\) then the iterations break apart: For each \(i = 1,\cdots, n,\) \[ \begin{aligned} x_i^{k+1} &= x_i^k - \alpha \lambda_i x_i^k \cr &= (1 - \alpha \lambda_i) x_i^k = \cdots =(1-\alpha \lambda_i)^{k+1} x_i^0, \quad k\in \mathbb N. \end{aligned} \]

Decomposing the Error of GD

  • Recall that \(x_i^{k} = (1- \alpha\lambda_i)^k x_i^0.\) So we have \[ \begin{aligned} w^k-w^* = Qx^k &=\sum_{i=1}^n (1- \alpha\lambda_i)^k x_i^0 q_i. \end{aligned} \]

  • Hence, \[ \begin{aligned} f(w^k) - f(w^*) &= \frac{1}{2}\sum_{i=1}^n (1-\alpha \lambda_i)^{2k} \lambda_i (x_i^0)^2 . \end{aligned} \]

  • The decomposition above gives us immediate guidance on how to set the step-size \(\alpha.\)

  • For the convergence of \(\lbrace w^k \rbrace_k\) and \(\lbrace f(w^k) \rbrace_k,\) a workable step-size \(\alpha\) must satisfy, in each \(i\)-th component, \[ \begin{aligned} \left\lvert 1-\alpha \lambda_i\right\rvert < 1. \end{aligned} \]

  • The inequality above is equivalent to \[ \begin{aligned} \color{red}{{0 < \alpha \lambda_i < 2.}} \end{aligned} \]
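A sketch checking the decomposition above on a small random quadratic (\(A,\) \(b,\) \(w^0,\) and \(\alpha\) are illustrative choices): the GD iterates agree with the closed form \(w^k-w^* = Q\,\mathtt{diag}\bigl((1-\alpha\lambda_i)^k\bigr)\,Q^{\mathtt{T}}(w^0-w^*).\)

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)              # symmetric positive definite
b = rng.standard_normal(n)
lam, Q = np.linalg.eigh(A)               # A = Q diag(lam) Q^T, lam ascending
w_star = np.linalg.solve(A, b)

alpha = 1.0 / lam.max()                  # safely inside 0 < alpha*lam_i < 2
w = rng.standard_normal(n)               # w^0
x0 = Q.T @ (w - w_star)                  # coordinates in the eigenbasis

for k in range(1, 21):
    w = w - alpha * (A @ w - b)          # one GD step
    closed_form = Q @ ((1 - alpha * lam) ** k * x0)
    assert np.allclose(w - w_star, closed_form)
print("closed form matches the GD iterates")
```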

Choosing a step-size \(\alpha\) (GD)

  • The convergence rate of \(w^k\) to \(w^*\) is \[ \begin{aligned} \text{rate}(\alpha) &= \max_{i=1,\cdots,n} \left\lvert 1-\alpha \lambda_i \right\rvert \cr &= \max \bigl\lbrace \left\lvert 1-\alpha \lambda_1 \right\rvert , \left\lvert 1-\alpha \lambda_n \right\rvert \bigr\rbrace. \end{aligned} \]

  • Note that \[ \begin{aligned} \text{rate}(\alpha) < 1 &\iff \left\lvert 1-\alpha \lambda_1 \right\rvert < 1, \, \left\lvert 1-\alpha \lambda_n \right\rvert < 1 \cr &\iff 0 < \alpha < 2 / \lambda_n \end{aligned} \] and \[ \begin{aligned} \arg\min_{\alpha}\, \text{rate}(\alpha) &= \frac{2}{\lambda_1+\lambda_n}, \cr \min_{\alpha} \text{rate} (\alpha) &= {\color{red}{\frac{\lambda_n/\lambda_1-1}{\lambda_n/\lambda_1+1}}}. \end{aligned} \]

  • Notice that the ratio \(\kappa := \lambda_n/\lambda_1\) determines the convergence rate of the problem.

    • This is the condition number of \(A.\)

    • The larger \(\kappa\) is, the closer the rate is to \(1,\) and the slower the convergence.
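A small numerical check of these formulas, with the illustrative values \(\lambda_1 = 0.01\) and \(\lambda_n = 1\): \(\text{rate}(\alpha)\) is minimized at \(\alpha = 2/(\lambda_1+\lambda_n),\) with minimum value \((\kappa-1)/(\kappa+1).\)

```python
import numpy as np

lam1, lamn = 0.01, 1.0                      # smallest / largest eigenvalues
kappa = lamn / lam1

alphas = np.linspace(1e-4, 2 / lamn, 100000, endpoint=False)   # allowed range
rates = np.maximum(np.abs(1 - alphas * lam1), np.abs(1 - alphas * lamn))

print(alphas[np.argmin(rates)], 2 / (lam1 + lamn))      # numerically close
print(rates.min(), (kappa - 1) / (kappa + 1))           # both about 0.9802
```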

(Interactive figure: optimization can be seen as a combination of several component problems, shown here with eigenvalues λ1 = 0.01, λ2 = 0.1, and λ3 = 1; sliders for the step-size and the optimal step-size are omitted.)

More analysis of Momentum

Table of Contents

  • GD and Momentum

  • More analysis of GD

  • More analysis of Momentum

  • Other

Momentum

  • Recall the momentum:

    • Given a random point \(w^0 \in \mathbb R^n.\)

    • Set \(z^0=\mathbf{0}\in \mathbb R^n.\)

    • Given any \(\alpha,\beta\in (0,\infty).\)

    • We iterate \(z^k,w^k\) by \[ \begin{aligned} z^{k+1} &= \beta z^k + \nabla f(w^k), \cr w^{k+1} &= w^k - \alpha z^{k+1}. \end{aligned} \tag{1}\]
  • Similarly, with the change of basis \[ \begin{aligned} x^k = Q^{\mathtt{T}}(w^k-w^*), \quad y^k = Q^{\mathtt{T}} z^k, \end{aligned} \] the iterations break apart again. Hence, (1) becomes \[ \begin{aligned} y_i^{k+1} &= \beta y_i^k + \lambda_i x_i^k, \cr x_i^{k+1} &= x_i^k - \alpha y_i^{k+1} \end{aligned} \tag{2}\] in each \(i\)-th component.

  • For each \(i,\) we rewrite (2) as \[ \begin{aligned} \begin{bmatrix} y_i^k \\ x_i^k \\\end{bmatrix} = R_i^k \begin{bmatrix} y_i^0 \\ x_i^0 \\\end{bmatrix}, \quad R_i = \begin{bmatrix} \beta & \lambda_i \\ -\alpha \beta & 1-\alpha \lambda_i \\\end{bmatrix}. \end{aligned} \]

To get \(R_i^k\)

  • For convenience, we omit \(i\) here.

  • Let \(\sigma_1,\sigma_2\) be the two eigenvalues of \(R= R_i = \begin{bmatrix} \beta & \lambda \\ -\alpha \beta & 1-\alpha \lambda \\\end{bmatrix}.\)

  • By Williams (1992), \[ \begin{aligned} R^k = \begin{cases} \sigma_1^k R(1) - \sigma_2^k R(2) , & \sigma_1 \neq \sigma_2 ; \cr \sigma_1^k \bigl( k R / \sigma_1 - (k-1)I_2 \bigr) , & \sigma_1 = \sigma_2, \end{cases} \qquad R(1) = \frac{R - \sigma_2 I_2}{\sigma_1 - \sigma_2}, \quad R(2) = \frac{R - \sigma_1 I_2}{\sigma_1 - \sigma_2}. \end{aligned} \] Thus, the convergence rate of \(\lbrace x_i^k \rbrace_k\) is \(\max\lbrace \left\lvert \sigma_1 \right\rvert , \left\lvert \sigma_2 \right\rvert \rbrace.\)

  • For the values of \(\sigma_1,\sigma_2\):

    • By a direct calculation, we have \[ \begin{aligned} \sigma_1 &= \frac{1}{2} \bigl( 1-\alpha \lambda + \beta + \sqrt{( -\alpha \lambda + \beta + 1 )^2 - 4\beta } \bigr), \cr \sigma_2 &= \frac{1}{2} \bigl( 1-\alpha \lambda + \beta - \sqrt{( -\alpha \lambda + \beta + 1 )^2 - 4\beta } \bigr). \end{aligned} \]

    • When \(( -\alpha \lambda + \beta + 1 )^2 - 4\beta < 0,\) we have \(\sigma_1,\sigma_2\in \mathbb C\backslash \mathbb R\) and \[ \begin{aligned} \left\lvert \sigma_1 \right\rvert = \left\lvert \sigma_2 \right\rvert = \frac{1}{2}\sqrt{ (1-\alpha\lambda+\beta)^2 + \left\lvert ( -\alpha \lambda + \beta + 1 )^2 - 4\beta \right\rvert } = \sqrt \beta. \end{aligned} \]

      • In this case, the convergence rate of \(\lbrace x_i^k \rbrace_k\) is independent of \(\alpha,\lambda.\)
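A numerical check of the complex-root case above, with illustrative values of \(\alpha,\beta,\lambda\) chosen so that \(( -\alpha \lambda + \beta + 1 )^2 - 4\beta < 0\): both eigenvalues of \(R\) have modulus \(\sqrt\beta,\) independently of \(\alpha\) and \(\lambda.\)

```python
import numpy as np

alpha, beta, lam = 0.5, 0.9, 1.0           # (1 - alpha*lam + beta)^2 - 4*beta < 0
R = np.array([[beta,          lam],
              [-alpha * beta, 1 - alpha * lam]])
sigma = np.linalg.eigvals(R)               # a complex-conjugate pair

print(np.abs(sigma))                       # both equal sqrt(beta) ~ 0.9487
print(np.sqrt(beta))
```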

Explanation of the plot of convergence rate in each \(i\) (Momentum)

A plot of convergence rate in each \(i\) (Momentum)

  • Here we show the case \(\lambda_i = 1.\)
(Figure: a plot of \(\max\lbrace \left\lvert \sigma_1 \right\rvert , \left\lvert \sigma_2 \right\rvert \rbrace\) reveals distinct regions, each with its own style of convergence.)

  • Combining the five cases above, we can conclude that the range of workable step-sizes works out to be \[ \begin{aligned} {\color{red}{0<\alpha \lambda_i < 2+2\beta , \quad 0 \leq \beta < 1.}} \end{aligned} \]
  • The range of \(\alpha\) gets bigger when we add momentum: \[ \begin{aligned} 0<\alpha \lambda_i < 2 \stackrel{\text{with momentum}}{\Longrightarrow} 0<\alpha\lambda_i<2+2\beta. \end{aligned} \]

  • What is the best choice of \(\alpha\) and \(\beta\)?

Optimal parameters

  • Recall that in GD without momentum:

    • The convergence rate of \(w^k_i\) to \(w^*_i\) (\(i\)-th component) is \(\left\lvert 1-\alpha \lambda_i \right\rvert;\)

    • The convergence rate of \(w^k\) to \(w^*\) is \[ \begin{aligned} \text{rate}(\alpha) = \max \bigl\lbrace \left\lvert 1-\alpha \lambda_1 \right\rvert , \left\lvert 1-\alpha \lambda_n \right\rvert \bigr\rbrace; \end{aligned} \]

    • \[ \begin{aligned} \arg \min_{\alpha} \, \text{rate}(\alpha) &= \frac{2}{\lambda_1+\lambda_n}, \cr \min_{\alpha} \text{rate} (\alpha) &= \frac{\lambda_n/\lambda_1-1}{\lambda_n/\lambda_1+1}. \end{aligned} \]

  • With Momentum?

  • We first analyze the \(i\)-th component with \(\alpha\) fixed.

Optimal parameters (\(i\)-th component)

  • We consider the \(i\)-th component with \(\alpha\) fixed.

  • Recall that \[ \begin{aligned} y_i^{k+1} &= \beta y_i^k + \lambda_i x_i^k, \cr x_i^{k+1} &= x_i^k - \alpha y_i^{k+1}, \end{aligned} \] in each \(i\)-th component.

  • Or, \[ \begin{aligned} \begin{bmatrix} y_i^k \\ x_i^k \\\end{bmatrix} = R_i^k \begin{bmatrix} y_i^0 \\ x_i^0 \\\end{bmatrix}, \qquad R_i = \begin{bmatrix} \beta & \lambda_i \\ -\alpha \beta & 1-\alpha \lambda_i \\\end{bmatrix}. \end{aligned} \]

  • The convergence rate of \(\lbrace x_i^k \rbrace_k\) is \(\max\lbrace \left\lvert \sigma_1(i) \right\rvert , \left\lvert \sigma_2(i) \right\rvert \rbrace,\) where \(\sigma_1(i),\sigma_2(i)\) are two eigenvalues of \(R_i.\)

  • Here we can take a physical point of view.

    • It is a discretization of a damped harmonic oscillator.

Figure 1: over-damping and critical damping (gifs from https://highscope.ch.ntu.edu.tw/wordpress/?p=59534 and https://youtu.be/99ZE2RGwqSM).
  • The critical case happens when \(\sigma_1(i)=\sigma_2(i).\)

    • That is, \(\beta=(1-\sqrt{\alpha \lambda_i})^2.\)
  • The minimum convergence rate is \(\color{red}{1-\sqrt{\alpha \lambda_i}}.\)

  • The minimum convergence rate of \(w^k_i\) to \(w^*_i\) (\(i\)-th component) varies as follows: \[ \begin{aligned} 1-\alpha\lambda_i \stackrel{\text{with momentum}}{\Longrightarrow} 1-\sqrt{\alpha \lambda_i}. \end{aligned} \] But this only applies to the error in the \(i\)-th component with \(\alpha\) fixed.
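A sketch of the per-component claim above for a single eigenvalue (\(\alpha\) and \(\lambda\) are illustrative values): at the critical \(\beta=(1-\sqrt{\alpha\lambda})^2\) the spectral radius of \(R\) drops to \(1-\sqrt{\alpha\lambda},\) compared with \(1-\alpha\lambda\) for plain GD (\(\beta=0\)).

```python
import numpy as np

alpha, lam = 0.1, 1.0

def spectral_radius(beta):
    R = np.array([[beta,          lam],
                  [-alpha * beta, 1 - alpha * lam]])
    return np.abs(np.linalg.eigvals(R)).max()

beta_crit = (1 - np.sqrt(alpha * lam)) ** 2
print(spectral_radius(beta_crit), 1 - np.sqrt(alpha * lam))   # both ~0.6838
print(spectral_radius(0.0), 1 - alpha * lam)                  # plain GD: 0.9
```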

Optimal parameters

  • To get a global convergence rate, we must optimize over both \(\alpha,\beta.\)

  • The convergence rate of \(w^k\) to \(w^*\) is \[ \begin{aligned} \text{rate}(\alpha,\beta) = \max_{i=1,\cdots,n} \Bigl\lbrace \max\bigl\lbrace \big\lvert \sigma_1(i) \big\rvert , \left\lvert \sigma_2(i) \right\rvert \bigr\rbrace \Bigr\rbrace. \end{aligned} \]

  • The solutions of \(\arg\min\limits_{\alpha,\beta}\text{ rate}(\alpha,\beta)\) are \[ \begin{aligned} \alpha = \biggl( \frac{2}{\sqrt{\lambda_1}+ \sqrt{\lambda_n}} \biggr)^2, \quad \beta = \biggl( \frac{\sqrt{\lambda_n}-\sqrt{\lambda_1}}{\sqrt{\lambda_n}+ \sqrt{\lambda_1}} \biggr)^2. \end{aligned} \] Plug this into \(\text{rate}(\alpha,\beta),\) and we get \[ \begin{aligned} {\color{red}{\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}.}} \end{aligned} \]

  • So we get how much momentum accelerates the convergence rate: \[ \begin{aligned} \frac{\kappa-1}{\kappa +1}\stackrel{\text{with momentum}}{\Longrightarrow} \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}. \end{aligned} \]

  • Let’s look at the following example to see that momentum really does accelerate convergence:

(Interactive figure: the same decomposition with momentum, with eigenvalues λ1 = 0.01, λ2 = 0.1, and λ3 = 1, plotting \(f(w^k)-f(w^*).\) The decrease is no longer monotonic, but it is significantly faster. Note that the optimal parameters do not necessarily give the fastest decrease at every step, only the fastest asymptotic convergence rate. Sliders for the step-size α and momentum β are omitted.)
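As a rough numerical companion to the example above, here is a sketch comparing optimally tuned GD and momentum on a small ill-conditioned quadratic; the diagonal \(A\) with eigenvalues \(0.01, 0.1, 1\) mirrors the figure, and the rest of the setup is illustrative.

```python
import numpy as np

lam = np.array([0.01, 0.1, 1.0])
A, b = np.diag(lam), np.zeros(3)            # f(w) = (1/2) w^T A w, minimizer w* = 0
w0 = np.ones(3)

def run(alpha, beta, iters=200):
    w, z = w0.copy(), np.zeros(3)
    for _ in range(iters):
        z = beta * z + (A @ w - b)
        w = w - alpha * z
    return np.linalg.norm(w)                # distance to w* = 0

l1, ln = lam.min(), lam.max()
alpha_gd = 2 / (l1 + ln)                                   # optimal GD step-size
alpha_m = (2 / (np.sqrt(l1) + np.sqrt(ln))) ** 2           # optimal momentum step-size
beta_m = ((np.sqrt(ln) - np.sqrt(l1)) / (np.sqrt(ln) + np.sqrt(l1))) ** 2

print("GD:      ", run(alpha_gd, 0.0))
print("Momentum:", run(alpha_m, beta_m))    # many orders of magnitude smaller
```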

Conclusions

Question

How slowly does GD converge?

Ans:
With the optimal step-size, the convergence rate of \(w^k\) to \(w^*\) is \(\dfrac{\kappa-1}{\kappa+1},\) which approaches \(1\) as \(\kappa\) grows.

Question

In GD, what’s the range of \(\alpha\)?

Ans:
\(0< \alpha \lambda_i<2\) for each \(i.\)

Question

In Momentum, what’s the range of \(\alpha,\beta\)?

Ans:
\(0<\alpha \lambda_i < 2+2\beta , \quad 0 \leq \beta < 1\) for each \(i.\)

Question

How much does Momentum accelerate?

Ans:
\[ \begin{aligned} \frac{\kappa-1}{\kappa +1}\stackrel{\text{with momentum}}{\Longrightarrow} \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}. \end{aligned} \]

Question

Momentum has the effect of increasing the range of allowed step sizes. Why? How much is the increase?

Ans:
\[ \begin{aligned} 0<\alpha \lambda_i < 2 \stackrel{\text{with momentum}}{\Longrightarrow} 0<\alpha\lambda_i<2+2\beta. \end{aligned} \]

Other

Table of Contents

  • GD and Momentum

  • More analysis of GD

  • More analysis of Momentum

  • Other

The Limits of Descent

Question

Can the momentum method be modified to get better results?

Question

Is there some kind of descent method that allows our \(\lbrace w^m \rbrace_m\) to converge faster than momentum?

  • By giving a lower bound, Nesterov (2013) states that momentum is optimal in a certain very narrow and technical sense.

  • In GD, \[ \begin{aligned} w^1 &= w^0 - \alpha \nabla f(w^0), \cr w^2 &= w^1 - \alpha \nabla f(w^1) \cr &= w^0 - \alpha \nabla f(w^0) - \alpha \nabla f(w^1), \cr &\quad\vdots \end{aligned} \] we can write GD as \[ \begin{aligned} w^{m+1} = w^0 - \alpha \sum_{i=0}^m \nabla f(w^i). \end{aligned} \]

The Limits of Descent

  • Similarly, for momentum, \[ \begin{aligned} w^{m+1} = w^0 - \alpha \sum_{i=0}^m \frac{1-\beta^{m+1-i}}{1-\beta} \nabla f(w^i). \end{aligned} \]
  • Now we consider descent methods of the following form, called Linear First Order Methods: \[ \begin{aligned} w^{m+1} = w^0 - \sum_{i=0}^m \Gamma_i^m \nabla f(w^i) \end{aligned} \] for some diagonal \(n\) by \(n\) matrices \(\Gamma_i^m.\)

    • For GD, \(\Gamma_i^m = \alpha I_n.\)

    • For momentum, \(\Gamma_i^m = \alpha \Bigl( \frac{1-\beta^{m+1-i}}{1-\beta} \Bigr) I_n.\)

    • This class of methods covers most of the popular algorithms for training neural networks, including ADAM and AdaGrad.
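A quick check of the two accumulated-gradient forms above on the quadratic setting (\(A,\) \(b,\) \(w^0,\) \(\alpha,\) \(\beta\) below are illustrative): unrolling the momentum recursion reproduces the coefficients \(\Gamma_i^m = \alpha \Bigl( \frac{1-\beta^{m+1-i}}{1-\beta} \Bigr) I_n.\)

```python
import numpy as np

rng = np.random.default_rng(2)
n, iters = 4, 30
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)
grad = lambda w: A @ w - b

alpha, beta = 0.02, 0.5
w0 = rng.standard_normal(n)

# Standard momentum iteration, recording the gradients it sees.
w, z, grads = w0.copy(), np.zeros(n), []
for _ in range(iters):
    g = grad(w)
    grads.append(g)
    z = beta * z + g
    w = w - alpha * z

# Accumulated form: w^{m+1} = w^0 - sum_i Gamma_i^m * grad f(w^i).
m = iters - 1
w_acc = w0 - sum(alpha * (1 - beta ** (m + 1 - i)) / (1 - beta) * grads[i]
                 for i in range(iters))
print(np.allclose(w, w_acc))   # True
```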

  • We will show a single function that all these methods ultimately fail on.

  • Fix some \(\kappa\in (1,\infty).\) We consider the Convex Rosenbrock function: \[ \begin{aligned} f=f_n&:\mathbb R^n \longrightarrow \mathbb R\cr f(w) &= \frac{1}{2}(w_1-1)^2 + \frac{1}{2}\sum_{i=1}^{n-1}(w_i-w_{i+1})^2 + \frac{2}{\kappa-1} \left\lVert w \right\rVert_2^2. \end{aligned} \]

    • The problem \(f\) is convex quadratic.

    • Nesterov called it “the worst function in the world”.

    • Let \(\kappa_n\) be the condition number of the problem \(f.\) Then \(\lim_{n\rightarrow\infty}\kappa_n = \kappa.\)

    • The optimal solution to the problem \(f\) is \[ \begin{aligned} w^*=(w_1^*, \cdots , w_n^*), \qquad w_i^* = \biggl( \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} \biggr)^i. \end{aligned} \]

      • Without the regularization term \(\frac{2}{\kappa-1}\left\lVert w \right\rVert_2^2,\) the minimizer is \(\mathbf{1}.\)
  • We observe the behavior of any linear first-order algorithm on \(f\), starting from \(w^0 = \mathbf{0}.\)

(Interactive figure: the first 50 iterates of momentum on the Convex Rosenbrock function for n = 25, showing the error and the weights; the behavior here is similar to that of any Linear First Order Method. The triangle of zero entries is a “dead zone” of our iterates: those coordinates are always 0, no matter what the parameters. The remaining expanding region is the “light cone” of our iterates’ influence; momentum does very well here with the optimal parameters. Sliders for the step-size α and momentum β are omitted.)
  • Since \[ \begin{aligned} \nabla f(w)_i = -(w_{i-1}-w_i) + (w_i-w_{i+1}) + \frac{4}{\kappa-1} w_i, \quad 1<i<n, \end{aligned} \] by induction, for any linear first order algorithm, \[ \begin{aligned} w^0 &= ( && 0, &&0, &&0, &&\cdots, &&0, &&\quad 0,\cdots,0 &&) ,\cr w^1 &= ( && w_1^1, &&0, &&0, &&\cdots, &&0, &&\quad 0,\cdots,0 &&) , \cr w^2 &= ( && w_1^2, &&w_2^2, &&0, &&\cdots, &&0, &&\quad 0,\cdots,0 &&) , \cr && \vdots \cr w^m &= ( && w_1^m, &&w_2^m, &&w_3^m, &&\cdots, &&w_m^m, &&\quad\underbrace{0,\cdots,0}_{\text{index}>m} &&) . \end{aligned} \]
  • Thus, \[ \begin{aligned} \left\lVert w^m-w^* \right\rVert_{\infty} &\geq \max_{i>m} \lbrace \left\lvert w_i^* \right\rvert \rbrace \cr &= \biggl( \frac{\sqrt{\kappa}-1}{\sqrt \kappa +1} \biggr)^{m+1} \cr &= \biggl( \frac{\sqrt{\kappa}-1}{\sqrt \kappa +1} \biggr)^{m} \left\lVert w^0-w^* \right\rVert_{\infty}. \end{aligned} \] This shows that the convergence rate of \(\lbrace w^m \rbrace_m\) (under the sup norm) is greater than or equal to \(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}.\)

  • Recall that \(\lim_{n\rightarrow\infty}\kappa_n = \kappa,\) where \(\kappa_n\) is the condition number of the problem \(f.\)

  • Hence, momentum (with the optimal parameters) already attains the best possible asymptotic convergence rate among Linear First Order Methods.
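As a sanity check of the “dead zone” argument above, here is a sketch running momentum (one Linear First Order Method) on the Convex Rosenbrock function from \(w^0=\mathbf 0\): after \(m\) steps, every coordinate with index greater than \(m\) is still exactly zero (\(n,\kappa,\alpha,\beta\) are illustrative choices).

```python
import numpy as np

n, kappa = 25, 100.0
alpha, beta = 0.01, 0.9

def grad(w):
    """Gradient of the Convex Rosenbrock function above (0-based indexing)."""
    c = 4.0 / (kappa - 1.0)
    diff = np.diff(w)                            # w_{i+1} - w_i, length n-1
    d = np.empty(n)
    d[0] = (w[0] - 1.0) - diff[0] + c * w[0]
    d[1:-1] = diff[:-1] - diff[1:] + c * w[1:-1]
    d[-1] = diff[-1] + c * w[-1]
    return d

w, z = np.zeros(n), np.zeros(n)
for m in range(1, 11):
    z = beta * z + grad(w)
    w = w - alpha * z
    assert np.all(w[m:] == 0.0)                  # dead zone: indices > m stay 0
print("dead zone confirmed for the first 10 iterates")
```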

Example: The Colorization Problem

  • The Colorization Problem is of the form \[ \begin{aligned} \text{minimize} \quad\frac{1}{2}\sum_{w_i\in D}(w_i-1)^2 + \frac{1}{2} \sum_{w_i\sim w_j}(w_i- w_j)^2, \end{aligned} \] where \(D\) is the set of nodes we want to color with \(1,\) and \(w_i\sim w_j\) ranges over pairs of neighboring nodes.

  • The Convex Rosenbrock function is a special case of the above.

https://algoritmaonline.com/image-colorization/
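A small sketch of the colorization objective above on a path graph (the graph, the set \(D,\) and the parameters below are illustrative): minimizing it with momentum spreads the color \(1\) from the pinned node along the whole path. Adding the regularization term \(\frac{2}{\kappa-1}\left\lVert w \right\rVert_2^2\) to this path-graph instance gives the Convex Rosenbrock function.

```python
import numpy as np

n = 10
D = [0]                                      # nodes pinned toward the value 1
edges = [(i, i + 1) for i in range(n - 1)]   # path graph: w_i ~ w_{i+1}

def grad(w):
    """Gradient of (1/2) sum_{i in D} (w_i-1)^2 + (1/2) sum_{i~j} (w_i-w_j)^2."""
    g = np.zeros(n)
    for i in D:
        g[i] += w[i] - 1.0
    for i, j in edges:
        g[i] += w[i] - w[j]
        g[j] += w[j] - w[i]
    return g

# Minimize with momentum; the smoothing term spreads the color along the path.
w, z = np.zeros(n), np.zeros(n)
for _ in range(5000):
    z = 0.9 * z + grad(w)
    w = w - 0.05 * z
print(np.round(w, 3))                        # every entry is close to 1
```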

Reference

Nesterov, Yurii. 2013. Introductory Lectures on Convex Optimization: A Basic Course. Vol. 87. Springer Science & Business Media. https://doi.org/10.1007/978-1-4419-8853-9.
Polyak, B. T. 1964. “Some Methods of Speeding up the Convergence of Iteration Methods.” USSR Computational Mathematics and Mathematical Physics 4 (5): 1–17. https://doi.org/10.1016/0041-5553(64)90137-5.
Williams, Kenneth. 1992. “The Nth Power of a 2x2 Matrix.” Mathematics Magazine 65 (5): 336. https://doi.org/10.2307/2691246.