Question:

In this exercise we explore how the early stopping of the gradient descent iterations (see Example B.10),

\[ \boldsymbol{x}_{t+1}=\boldsymbol{x}_{t}-\alpha \nabla f\left(\boldsymbol{x}_{t}\right), \quad t=0,1, \ldots \]


is (approximately) equivalent to the global minimization of \(f(\boldsymbol{x})+\frac{1}{2} \gamma\|\boldsymbol{x}\|^{2}\) for certain values of the ridge regularization parameter \(\gamma>0\) (see Example 6.1). We illustrate the early-stopping idea on the quadratic function \(f(\boldsymbol{x})=\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^{\top} \mathbf{H}(\boldsymbol{x}-\boldsymbol{\mu})\), where \(\mathbf{H} \in \mathbb{R}^{n \times n}\) is a symmetric positive-definite (Hessian) matrix with eigenvalues \(\left\{\lambda_{k}\right\}_{k=1}^{n}\).
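As a preliminary observation (a step not spelled out in the exercise text): for this quadratic the gradient is linear in \(\boldsymbol{x}\), so the gradient descent iteration becomes an affine recursion,

\[ \nabla f(\boldsymbol{x})=\mathbf{H}(\boldsymbol{x}-\boldsymbol{\mu}), \qquad \boldsymbol{x}_{t+1}=\boldsymbol{x}_{t}-\alpha \mathbf{H}\left(\boldsymbol{x}_{t}-\boldsymbol{\mu}\right)=(\mathbf{I}-\alpha \mathbf{H}) \boldsymbol{x}_{t}+\alpha \mathbf{H} \boldsymbol{\mu}. \]

This form is used repeatedly in parts (b)–(d) below.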

(a) Verify that for a symmetric matrix \(\mathbf{A} \in \mathbb{R}^{n \times n}\) such that \(\mathbf{I}-\mathbf{A}\) is invertible, we have \[ \mathbf{I}+\mathbf{A}+\cdots+\mathbf{A}^{t-1}=\left(\mathbf{I}-\mathbf{A}^{t}\right)(\mathbf{I}-\mathbf{A})^{-1}. \]
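One way to verify this (a sketch; in fact only the invertibility of \(\mathbf{I}-\mathbf{A}\) is needed, not symmetry) is the telescoping product

\[ (\mathbf{I}-\mathbf{A})\left(\mathbf{I}+\mathbf{A}+\cdots+\mathbf{A}^{t-1}\right)=\left(\mathbf{I}+\mathbf{A}+\cdots+\mathbf{A}^{t-1}\right)-\left(\mathbf{A}+\mathbf{A}^{2}+\cdots+\mathbf{A}^{t}\right)=\mathbf{I}-\mathbf{A}^{t}. \]

Left-multiplying by \((\mathbf{I}-\mathbf{A})^{-1}\) gives the claim, since \((\mathbf{I}-\mathbf{A})^{-1}\) and \(\mathbf{I}-\mathbf{A}^{t}\) commute (both are functions of \(\mathbf{A}\)).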

(b) Let \(\mathbf{H}=\mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^{\top}\) be the diagonalization of \(\mathbf{H}\) as per Theorem A.8. If \(\boldsymbol{x}_{0}=\mathbf{0}\), show that the formula for \(\boldsymbol{x}_{t}\) is \[ \boldsymbol{x}_{t}=\boldsymbol{\mu}-\mathbf{Q}(\mathbf{I}-\alpha \boldsymbol{\Lambda})^{t} \mathbf{Q}^{\top} \boldsymbol{\mu}. \]
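A possible route (a sketch, applying part (a) with \(\mathbf{A}=\mathbf{I}-\alpha \mathbf{H}\) and \(\boldsymbol{x}_{0}=\mathbf{0}\)):

\[ \boldsymbol{x}_{t}=\sum_{s=0}^{t-1}(\mathbf{I}-\alpha \mathbf{H})^{s}\, \alpha \mathbf{H} \boldsymbol{\mu}=\left(\mathbf{I}-(\mathbf{I}-\alpha \mathbf{H})^{t}\right)(\alpha \mathbf{H})^{-1} \alpha \mathbf{H} \boldsymbol{\mu}=\boldsymbol{\mu}-\mathbf{Q}(\mathbf{I}-\alpha \boldsymbol{\Lambda})^{t} \mathbf{Q}^{\top} \boldsymbol{\mu}, \]

using \((\mathbf{I}-\alpha \mathbf{H})^{t}=\mathbf{Q}(\mathbf{I}-\alpha \boldsymbol{\Lambda})^{t} \mathbf{Q}^{\top}\) in the last step.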

Hence, deduce that a necessary condition for \(\boldsymbol{x}_{t}\) to converge is \(\alpha<2 / \max _{k} \lambda_{k}\).
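(A hint, assuming \(\mathbf{Q}^{\top} \boldsymbol{\mu}\) has no zero entries: in the eigenbasis the \(k\)-th coordinate of \(\boldsymbol{\mu}-\boldsymbol{x}_{t}\) scales as \(\left(1-\alpha \lambda_{k}\right)^{t}\), which converges as \(t \rightarrow \infty\) only if \(\left|1-\alpha \lambda_{k}\right|<1\), that is, \(0<\alpha \lambda_{k}<2\) for every \(k\).)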

(c) Show that the minimizer of \(f(\boldsymbol{x})+\frac{1}{2} \gamma\|\boldsymbol{x}\|^{2}\) can be written as \[ \boldsymbol{x}^{*}=\boldsymbol{\mu}-\mathbf{Q}\left(\mathbf{I}+\gamma^{-1} \boldsymbol{\Lambda}\right)^{-1} \mathbf{Q}^{\top} \boldsymbol{\mu} \]
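One derivation (a sketch): the first-order condition for the regularized objective reads

\[ \mathbf{H}\left(\boldsymbol{x}^{*}-\boldsymbol{\mu}\right)+\gamma \boldsymbol{x}^{*}=\mathbf{0} \quad \Longleftrightarrow \quad \boldsymbol{x}^{*}=(\mathbf{H}+\gamma \mathbf{I})^{-1} \mathbf{H} \boldsymbol{\mu}=\boldsymbol{\mu}-\gamma(\mathbf{H}+\gamma \mathbf{I})^{-1} \boldsymbol{\mu}, \]

and diagonalizing \(\mathbf{H}\) turns \(\gamma(\mathbf{H}+\gamma \mathbf{I})^{-1}\) into \(\mathbf{Q} \operatorname{diag}\left(\gamma /\left(\lambda_{k}+\gamma\right)\right) \mathbf{Q}^{\top}=\mathbf{Q}\left(\mathbf{I}+\gamma^{-1} \boldsymbol{\Lambda}\right)^{-1} \mathbf{Q}^{\top}\).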

(d) For a fixed value of \(t\), let the learning rate \(\alpha \downarrow 0\). Using parts (b) and (c), show that if \(\gamma \simeq 1 /(t \alpha)\) as \(\alpha \downarrow 0\), then \(\boldsymbol{x}_{t} \simeq \boldsymbol{x}^{*}\). In other words, \(\boldsymbol{x}_{t}\) is approximately equal to \(\boldsymbol{x}^{*}\) for small \(\alpha\), provided that \(\gamma\) is inversely proportional to \(t \alpha\).
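The claim in (d) can be checked numerically. Below is a minimal Python/NumPy sketch (not from the book; the Hessian, dimension, step size, and iteration count are arbitrary choices for illustration). It runs \(t\) gradient descent steps from \(\boldsymbol{x}_{0}=\mathbf{0}\) and compares the result with the ridge minimizer for \(\gamma=1 /(t \alpha)\):

import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)        # symmetric positive-definite Hessian (arbitrary example)
mu = rng.standard_normal(n)

alpha = 1e-4                       # small learning rate (the alpha -> 0 regime)
t = 200                            # fixed number of early-stopped iterations

# t gradient descent steps on f(x) = 0.5 (x - mu)' H (x - mu), starting from x_0 = 0
x = np.zeros(n)
for _ in range(t):
    x = x - alpha * H @ (x - mu)

# ridge minimizer of f(x) + 0.5 * gamma * ||x||^2 with gamma = 1 / (t * alpha)
gamma = 1.0 / (t * alpha)
x_star = np.linalg.solve(H + gamma * np.eye(n), H @ mu)

print(np.linalg.norm(x - x_star))  # small; shrinks further as alpha decreases

Decreasing alpha (with t fixed, so gamma = 1/(t*alpha) grows) should drive the printed discrepancy toward zero, in line with part (d).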


Reference: Dirk P. Kroese, Thomas Taimre, Radislav Vaisman, and Zdravko Botev, Data Science and Machine Learning: Mathematical and Statistical Methods, 1st Edition. ISBN: 9781118710852.