On the Regularization Effect of Stochastic Gradient Descent applied to
Least Squares
- URL: http://arxiv.org/abs/2007.13288v2
- Date: Tue, 1 Sep 2020 20:34:27 GMT
- Title: On the Regularization Effect of Stochastic Gradient Descent applied to
Least Squares
- Authors: Stefan Steinerberger
- Abstract summary: We study the behavior of stochastic gradient descent applied to $\|Ax - b\|_2^2 \rightarrow \min$ for invertible $A \in \mathbb{R}^{n \times n}$.
We show that there is an explicit constant $c_{A}$ depending (mildly) on $A$ such that $\mathbb{E}\,\|Ax_{k+1}-b\|_2^2 \leq \left(1 + \frac{c_{A}}{\|A\|_F^2}\right)\|Ax_k - b\|_2^2 - \frac{2}{\|A\|_F^2}\|A^T A (x_k - x)\|_2^2$.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the behavior of stochastic gradient descent applied to $\|Ax -b
\|_2^2 \rightarrow \min$ for invertible $A \in \mathbb{R}^{n \times n}$. We
show that there is an explicit constant $c_{A}$ depending (mildly) on $A$ such
that $$ \mathbb{E} ~\left\| Ax_{k+1}-b\right\|^2_{2} \leq \left(1 +
\frac{c_{A}}{\|A\|_F^2}\right) \left\|A x_k -b \right\|^2_{2} -
\frac{2}{\|A\|_F^2} \left\|A^T A (x_k - x)\right\|^2_{2}.$$ This is a curious
inequality: the last term has one more matrix applied to the residual $x_k - x$
than the remaining terms: if $x_k - x$ is mainly composed of singular vectors
associated with large singular values, stochastic gradient descent regularizes quickly. For
symmetric matrices, this inequality has an extension to higher-order Sobolev
spaces. This explains a (known) regularization phenomenon: an energy cascade
from large singular values to small singular values has a smoothing effect.
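As an informal illustration of the inequality above, the following minimal Python sketch (a toy under stated assumptions, not the paper's exact setup) runs SGD on $\|Ax-b\|_2^2$ in its randomized Kaczmarz form, sampling row $a_i$ with probability $\|a_i\|_2^2/\|A\|_F^2$, and tracks the error $x_k - x$ in the basis of right singular vectors of $A$. The dimension, step rule, and split into "top" and "bottom" singular directions are illustrative choices.

```python
import numpy as np

# Toy illustration (assumed setup): SGD on ||Ax - b||^2 realized as randomized
# Kaczmarz, i.e. sample row i with probability ||a_i||^2 / ||A||_F^2 and project
# onto the hyperplane <a_i, x> = b_i. We watch the error x_k - x in the basis of
# right singular vectors of A: coordinates tied to large singular values should
# shrink much faster than those tied to small ones (the energy cascade).

rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n))          # invertible with probability 1
x_true = rng.standard_normal(n)
b = A @ x_true

U, s, Vt = np.linalg.svd(A)              # s[0] >= s[1] >= ... >= s[-1]
row_norms_sq = np.sum(A ** 2, axis=1)
probs = row_norms_sq / row_norms_sq.sum()

x = np.zeros(n)
for k in range(2001):
    if k % 500 == 0:
        err = Vt @ (x - x_true)                    # error in the singular basis
        top = np.linalg.norm(err[: n // 5])        # large-singular-value part
        bottom = np.linalg.norm(err[-(n // 5):])   # small-singular-value part
        print(f"step {k:4d}  |err_top| = {top:.3e}  |err_bottom| = {bottom:.3e}")
    i = rng.choice(n, p=probs)
    a_i = A[i]
    x = x - ((a_i @ x - b[i]) / row_norms_sq[i]) * a_i   # SGD / Kaczmarz step
```

On typical runs the components of the error along the largest singular directions collapse within a few hundred steps, while the components along the smallest singular directions decay far more slowly, matching the energy cascade described in the abstract.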
Related papers
- Optimal Sketching for Residual Error Estimation for Matrix and Vector Norms [50.15964512954274]
We study the problem of residual error estimation for matrix and vector norms using a linear sketch.
We demonstrate that this gives a substantial advantage empirically, for roughly the same sketch size and accuracy as in previous work.
We also show an $\Omega(k^{2/p} n^{1-2/p})$ lower bound for the sparse recovery problem, which is tight up to a $\mathrm{poly}(\log n)$ factor.
arXiv Detail & Related papers (2024-08-16T02:33:07Z)
- Provably learning a multi-head attention layer [55.2904547651831]
The multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that, in the worst case, exponential dependence on the number of heads $m$ is unavoidable.
arXiv Detail & Related papers (2024-02-06T15:39:09Z)
- On the $O(\frac{\sqrt{d}}{T^{1/4}})$ Convergence Rate of RMSProp and Its Momentum Extension Measured by $\ell_1$ Norm [59.65871549878937]
This paper considers RMSProp and its momentum extension and establishes a convergence rate of $O(\frac{\sqrt{d}}{T^{1/4}})$ for the $\frac{1}{T}\sum_{k=1}^{T}$-averaged gradient, measured by the $\ell_1$ norm.
Our convergence rate matches the lower bound with respect to all the coefficients except the dimension $d$.
Our convergence rate can be considered analogous to the corresponding $\frac{1}{T}\sum_{k=1}^{T}$-averaged rate of SGD.
arXiv Detail & Related papers (2024-02-01T07:21:32Z)
- Spectral Statistics of the Sample Covariance Matrix for High Dimensional Linear Gaussians [12.524855369455421]
We study the performance of the ordinary least squares (OLS) method for the estimation of a high-dimensional stable state transition matrix.
The OLS estimator incurs a phase transition and becomes transient: increasing the number of samples only worsens the estimation error.
arXiv Detail & Related papers (2023-12-10T06:55:37Z)
- Convergence of Alternating Gradient Descent for Matrix Factorization [5.439020425819001]
We consider alternating gradient descent (AGD) with fixed step size applied to the asymmetric matrix factorization objective.
We show that, for a rank-$r$ matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$, the smoothness constant $C$ appearing in the iteration complexity $T$ can be taken to be an absolute constant.
arXiv Detail & Related papers (2023-05-11T16:07:47Z)
- Generalizations of Powers--Størmer's inequality [0.0]
The inequality $\mathrm{tr}(A) + \mathrm{tr}(B) - \mathrm{tr}|A-B| \leq 2\,\mathrm{tr}\big(f(A)g(B)\big)$ holds for every positive-valued matrix monotone function $f$.
This study demonstrates that the set of functions satisfying this inequality includes additional elements and provides illustrative examples to support this claim.
arXiv Detail & Related papers (2023-02-15T17:59:01Z)
- A Nearly-Optimal Bound for Fast Regression with $\ell_\infty$ Guarantee [16.409210914237086]
Given a matrix $A \in \mathbb{R}^{n \times d}$ and a vector $b \in \mathbb{R}^n$, we consider the regression problem with $\ell_\infty$ guarantees.
We show that in order to obtain such an $\ell_\infty$ guarantee for $\ell_2$ regression, one has to use sketching matrices that are dense.
We also develop a novel analytical framework for $\ell_\infty$-guarantee regression that utilizes the Oblivious Coordinate-wise Embedding (OCE) property.
arXiv Detail & Related papers (2023-02-01T05:22:40Z)
- Low-Rank Approximation with $1/\epsilon^{1/3}$ Matrix-Vector Products [58.05771390012827]
We study iterative methods based on Krylov subspaces for low-rank approximation under any Schatten-$p$ norm.
Our main result is an algorithm that uses only $\tilde{O}(k/\sqrt{\epsilon})$ matrix-vector products.
arXiv Detail & Related papers (2022-02-10T16:10:41Z)
- On the Self-Penalization Phenomenon in Feature Selection [69.16452769334367]
We describe an implicit sparsity-inducing mechanism based on minimization over a family of kernels.
As an application, we use this sparsity-inducing mechanism to build algorithms that are consistent for feature selection.
arXiv Detail & Related papers (2021-10-12T09:36:41Z)
- Spectral properties of sample covariance matrices arising from random matrices with independent non identically distributed columns [50.053491972003656]
It was previously shown that the functionals $\mathrm{tr}(AR(z))$, for $R(z) = (\frac{1}{n}XX^T - zI_p)^{-1}$ and $A \in \mathcal{M}_p$ deterministic, have a standard deviation of order $O(\|A\|_* / \sqrt{n})$.
Here, we establish a bound on $\|\mathbb{E}[R(z)] - \tilde{R}(z)\|_F$.
arXiv Detail & Related papers (2021-09-06T14:21:43Z)
- On the robustness of the minimum $\ell_2$ interpolator [2.918940961856197]
We analyse the interpolator with minimal $\ell_2$-norm $\hat{\beta}$ in a general high dimensional linear regression framework.
We prove that, with high probability, the prediction loss of this estimator is bounded from above by $(\|\beta^*\|_2^2\, r_{cn}(\Sigma) \vee \|\xi\|^2)/n$, where $r_k(\Sigma) = \sum_{i \geq k} \lambda_i(\Sigma)$ denotes the tail sum of the eigenvalues of $\Sigma$.
arXiv Detail & Related papers (2020-03-12T15:12:28Z)
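To make the object in the last entry concrete, here is a small Python sketch (an assumed toy setup with illustrative dimensions and noise level, not the paper's framework): in an overparameterized linear model with $d > n$, the minimum $\ell_2$-norm interpolator of $y = X\beta^* + \xi$ is the least-norm solution $\hat{\beta} = X^{+}y$, which fits the training data exactly.

```python
import numpy as np

# Toy sketch (assumed setup, not the paper's exact framework): the minimum
# l2-norm interpolator in an overparameterized linear regression (d > n).

rng = np.random.default_rng(1)
n, d = 100, 500                          # n samples, d > n features
X = rng.standard_normal((n, d))
beta_star = np.zeros(d)
beta_star[:10] = 1.0                     # illustrative sparse signal
xi = 0.1 * rng.standard_normal(n)        # noise vector
y = X @ beta_star + xi

beta_hat = np.linalg.pinv(X) @ y         # least-norm solution of X beta = y
assert np.allclose(X @ beta_hat, y)      # it interpolates the training data

# With isotropic features (Sigma = I), the prediction loss reduces to
# ||beta_hat - beta*||_2^2; the entry above bounds this kind of quantity via
# the noise level and the tail sums r_k(Sigma) of the eigenvalues of Sigma.
print("prediction loss (Sigma = I):", np.linalg.norm(beta_hat - beta_star) ** 2)
```

The pseudoinverse is used only for clarity; np.linalg.lstsq returns the same minimum-norm solution and is preferable for larger problems.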