On the Regularization Effect of Stochastic Gradient Descent applied to
Least Squares
- URL: http://arxiv.org/abs/2007.13288v2
- Date: Tue, 1 Sep 2020 20:34:27 GMT
- Title: On the Regularization Effect of Stochastic Gradient Descent applied to
Least Squares
- Authors: Stefan Steinerberger
- Abstract summary: We study the behavior of stochastic gradient descent applied to $\|Ax - b\|_2^2 \rightarrow \min$ for invertible $A \in \mathbb{R}^{n \times n}$.
We show that there is an explicit constant $c_{A}$ depending (mildly) on $A$ such that $\mathbb{E}\,\|Ax_{k+1}-b\|_2^2 \leq \left(1 + \frac{c_{A}}{\|A\|_F^2}\right)\|Ax_k - b\|_2^2 - \frac{2}{\|A\|_F^2}\|A^T A (x_k - x)\|_2^2$.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the behavior of stochastic gradient descent applied to $\|Ax -b
\|_2^2 \rightarrow \min$ for invertible $A \in \mathbb{R}^{n \times n}$. We
show that there is an explicit constant $c_{A}$ depending (mildly) on $A$ such
that $$ \mathbb{E} ~\left\| Ax_{k+1}-b\right\|^2_{2} \leq \left(1 +
\frac{c_{A}}{\|A\|_F^2}\right) \left\|A x_k -b \right\|^2_{2} -
\frac{2}{\|A\|_F^2} \left\|A^T A (x_k - x)\right\|^2_{2}.$$ This is a curious
inequality: the last term has one more matrix applied to the residual $x_k - x$
than the remaining terms: if $x_k - x$ is mainly composed of singular vectors
associated with large singular values, stochastic gradient descent regularizes quickly. For
symmetric matrices, this inequality has an extension to higher-order Sobolev
spaces. This explains a (known) regularization phenomenon: an energy cascade
from large singular values to small singular values has a smoothing effect.
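As an informal illustration of the inequality above, the following minimal Python sketch (a toy under stated assumptions, not the paper's exact setup) runs SGD on $\|Ax-b\|_2^2$ in its randomized Kaczmarz form, sampling row $a_i$ with probability $\|a_i\|_2^2/\|A\|_F^2$, and tracks the error $x_k - x$ in the basis of right singular vectors of $A$. The dimension, step rule, and split into "top" and "bottom" singular directions are illustrative choices.

```python
import numpy as np

# Toy illustration (assumed setup): SGD on ||Ax - b||^2 realized as randomized
# Kaczmarz, i.e. sample row i with probability ||a_i||^2 / ||A||_F^2 and project
# onto the hyperplane <a_i, x> = b_i. We watch the error x_k - x in the basis of
# right singular vectors of A: coordinates tied to large singular values should
# shrink much faster than those tied to small ones (the energy cascade).

rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n))          # invertible with probability 1
x_true = rng.standard_normal(n)
b = A @ x_true

U, s, Vt = np.linalg.svd(A)              # s[0] >= s[1] >= ... >= s[-1]
row_norms_sq = np.sum(A ** 2, axis=1)
probs = row_norms_sq / row_norms_sq.sum()

x = np.zeros(n)
for k in range(2001):
    if k % 500 == 0:
        err = Vt @ (x - x_true)                    # error in the singular basis
        top = np.linalg.norm(err[: n // 5])        # large-singular-value part
        bottom = np.linalg.norm(err[-(n // 5):])   # small-singular-value part
        print(f"step {k:4d}  |err_top| = {top:.3e}  |err_bottom| = {bottom:.3e}")
    i = rng.choice(n, p=probs)
    a_i = A[i]
    x = x - ((a_i @ x - b[i]) / row_norms_sq[i]) * a_i   # SGD / Kaczmarz step
```

On typical runs the components of the error along the largest singular directions collapse within a few hundred steps, while the components along the smallest singular directions decay far more slowly, matching the energy cascade described in the abstract.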
Related papers
- Optimal Sketching for Residual Error Estimation for Matrix and Vector Norms [50.15964512954274]
We study the problem of residual error estimation for matrix and vector norms using a linear sketch.
We demonstrate that this gives a substantial advantage empirically, for roughly the same sketch size and accuracy as in previous work.
We also show an $\Omega(k^{2/p} n^{1-2/p})$ lower bound for the sparse recovery problem, which is tight up to a $\mathrm{poly}(\log n)$ factor.
arXiv Detail & Related papers (2024-08-16T02:33:07Z)
- Provably learning a multi-head attention layer [55.2904547651831]
The multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that, in the worst case, exponential dependence on the number of heads $m$ is unavoidable.
arXiv Detail & Related papers (2024-02-06T15:39:09Z)
- On the $O(\frac{\sqrt{d}}{T^{1/4}})$ Convergence Rate of RMSProp and Its Momentum Extension Measured by $\ell_1$ Norm [59.65871549878937]
This paper considers RMSProp and its momentum extension and establishes a convergence rate of $O(\frac{\sqrt{d}}{T^{1/4}})$ for the $\frac{1}{T}\sum_{k=1}^{T}$-averaged gradient, measured by the $\ell_1$ norm.
Our convergence rate matches the lower bound with respect to all the coefficients except the dimension $d$.
Our convergence rate can be considered analogous to the corresponding $\frac{1}{T}\sum_{k=1}^{T}$-averaged rate of SGD.
arXiv Detail & Related papers (2024-02-01T07:21:32Z)
- Spectral Statistics of the Sample Covariance Matrix for High Dimensional Linear Gaussians [12.524855369455421]
We study the performance of the ordinary least squares (OLS) method for the estimation of a high-dimensional stable state transition matrix.
The OLS estimator incurs a phase transition and becomes transient: increasing the number of samples only worsens the estimation error.
arXiv Detail & Related papers (2023-12-10T06:55:37Z)
- Convergence of Alternating Gradient Descent for Matrix Factorization [5.439020425819001]
We consider alternating gradient descent (AGD) with fixed step size applied to the asymmetric matrix factorization objective.
We show that, for a rank-$r$ matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$, the smoothness constant $C$ appearing in the iteration complexity $T$ can be taken to be an absolute constant.
arXiv Detail & Related papers (2023-05-11T16:07:47Z)
- Generalizations of Powers--Størmer's inequality [0.0]
The inequality $\mathrm{tr}(A) + \mathrm{tr}(B) - \mathrm{tr}|A-B| \leq 2\,\mathrm{tr}\big(f(A)g(B)\big)$ holds for every positive-valued matrix monotone function $f$.
This study demonstrates that the set of functions satisfying this inequality includes additional elements and provides illustrative examples to support this claim.
arXiv Detail & Related papers (2023-02-15T17:59:01Z)
- A Nearly-Optimal Bound for Fast Regression with $\ell_\infty$ Guarantee [16.409210914237086]
Given a matrix $A \in \mathbb{R}^{n \times d}$ and a vector $b \in \mathbb{R}^n$, we consider the regression problem with $\ell_\infty$ guarantees.
We show that in order to obtain such an $\ell_\infty$ guarantee for $\ell_2$ regression, one has to use sketching matrices that are dense.
We also develop a novel analytical framework for $\ell_\infty$-guarantee regression that utilizes the Oblivious Coordinate-wise Embedding (OCE) property.
arXiv Detail & Related papers (2023-02-01T05:22:40Z)
- Low-Rank Approximation with $1/\epsilon^{1/3}$ Matrix-Vector Products [58.05771390012827]
We study iterative methods based on Krylov subspaces for low-rank approximation under any Schatten-$p$ norm.
Our main result is an algorithm that uses only $\tilde{O}(k/\sqrt{\epsilon})$ matrix-vector products.
arXiv Detail & Related papers (2022-02-10T16:10:41Z)
- On the Self-Penalization Phenomenon in Feature Selection [69.16452769334367]
We describe an implicit sparsity-inducing mechanism based on minimization over a family of kernels.
As an application, we use this sparsity-inducing mechanism to build algorithms that are consistent for feature selection.
arXiv Detail & Related papers (2021-10-12T09:36:41Z)
- Spectral properties of sample covariance matrices arising from random matrices with independent non identically distributed columns [50.053491972003656]
It was previously shown that the functionals $\mathrm{tr}(AR(z))$, for $R(z) = (\frac{1}{n}XX^T - zI_p)^{-1}$ and $A \in \mathcal{M}_p$ deterministic, have a standard deviation of order $O(\|A\|_* / \sqrt{n})$.
Here, we establish a bound on $\|\mathbb{E}[R(z)] - \tilde{R}(z)\|_F$.
arXiv Detail & Related papers (2021-09-06T14:21:43Z)
- On the robustness of the minimum $\ell_2$ interpolator [2.918940961856197]
We analyse the interpolator with minimal $\ell_2$-norm $\hat{\beta}$ in a general high dimensional linear regression framework.
We prove that, with high probability, the prediction loss of this estimator is bounded from above by $(\|\beta^*\|_2^2\, r_{cn}(\Sigma) \vee \|\xi\|^2)/n$, where $r_k(\Sigma) = \sum_{i \geq k} \lambda_i(\Sigma)$ denotes the tail sum of the eigenvalues of $\Sigma$.
arXiv Detail & Related papers (2020-03-12T15:12:28Z)
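To make the object in the last entry concrete, here is a small Python sketch (an assumed toy setup with illustrative dimensions and noise level, not the paper's framework): in an overparameterized linear model with $d > n$, the minimum $\ell_2$-norm interpolator of $y = X\beta^* + \xi$ is the least-norm solution $\hat{\beta} = X^{+}y$, which fits the training data exactly.

```python
import numpy as np

# Toy sketch (assumed setup, not the paper's exact framework): the minimum
# l2-norm interpolator in an overparameterized linear regression (d > n).

rng = np.random.default_rng(1)
n, d = 100, 500                          # n samples, d > n features
X = rng.standard_normal((n, d))
beta_star = np.zeros(d)
beta_star[:10] = 1.0                     # illustrative sparse signal
xi = 0.1 * rng.standard_normal(n)        # noise vector
y = X @ beta_star + xi

beta_hat = np.linalg.pinv(X) @ y         # least-norm solution of X beta = y
assert np.allclose(X @ beta_hat, y)      # it interpolates the training data

# With isotropic features (Sigma = I), the prediction loss reduces to
# ||beta_hat - beta*||_2^2; the entry above bounds this kind of quantity via
# the noise level and the tail sums r_k(Sigma) of the eigenvalues of Sigma.
print("prediction loss (Sigma = I):", np.linalg.norm(beta_hat - beta_star) ** 2)
```

The pseudoinverse is used only for clarity; np.linalg.lstsq returns the same minimum-norm solution and is preferable for larger problems.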