Large Learning Rate Tames Homogeneity: Convergence and Balancing Effect
- URL: http://arxiv.org/abs/2110.03677v1
- Date: Thu, 7 Oct 2021 17:58:21 GMT
- Title: Large Learning Rate Tames Homogeneity: Convergence and Balancing Effect
- Authors: Yuqing Wang, Minshuo Chen, Tuo Zhao, Molei Tao
- Abstract summary: We consider using Gradient Descent (GD) with a large learning rate on a homogeneous matrix factorization problem.
We prove a convergence theory for constant large learning rates well beyond $2/L$.
We rigorously establish an implicit bias of GD induced by such a large learning rate, termed 'balancing'.
- Score: 43.00475513526005
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent empirical advances show that training deep models with a large
learning rate often improves generalization performance. However, theoretical
justification for the benefits of a large learning rate is limited, due to
challenges in the analysis. In this paper, we consider using Gradient Descent
(GD) with a large learning rate on a homogeneous matrix factorization problem,
i.e., $\min_{X, Y} \|A - XY^\top\|_{\sf F}^2$. We prove a convergence theory
for constant large learning rates well beyond $2/L$, where $L$ is the largest
eigenvalue of the Hessian at the initialization. Moreover, we rigorously establish
an implicit bias of GD induced by such a large learning rate, termed
'balancing', meaning that the magnitudes of $X$ and $Y$ at the limit of GD
iterations will be close even if their initialization is significantly
unbalanced. Numerical experiments are provided to support our theory.
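For concreteness, the sketch below runs plain GD on the same objective, $\min_{X, Y} \|A - XY^\top\|_{\sf F}^2$, from a deliberately unbalanced initialization and tracks the factor norms. It is a minimal illustration, not the authors' code: the problem size, initialization scale, step size, and iteration count are assumptions, and the step size may need tuning to land in the large-learning-rate regime the theory covers.

```python
import numpy as np

# Plain GD on min_{X,Y} ||A - X Y^T||_F^2 from an unbalanced initialization.
# Gradients: dF/dX = 2 (X Y^T - A) Y,  dF/dY = 2 (X Y^T - A)^T X.
rng = np.random.default_rng(0)
d, r = 20, 5
A = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))  # rank-r target

# Deliberately unbalanced start: ||X||_F is much larger than ||Y||_F.
X = 5.0 * rng.standard_normal((d, r))
Y = 0.05 * rng.standard_normal((d, r))
print("init  ||X||_F = %.2f, ||Y||_F = %.2f" % (np.linalg.norm(X), np.linalg.norm(Y)))

# Hypothetical step size: the balancing effect needs eta beyond 2/L (L = top
# Hessian eigenvalue at initialization) yet small enough that GD does not
# diverge; adjust eta if the run blows up or the imbalance persists.
eta = 1e-3
for _ in range(5000):
    R = X @ Y.T - A                                            # residual
    X, Y = X - eta * 2.0 * (R @ Y), Y - eta * 2.0 * (R.T @ X)  # simultaneous update

print("final ||X||_F = %.2f, ||Y||_F = %.2f" % (np.linalg.norm(X), np.linalg.norm(Y)))
print("final loss    = %.3e" % (np.linalg.norm(A - X @ Y.T) ** 2))
```

In the regime analyzed in the paper, the two printed final norms are expected to be of comparable magnitude despite the 100x imbalance at initialization.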
Related papers
- Scaling Laws in Linear Regression: Compute, Parameters, and Data [86.48154162485712]
We study the theory of scaling laws in an infinite dimensional linear regression setup.
We show that the reducible part of the test error is $\Theta(M^{-(a-1)} + N^{-(a-1)/a})$.
Our theory is consistent with the empirical neural scaling laws and verified by numerical simulation.
arXiv Detail & Related papers (2024-06-12T17:53:29Z) - Empirical Risk Minimization with Shuffled SGD: A Primal-Dual Perspective
and Improved Bounds [12.699376765058137]
Stochastic gradient descent (SGD) is perhaps the most prevalent optimization method in modern machine learning.
It is only very recently that SGD with sampling without replacement -- shuffled SGD -- has been analyzed (a generic sketch of this sampling scheme appears after this list).
We prove fine-grained complexity bounds that depend on the data matrix and are never worse than what is predicted by the existing bounds.
arXiv Detail & Related papers (2023-06-21T18:14:44Z) - Restricted Strong Convexity of Deep Learning Models with Smooth
Activations [31.003601717265006]
We study the problem of optimization of deep learning models with smooth activation functions.
We introduce a new analysis of optimization based on Restricted Strong Convexity (RSC).
Ours is the first result establishing geometric convergence of GD based on RSC for deep learning models.
arXiv Detail & Related papers (2022-09-29T21:24:26Z) - Lassoed Tree Boosting [53.56229983630983]
We prove that a gradient boosted tree algorithm with early stopping achieves faster-than-$n^{-1/4}$ $L^2$ convergence in the large nonparametric space of càdlàg functions of bounded sectional variation.
Our convergence proofs are based on a novel, general theorem on early stopping with empirical loss minimizers of nested Donsker classes.
arXiv Detail & Related papers (2022-05-22T00:34:41Z) - Biased Gradient Estimate with Drastic Variance Reduction for Meta
Reinforcement Learning [25.639542287310768]
Biased gradient estimates are almost always implemented in practice, whereas prior theory on meta-RL only establishes convergence under unbiased gradient estimates.
We propose linearized score function (LSF) gradient estimates, which have bias $\mathcal{O}(1/\sqrt{N})$ and variance $\mathcal{O}(1/N)$.
We establish theoretical guarantees for the LSF gradient estimates in meta-RL regarding its convergence to stationary points, showing better dependency on $N$ than prior work when $N$ is large.
arXiv Detail & Related papers (2021-12-14T12:29:43Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - Revisiting SGD with Increasingly Weighted Averaging: Optimization and
Generalization Perspectives [50.12802772165797]
The averaging technique combines all iterative solutions into a single solution.
Experiments demonstrate the trade-off and the effectiveness of increasingly weighted averaging compared with other averaging schemes.
arXiv Detail & Related papers (2020-03-09T18:14:00Z) - A Precise High-Dimensional Asymptotic Theory for Boosting and
Minimum-$\ell_1$-Norm Interpolated Classifiers [3.167685495996986]
This paper establishes a precise high-dimensional theory for boosting on separable data.
Under a class of statistical models, we provide an exact analysis of the generalization error of boosting.
We also explicitly pin down the relation between the boosting test error and the optimal Bayes error.
arXiv Detail & Related papers (2020-02-05T00:24:53Z)
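For reference, the sketch below illustrates what sampling without replacement means in shuffled SGD, using a generic least-squares objective; the data, model, step size, and epoch count are illustrative assumptions and are not taken from the paper summarized above.

```python
import numpy as np

# Generic shuffled SGD on least squares: each epoch visits every sample exactly
# once in a fresh random order (sampling without replacement).
rng = np.random.default_rng(0)
n, d = 500, 20
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

w = np.zeros(d)
eta = 5e-3  # illustrative constant step size
for epoch in range(30):
    for i in rng.permutation(n):            # the "shuffled" part: no replacement
        grad_i = (A[i] @ w - b[i]) * A[i]   # gradient of 0.5 * (a_i^T w - b_i)^2
        w -= eta * grad_i

print("mean squared error:", np.mean((A @ w - b) ** 2))
```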