Aiming towards the minimizers: fast convergence of SGD for
overparametrized problems
- URL: http://arxiv.org/abs/2306.02601v1
- Date: Mon, 5 Jun 2023 05:21:01 GMT
- Title: Aiming towards the minimizers: fast convergence of SGD for
overparametrized problems
- Authors: Chaoyue Liu, Dmitriy Drusvyatskiy, Mikhail Belkin, Damek Davis, Yi-An
Ma
- Abstract summary: We propose a regularity condition within the interpolation regime which endows the stochastic gradient method with the same worst-case iteration complexity as the deterministic gradient method.
In contrast, all existing guarantees require the stochastic gradient method to take small steps, thereby resulting in a much slower linear rate of convergence.
We demonstrate that our condition holds when training sufficiently wide feedforward neural networks with a linear output layer.
- Score: 25.077446336619378
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern machine learning paradigms, such as deep learning, occur in or close
to the interpolation regime, wherein the number of model parameters is much
larger than the number of data samples. In this work, we propose a regularity
condition within the interpolation regime which endows the stochastic gradient
method with the same worst-case iteration complexity as the deterministic
gradient method, while using only a single sampled gradient (or a minibatch) in
each iteration. In contrast, all existing guarantees require the stochastic
gradient method to take small steps, thereby resulting in a much slower linear
rate of convergence. Finally, we demonstrate that our condition holds when
training sufficiently wide feedforward neural networks with a linear output
layer.
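As a rough illustration of the interpolation regime described in the abstract, the following minimal sketch runs single-sample SGD with a constant step size on an overparametrized least-squares problem (more parameters than samples, with labels that can be fit exactly). It is not the paper's regularity condition, algorithm, or experimental setup. The function name `sgd_interpolation`, the problem sizes, and the step size `1 / max_i ||a_i||^2` are assumptions chosen for illustration only.

```python
import numpy as np


def sgd_interpolation(n=20, d=100, steps=5000, seed=0):
    """Single-sample SGD on a consistent least-squares problem with d > n.

    Hypothetical toy setup: rows a_i are Gaussian and labels are b = A @ x_true,
    so some parameter vector attains zero loss on every sample (interpolation).
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, d))
    x_true = rng.standard_normal(d)
    b = A @ x_true  # consistent labels: zero training loss is attainable

    # Constant step size (an assumption of this sketch, not taken from the paper):
    # the inverse of the largest per-sample smoothness constant ||a_i||^2.
    eta = 1.0 / np.max(np.sum(A**2, axis=1))

    x = np.zeros(d)
    losses = [0.5 * np.mean((A @ x - b) ** 2)]
    for _ in range(steps):
        i = rng.integers(n)              # sample one data point uniformly
        residual = A[i] @ x - b[i]       # scalar residual on that sample
        x -= eta * residual * A[i]       # gradient step on 0.5 * residual**2
        losses.append(0.5 * np.mean((A @ x - b) ** 2))
    return np.array(losses)


if __name__ == "__main__":
    history = sgd_interpolation()
    print(f"initial loss: {history[0]:.3e}, final loss: {history[-1]:.3e}")
```

Because the sampled gradients all vanish at an interpolating solution, the step size does not need to decay in this toy problem, which is the kind of behavior the abstract contrasts with small-step guarantees.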
Related papers
- Smoothing ADMM for Sparse-Penalized Quantile Regression with Non-Convex
Penalties [8.294148737585543]
This paper investigates quantile regression in the presence of non-convex and non-smooth sparse penalties, such as the minimax concave penalty (MCP) and the smoothly clipped absolute deviation (SCAD) penalty.
We introduce a novel single-loop smoothing ADMM algorithm with an increasing penalty parameter, named SIAD, designed specifically for sparse-penalized quantile regression.
arXiv Detail & Related papers (2023-09-04T21:48:51Z)
- Continuous-Time Meta-Learning with Forward Mode Differentiation [65.26189016950343]
We introduce Continuous-Time Meta-Learning (COMLN), a meta-learning algorithm where adaptation follows the dynamics of a gradient vector field.
Treating the learning process as an ODE offers the notable advantage that the length of the trajectory is now continuous.
We show empirically its efficiency in terms of runtime and memory usage, and we illustrate its effectiveness on a range of few-shot image classification problems.
arXiv Detail & Related papers (2022-03-02T22:35:58Z)
- Faster One-Sample Stochastic Conditional Gradient Method for Composite
Convex Minimization [61.26619639722804]
We propose a conditional gradient method (CGM) for minimizing convex finite-sum objectives formed as a sum of smooth and non-smooth terms.
The proposed method, equipped with a stochastic average gradient (SAG) estimator, requires only one sample per iteration. Nevertheless, it guarantees fast convergence rates on par with more sophisticated variance reduction techniques.
arXiv Detail & Related papers (2022-02-26T19:10:48Z)
- Spatio-Temporal Variational Gaussian Processes [26.60276485130467]
We introduce a scalable approach to Gaussian process inference that combines spatio-temporal filtering with natural gradient variational inference.
We derive a sparse approximation that constructs a state-space model over a reduced set of inducing points.
We show that for separable Markov kernels the full and sparse cases exactly recover the standard variational GP.
arXiv Detail & Related papers (2021-11-02T16:53:31Z)
- Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and
Beyond [63.59034509960994]
We study shuffling-based variants: minibatch and local Random Reshuffling, which draw gradients without replacement.
For smooth functions satisfying the Polyak-Lojasiewicz condition, we obtain convergence bounds which show that these shuffling-based variants converge faster than their with-replacement counterparts.
We propose an algorithmic modification called synchronized shuffling that leads to convergence rates faster than our lower bounds in near-homogeneous settings.
arXiv Detail & Related papers (2021-10-20T02:25:25Z)
- Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z)
- On the Convergence of Stochastic Extragradient for Bilinear Games with
Restarted Iteration Averaging [96.13485146617322]
We present an analysis of the Stochastic ExtraGradient (SEG) method with constant step size, and present variations of the method that yield favorable convergence.
We prove that when augmented with averaging, SEG provably converges to the Nash equilibrium, and such a rate is provably accelerated by incorporating a scheduled restarting procedure.
arXiv Detail & Related papers (2021-06-30T17:51:36Z)
- Conditional gradient methods for stochastically constrained convex
minimization [54.53786593679331]
We propose two novel conditional gradient-based methods for solving structured convex optimization problems.
The most important feature of our framework is that only a subset of the constraints is processed at each iteration.
Our algorithms rely on variance reduction and smoothing used in conjunction with conditional gradient steps, and are accompanied by rigorous convergence guarantees.
arXiv Detail & Related papers (2020-07-07T21:26:35Z)
- Adaptive Gradient Methods Converge Faster with Over-Parameterization
(but you should do a line-search) [32.24244211281863]
We study a simplistic setting: smooth, convex losses with models over-parameterized enough to interpolate the data.
We prove that AMSGrad with constant step-size and momentum converges to the minimizer at a faster $O(1/T)$ rate.
We show that these techniques improve the convergence and generalization of adaptive gradient methods across tasks.
arXiv Detail & Related papers (2020-06-11T21:23:30Z)
- Stochastic gradient algorithms from ODE splitting perspective [0.0]
We present a different view on stochastic optimization, which goes back to the splitting schemes for approximate solutions of ODEs.
In this work, we provide a connection between the stochastic gradient descent approach and a first-order splitting scheme for ODEs.
We consider the special case of splitting, which is inspired by machine learning applications and derive a new upper bound on the global splitting error for it.
arXiv Detail & Related papers (2020-04-19T22:45:32Z)