Related papers: Aiming towards the minimizers: fast convergence of SGD for overparametrized problems

Aiming towards the minimizers: fast convergence of SGD for overparametrized problems

URL: http://arxiv.org/abs/2306.02601v1
Date: Mon, 5 Jun 2023 05:21:01 GMT
Title: Aiming towards the minimizers: fast convergence of SGD for overparametrized problems
Authors: Chaoyue Liu, Dmitriy Drusvyatskiy, Mikhail Belkin, Damek Davis, Yi-An Ma
Abstract summary: We propose a regularity regime which endows the gradient method with the same worst-case complexity as the gradient method. All existing guarantees require the gradient method to take small steps, thereby resulting in a much slower linear rate of convergence. We demonstrate that our condition holds when training sufficiently wide feedforward neural networks with a linear output layer.
Score: 25.077446336619378
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Modern machine learning paradigms, such as deep learning, occur in or close to the interpolation regime, wherein the number of model parameters is much larger than the number of data samples. In this work, we propose a regularity condition within the interpolation regime which endows the stochastic gradient method with the same worst-case iteration complexity as the deterministic gradient method, while using only a single sampled gradient (or a minibatch) in each iteration. In contrast, all existing guarantees require the stochastic gradient method to take small steps, thereby resulting in a much slower linear rate of convergence. Finally, we demonstrate that our condition holds when training sufficiently wide feedforward neural networks with a linear output layer.

Related papers

Parallel Diffusion Solver via Residual Dirichlet Policy Optimization [88.7827307535107]
Diffusion models (DMs) have achieved state-of-the-art generative performance but suffer from high sampling latency due to their sequential denoising nature.<n>Existing solver-based acceleration methods often face significant image quality degradation under a low-dimensional budget.<n>We propose the Ensemble Parallel Direction solver (dubbed as EPD-EPr), a novel ODE solver that mitigates these errors by incorporating multiple gradient parallel evaluations in each step.
arXiv Detail & Related papers (2025-12-28T05:48:55Z)
Nearly Instance-Optimal Parameter Recovery from Many Trajectories via Hellinger Localization [19.74516565968901]
We extend the scope of instance-optimal rates in multi-trajectory settings via the Hellinger localization framework.<n>We instantiate our framework across four diverse case studies.<n>Our bounds nearly match the instance-optimal rates from normality, substantially improving over standard reductions.
arXiv Detail & Related papers (2025-10-07T20:12:04Z)
HyperFlow: Gradient-Free Emulation of Few-Shot Fine-Tuning [20.308785668386424]
We propose an approach that emulates gradient descent without computing gradients, enabling efficient test-time adaptation. Specifically, we formulate gradient descent as an Euler discretization of an ordinary differential equation (ODE) and train an auxiliary network to predict the task-conditional drift. The adaptation then reduces to a simple numerical integration, which requires only a few forward passes of the auxiliary network.
arXiv Detail & Related papers (2025-04-21T03:04:38Z)
Smoothing ADMM for Sparse-Penalized Quantile Regression with Non-Convex Penalties [8.294148737585543]
This paper investigates concave and clipped quantile regression in the presence of nonsecondary absolute and non-smooth convergence penalties. We introduce a novel-loop ADM algorithm with an increasing penalty multiplier, named SIAD, specifically for sparse regression.
arXiv Detail & Related papers (2023-09-04T21:48:51Z)
Continuous-Time Meta-Learning with Forward Mode Differentiation [65.26189016950343]
We introduce Continuous Meta-Learning (COMLN), a meta-learning algorithm where adaptation follows the dynamics of a gradient vector field. Treating the learning process as an ODE offers the notable advantage that the length of the trajectory is now continuous. We show empirically its efficiency in terms of runtime and memory usage, and we illustrate its effectiveness on a range of few-shot image classification problems.
arXiv Detail & Related papers (2022-03-02T22:35:58Z)
Faster One-Sample Stochastic Conditional Gradient Method for Composite Convex Minimization [61.26619639722804]
We propose a conditional gradient method (CGM) for minimizing convex finite-sum objectives formed as a sum of smooth and non-smooth terms. The proposed method, equipped with an average gradient (SAG) estimator, requires only one sample per iteration. Nevertheless, it guarantees fast convergence rates on par with more sophisticated variance reduction techniques.
arXiv Detail & Related papers (2022-02-26T19:10:48Z)
Spatio-Temporal Variational Gaussian Processes [26.60276485130467]
We introduce a scalable approach to Gaussian process inference that combinestemporal-temporal filtering with natural variational inference. We derive a sparse approximation that constructs a state-space model over a reduced set of inducing points. We show that for separable Markov kernels the full sparse cases recover exactly the standard variational GP.
arXiv Detail & Related papers (2021-11-02T16:53:31Z)
Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and Beyond [63.59034509960994]
We study shuffling-based variants: minibatch and local Random Reshuffling, which draw gradients without replacement. For smooth functions satisfying the Polyak-Lojasiewicz condition, we obtain convergence bounds which show that these shuffling-based variants converge faster than their with-replacement counterparts. We propose an algorithmic modification called synchronized shuffling that leads to convergence rates faster than our lower bounds in near-homogeneous settings.
arXiv Detail & Related papers (2021-10-20T02:25:25Z)
Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation. Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective. We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z)
On the Convergence of Stochastic Extragradient for Bilinear Games with Restarted Iteration Averaging [96.13485146617322]
We present an analysis of the ExtraGradient (SEG) method with constant step size, and present variations of the method that yield favorable convergence. We prove that when augmented with averaging, SEG provably converges to the Nash equilibrium, and such a rate is provably accelerated by incorporating a scheduled restarting procedure.
arXiv Detail & Related papers (2021-06-30T17:51:36Z)
Conditional gradient methods for stochastically constrained convex minimization [54.53786593679331]
We propose two novel conditional gradient-based methods for solving structured convex optimization problems. The most important feature of our framework is that only a subset of the constraints is processed at each iteration. Our algorithms rely on variance reduction and smoothing used in conjunction with conditional gradient steps, and are accompanied by rigorous convergence guarantees.
arXiv Detail & Related papers (2020-07-07T21:26:35Z)
Adaptive Gradient Methods Converge Faster with Over-Parameterization (but you should do a line-search) [32.24244211281863]
We study a simplistic setting -- smooth, convex losses with models over- parameterized enough to interpolate the data. We prove that AMSGrad with constant step-size and momentum converges to the minimizer at a faster $O(1/T)$ rate. We show that these techniques improve the convergence and generalization of adaptive gradient methods across tasks.
arXiv Detail & Related papers (2020-06-11T21:23:30Z)
Stochastic gradient algorithms from ODE splitting perspective [0.0]
We present a different view on optimization, which goes back to the splitting schemes for approximate solutions of ODE. In this work, we provide a connection between descent approach and gradient first-order splitting scheme for ODE. We consider the special case of splitting, which is inspired by machine learning applications and derive a new upper bound on the global splitting error for it.
arXiv Detail & Related papers (2020-04-19T22:45:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.