Related papers: Dimension-adapted Momentum Outscales SGD

URL: http://arxiv.org/abs/2505.16098v1
Date: Thu, 22 May 2025 00:58:50 GMT
Title: Dimension-adapted Momentum Outscales SGD
Authors: Damien Ferbach, Katie Everett, Gauthier Gidel, Elliot Paquette, Courtney Paquette
Abstract summary: We investigate scaling laws for momentum algorithms with small batch on the power law random features model. When trained with a momentum algorithm, our analysis reveals four distinct loss curve shapes determined by varying data-target complexities. While traditional stochastic gradient descent with momentum (SGD-M) yields identical scaling law exponents to SGD, dimension-adapted Nesterov acceleration (DANA) improves these exponents.
Score: 22.487084876365213
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We investigate scaling laws for stochastic momentum algorithms with small batch on the power law random features model, parameterized by data complexity, target complexity, and model size. When trained with a stochastic momentum algorithm, our analysis reveals four distinct loss curve shapes determined by varying data-target complexities. While traditional stochastic gradient descent with momentum (SGD-M) yields identical scaling law exponents to SGD, dimension-adapted Nesterov acceleration (DANA) improves these exponents by scaling momentum hyperparameters based on model size and data complexity. This outscaling phenomenon, which also improves compute-optimal scaling behavior, is achieved by DANA across a broad range of data and target complexities, while traditional methods fall short. Extensive experiments on high-dimensional synthetic quadratics validate our theoretical predictions and large-scale text experiments with LSTMs show DANA's improved loss exponents over SGD hold in a practical setting.

Related papers

Multi-Dimensional Visual Data Recovery: Scale-Aware Tensor Modeling and Accelerated Randomized Computation [51.65236537605077]
We propose a new type of network compression optimization technique, fully randomized tensor network compression (FCTN). FCTN has significant advantages in correlation characterization and transpositional in algebra, and has notable achievements in multi-dimensional data processing and analysis. We derive efficient algorithms with guarantees to solve the formulated models.
arXiv Detail & Related papers (2026-02-13T14:56:37Z)
When does Gaussian equivalence fail and how to fix it: Non-universal behavior of random features with quadratic scaling [15.148577493784051]
Gaussian equivalence theory (GET) states that the behavior of high-dimensional, complex features can be captured by Gaussian surrogates. But numerical experiments show that this equivalence can fail even for simple embeddings under general scaling regimes. We introduce a Conditional Gaussian Equivalent (CGE) model, which can be viewed as appending a low-dimensional non-Gaussian component to an otherwise high-dimensional Gaussian model.
arXiv Detail & Related papers (2025-12-03T00:23:12Z)
Towards Scaling Laws for Symbolic Regression [45.609070591068836]
Symbolic regression (SR) aims to discover the underlying mathematical expressions that explain observed data. Deep learning-based SR has recently become competitive with genetic programming approaches. We present the first systematic investigation of scaling in SR, using a scalable end-to-end transformer pipeline.
arXiv Detail & Related papers (2025-10-30T01:36:44Z)
Scaling and renormalization in high-dimensional regression [72.59731158970894]
This paper presents a succinct derivation of the training and generalization performance of a variety of high-dimensional ridge regression models. We provide an introduction and review of recent results on these topics, aimed at readers with backgrounds in physics and deep learning.
arXiv Detail & Related papers (2024-05-01T15:59:00Z)
Online Variational Sequential Monte Carlo [49.97673761305336]
We build upon the variational sequential Monte Carlo (VSMC) method, which provides computationally efficient and accurate model parameter estimation and Bayesian latent-state inference. Online VSMC is capable of performing efficiently, entirely on-the-fly, both parameter estimation and particle proposal adaptation.
arXiv Detail & Related papers (2023-12-19T21:45:38Z)
Hitting the High-Dimensional Notes: An ODE for SGD learning dynamics on GLMs and multi-index models [10.781866671930857]
We analyze the dynamics of streaming stochastic gradient descent (SGD) in the high-dimensional limit. We demonstrate a deterministic equivalent of SGD in the form of a system of ordinary differential equations. In addition to the deterministic equivalent, we introduce an SDE with a simplified diffusion coefficient.
arXiv Detail & Related papers (2023-08-17T13:33:02Z)
A Momentum-Incorporated Non-Negative Latent Factorization of Tensors Model for Dynamic Network Representation [0.0]
A large-scale dynamic network (LDN) is a source of data in many big data-related applications. A latent factorization of tensors (LFT) model efficiently extracts this time pattern. LFT models based on stochastic gradient descent (SGD) solvers are often limited by training schemes and have poor tail convergence. This paper proposes a novel nonlinear LFT model (MNNL) based on momentum-incorporated SGD to make training unconstrained and compatible with general training schemes.
arXiv Detail & Related papers (2023-05-04T12:30:53Z)
NAG-GS: Semi-Implicit, Accelerated and Robust Stochastic Optimizer [45.47667026025716]
We propose a novel, robust and accelerated iteration that relies on two key elements. The convergence and stability of the obtained method, referred to as NAG-GS, are first studied extensively. We show that NAG-GS is competitive with state-of-the-art methods such as momentum SGD with weight decay and AdamW for the training of machine learning models.
arXiv Detail & Related papers (2022-09-29T16:54:53Z)
On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD). We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD settings. We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z)
Dynamics of Stochastic Momentum Methods on Large-scale, Quadratic Models [0.2741266294612776]
We analyze a class of gradient algorithms with momentum on a high-dimensional random least squares problem. We show that (small-batch) momentum with a fixed momentum parameter provides no actual performance improvement over SGD when step sizes are adjusted correctly. In the non-strongly convex setting, it is possible to get a large improvement over SGD using momentum.
arXiv Detail & Related papers (2021-06-07T15:08:24Z)
Post-mortem on a deep learning contest: a Simpson's paradox and the complementary roles of scale metrics versus shape metrics [61.49826776409194]
We analyze a corpus of models made publicly-available for a contest to predict the generalization accuracy of neural network (NN) models. We identify what amounts to a Simpson's paradox: where "scale" metrics perform well overall but perform poorly on sub partitions of the data. We present two novel shape metrics, one data-independent, and the other data-dependent, which can predict trends in the test accuracy of a series of NNs.
arXiv Detail & Related papers (2021-06-01T19:19:49Z)
SGD in the Large: Average-case Analysis, Asymptotics, and Stepsize Criticality [15.640534097470923]
We propose a new framework for analyzing the dynamics of stochastic gradient descent (SGD) when both the number of samples and the number of dimensions are large. Using this new framework, we show that the dynamics of SGD on a least squares problem with random data become deterministic in the large sample and dimensional limit.
arXiv Detail & Related papers (2021-02-08T18:00:13Z)
Multiplicative noise and heavy tails in stochastic optimization [62.993432503309485]
Empirical optimization is central to modern machine learning, but its role in its success is still unclear. We show that it commonly arises in parameters of discrete multiplicative noise due to variance. A detailed analysis is conducted in which we describe on key factors, including recent step size, and data, all exhibit similar results on state-of-the-art neural network models.
arXiv Detail & Related papers (2020-06-11T09:58:01Z)
The Heavy-Tail Phenomenon in SGD [7.366405857677226]
We show that, depending on the structure of the Hessian of the loss at the minimum, the SGD iterates will converge to a heavy-tailed stationary distribution. We translate our results into insights about the behavior of SGD in deep learning.
arXiv Detail & Related papers (2020-06-08T16:43:56Z)
On the Generalization of Stochastic Gradient Descent with Momentum [84.54924994010703]
Momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models. We first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded. For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes.
arXiv Detail & Related papers (2018-09-12T17:02:08Z)

This list is automatically generated from the titles and abstracts of the papers on this site.