Related papers: Scaling Laws of SignSGD in Linear Regression: When Does It Outperform SGD?

URL: http://arxiv.org/abs/2603.02069v1
Date: Mon, 02 Mar 2026 16:58:02 GMT
Title: Scaling Laws of SignSGD in Linear Regression: When Does It Outperform SGD?
Authors: Jihwan Kim, Dogyoon Song, Chulhee Yun
Abstract summary: We study scaling laws of signSGD under a power-law random features (PLRF) model. We analyze the population risk of a linear model trained with one-pass signSGD on Gaussian-sketched features.
Score: 35.79321975718977
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We study scaling laws of signSGD under a power-law random features (PLRF) model that accounts for both feature and target decay. We analyze the population risk of a linear model trained with one-pass signSGD on Gaussian-sketched features. We express the risk as a function of model size, training steps, learning rate, and the feature and target decay parameters. Comparing against the SGD risk analyzed by Paquette et al. (2024), we identify a drift-normalization effect and a noise-reshaping effect unique to signSGD. We then obtain compute-optimal scaling laws under the optimal choice of learning rate. Our analysis shows that the noise-reshaping effect can make the compute-optimal slope of signSGD steeper than that of SGD in regimes where noise is dominant. Finally, we observe that the widely used warmup-stable-decay (WSD) schedule further reduces the noise term and sharpens the compute-optimal slope, when feature decay is fast but target decay is slow.
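The setup described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's code: the parameterization (feature scales j^(-alpha), target coefficients j^(-beta), sketch entries with variance 1/d), the function names, and all default values are assumptions chosen for the example.

```python
import numpy as np

def make_plrf(v, d, alpha, beta, rng):
    """Hypothetical PLRF instance (assumed parameterization, not from the paper)."""
    j = np.arange(1, v + 1, dtype=float)
    scales = j ** (-alpha)                         # feature decay
    b = j ** (-beta)                               # target decay
    W = rng.standard_normal((v, d)) / np.sqrt(d)   # Gaussian sketch
    return scales, b, W

def population_risk(theta, scales, b, W):
    """0.5 * E[(x @ W @ theta - x @ b)^2] for x ~ N(0, diag(scales^2))."""
    r = W @ theta - b
    return 0.5 * np.sum((scales * r) ** 2)

def run_one_pass(v=200, d=50, alpha=1.0, beta=0.5, steps=2000,
                 lr=1e-3, use_sign=True, seed=0):
    """One-pass training: every step draws a fresh sample.

    use_sign=True applies the signSGD update theta -= lr * sign(grad);
    use_sign=False applies plain SGD for comparison.
    """
    rng = np.random.default_rng(seed)
    scales, b, W = make_plrf(v, d, alpha, beta, rng)
    theta = np.zeros(d)
    risks = [population_risk(theta, scales, b, W)]
    for _ in range(steps):
        x = scales * rng.standard_normal(v)   # fresh latent sample
        feat = W.T @ x                        # sketched features, shape (d,)
        resid = feat @ theta - x @ b          # prediction error on this sample
        grad = resid * feat                   # gradient of 0.5 * resid^2
        theta -= lr * (np.sign(grad) if use_sign else grad)
        risks.append(population_risk(theta, scales, b, W))
    return np.array(risks)
```

Calling run_one_pass(use_sign=True) and run_one_pass(use_sign=False) with the same seed gives two population-risk curves that can be plotted against the step count; the learning rates of the two methods are not comparable one-to-one, so each would need to be tuned separately.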

Related papers

Robust Stochastic Gradient Posterior Sampling with Lattice Based Discretisation [20.44428092865608]
MCMC methods enable scalable posterior sampling but often suffer from sensitivity to minibatch size and gradient noise. We propose Stochastic Gradient Lattice Random Walk (SGLRW), an extension of the Lattice Random Walk discretization.
arXiv Detail & Related papers (2026-02-17T18:09:49Z)
Learning Curves of Stochastic Gradient Descent in Kernel Regression [7.063108005500741]
We analyze single-pass stochastic gradient descent (SGD) in kernel regression under a source condition. Surprisingly, we show that SGD achieves min-max optimal rates up to constants among all the scales. The main reason SGD overcomes the curse of saturation is the exponentially decaying step size schedule.
arXiv Detail & Related papers (2025-05-28T07:16:11Z)
Exact Risk Curves of signSGD in High-Dimensions: Quantifying Preconditioning and Noise-Compression Effects [6.653325043862049]
We present an analysis of signSGD in a high-dimensional limit. We quantify four effects of signSGD: effective learning rate, noise compression, diagonal preconditioning, and gradient noise reshaping. We conclude with a conjecture on how these results might be extended to Adam.
arXiv Detail & Related papers (2024-11-19T00:24:50Z)
Distributed Stochastic Gradient Descent with Staleness: A Stochastic Delay Differential Equation Based Framework [56.82432591933544]
Distributed stochastic gradient descent (SGD) has attracted considerable recent attention due to its potential for scaling computational resources, reducing training time, and helping protect user privacy in machine learning. This paper characterizes the run time and staleness of distributed SGD based on stochastic delay differential equations (SDDEs) and the approximation of gradient arrivals. Interestingly, it is shown that increasing the number of activated workers does not necessarily accelerate distributed SGD, due to staleness.
arXiv Detail & Related papers (2024-06-17T02:56:55Z)
Risk-Sensitive Diffusion: Robustly Optimizing Diffusion Models with Noisy Samples [58.68233326265417]
Non-image data are prevalent in real applications and tend to be noisy. Risk-sensitive SDE is a type of stochastic differential equation (SDE) parameterized by the risk vector. We conduct systematic studies for both Gaussian and non-Gaussian noise distributions.
arXiv Detail & Related papers (2024-02-03T08:41:51Z)
Butterfly Effects of SGD Noise: Error Amplification in Behavior Cloning and Autoregression [70.78523583702209]
We study training instabilities of behavior cloning with deep neural networks. We observe that minibatch SGD updates to the policy network during training result in sharp oscillations in long-horizon rewards.
arXiv Detail & Related papers (2023-10-17T17:39:40Z)
On Convergence of Incremental Gradient for Non-Convex Smooth Functions [63.51187646914962]
In machine learning and network optimization, algorithms like shuffle SGD are popular because they minimize the number of cache misses. This paper delves into the convergence properties of SGD algorithms with arbitrary data ordering.
arXiv Detail & Related papers (2023-05-30T17:47:27Z)
Doubly Stochastic Models: Learning with Unbiased Label Noises and Inference Stability [85.1044381834036]
We investigate the implicit regularization effects of label noise under mini-batch sampling settings of gradient descent. We find that such an implicit regularizer favors convergence points that stabilize model outputs against perturbation of parameters. Our work does not assume SGD to be an Ornstein-Uhlenbeck-like process and achieves a more general result, with convergence of the approximation proved.
arXiv Detail & Related papers (2023-04-01T14:09:07Z)
Robustness to Unbounded Smoothness of Generalized SignSGD [25.07411035728305]
We show that momentum plays a critical role in analyzing SignSGD-type and Adam-type algorithms. We compare these algorithms on popular tasks, observing that we can match the performance of Adam while beating the others.
arXiv Detail & Related papers (2022-08-23T21:11:19Z)
Last Iterate Risk Bounds of SGD with Decaying Stepsize for Overparameterized Linear Regression [122.70478935214128]
Stochastic gradient descent (SGD) has been demonstrated to generalize well in many deep learning applications. This paper provides a problem-dependent analysis of the last iterate risk bounds of SGD with decaying stepsize.
arXiv Detail & Related papers (2021-10-12T17:49:54Z)
Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime. We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)

This list is automatically generated from the titles and abstracts of the papers on this site.