Exact Risk Curves of signSGD in High-Dimensions: Quantifying Preconditioning and Noise-Compression Effects
- URL: http://arxiv.org/abs/2411.12135v2
- Date: Fri, 21 Feb 2025 17:38:07 GMT
- Title: Exact Risk Curves of signSGD in High-Dimensions: Quantifying Preconditioning and Noise-Compression Effects
- Authors: Ke Liang Xiao, Noah Marshall, Atish Agarwala, Elliot Paquette
- Abstract summary: We present an analysis of signSGD in a high-dimensional limit. We quantify four effects of signSGD: effective learning rate, noise compression, diagonal preconditioning, and gradient noise reshaping. We conclude with a conjecture on how these results might be extended to Adam.
- Score: 6.653325043862049
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, signSGD has garnered interest both as a practical optimizer and as a simple model for understanding adaptive optimizers like Adam. Though there is a general consensus that signSGD acts to precondition optimization and reshapes noise, quantitatively understanding these effects in theoretically solvable settings remains difficult. We present an analysis of signSGD in a high-dimensional limit, and derive a limiting SDE and ODE to describe the risk. Using this framework we quantify four effects of signSGD: effective learning rate, noise compression, diagonal preconditioning, and gradient noise reshaping. Our analysis is consistent with experimental observations but moves beyond that by quantifying the dependence of these effects on the data and noise distributions. We conclude with a conjecture on how these results might be extended to Adam.
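For readers unfamiliar with the update rule the abstract refers to, the following is a minimal sketch of one-pass signSGD on a noisy linear regression problem. It is not the authors' code: the function name `sign_sgd_linear_regression`, the isotropic Gaussian data, the `lr / d` step-size scaling, and all parameter values are assumptions chosen only for illustration.

```python
import numpy as np


def sign_sgd_linear_regression(d=200, steps=4000, lr=0.2, noise_std=0.1, seed=0):
    """One-pass signSGD on least squares; returns the population risk per step.

    Assumptions (illustrative, not taken from the paper): isotropic Gaussian
    features, Gaussian label noise, and a step size scaled as lr / d.
    """
    rng = np.random.default_rng(seed)
    theta_star = rng.standard_normal(d) / np.sqrt(d)  # hypothetical ground truth
    theta = np.zeros(d)
    risks = np.empty(steps)
    for t in range(steps):
        # Draw one fresh sample per step (one-pass / streaming setting).
        x = rng.standard_normal(d)
        y = x @ theta_star + noise_std * rng.standard_normal()
        # Stochastic gradient of the squared loss 0.5 * (x @ theta - y) ** 2.
        grad = (x @ theta - y) * x
        # signSGD keeps only the coordinate-wise sign of the gradient.
        theta -= (lr / d) * np.sign(grad)
        # With identity feature covariance the population risk has a closed form.
        risks[t] = 0.5 * (np.sum((theta - theta_star) ** 2) + noise_std ** 2)
    return risks


risks = sign_sgd_linear_regression()
print(f"risk after first step: {risks[0]:.4f}, after last step: {risks[-1]:.4f}")
```

Because every coordinate moves by the same magnitude regardless of the gradient size, the risk in this sketch settles at a floor set by the step size rather than decaying to the label-noise level.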
Related papers
- Scaling Laws of SignSGD in Linear Regression: When Does It Outperform SGD? [35.79321975718977]
We study scaling laws of signSGD under a power-law random features (PLRF) model. We analyze the population risk of a linear model trained with one-pass signSGD on Gaussian-sketched features.
arXiv Detail & Related papers (2026-03-02T16:58:02Z) - Learn Beneficial Noise as Graph Augmentation [54.44813218411879]
We propose PiNGDA, which builds on positive-incentive noise (pi-noise), a framework that analyzes the beneficial effect of noise using information theory. We prove that standard GCL with pre-defined augmentations is equivalent to estimating the beneficial noise via point estimation. Since the generator learns how to produce beneficial perturbations on graph topology and node attributes, PiNGDA is more reliable than existing methods.
arXiv Detail & Related papers (2025-05-25T08:20:34Z) - Noise Augmented Fine Tuning for Mitigating Hallucinations in Large Language Models [1.0579965347526206]
Large language models (LLMs) often produce inaccurate or misleading content, known as hallucinations.
Noise-Augmented Fine-Tuning (NoiseFiT) is a novel framework that leverages adaptive noise injection to enhance model robustness.
NoiseFiT selectively perturbs layers identified as either high-SNR (more robust) or low-SNR (potentially under-regularized) using a dynamically scaled Gaussian noise.
arXiv Detail & Related papers (2025-04-04T09:27:19Z) - Algorithmic Stability of Stochastic Gradient Descent with Momentum under Heavy-Tailed Noise [20.922456964393213]
We establish generalization bounds for SGD with momentum (SGDm) under heavy-tailed noise.
For quadratic loss functions, we show that SGDm admits a worse generalization bound in the presence of momentum and heavy tails.
We develop a uniform-in-time discretization error bound, which to our knowledge, is the first result of its kind for SDEs with degenerate noise.
arXiv Detail & Related papers (2025-02-02T19:25:48Z) - Adaptive Methods through the Lens of SDEs: Theoretical Insights on the Role of Noise [15.535139686653611]
This work introduces novel SDEs for commonly used adaptive optimizers: SignSGD, RMSprop(W), and Adam(W).
These SDEs offer a quantitatively accurate description of these optimizers and help illuminate an intricate relationship between adaptivity, gradient noise, and curvature.
We believe our approach can provide valuable insights into best training practices and novel scaling rules.
arXiv Detail & Related papers (2024-11-24T19:07:31Z) - Iso-Diffusion: Improving Diffusion Probabilistic Models Using the Isotropy of the Additive Gaussian Noise [0.0]
We show how to use the isotropy of the additive noise as a constraint on the objective function to enhance the fidelity of DDPMs.
Our approach is simple and can be applied to any DDPM variant.
arXiv Detail & Related papers (2024-03-25T14:05:52Z) - Robust Estimation of Causal Heteroscedastic Noise Models [7.568978862189266]
Student's $t$-distribution is known for its robustness in accounting for sampling variability with smaller sample sizes and extreme values without significantly altering the overall distribution shape.
Our empirical evaluations demonstrate that our estimators are more robust and achieve better overall performance across synthetic and real benchmarks.
arXiv Detail & Related papers (2023-12-15T02:26:35Z) - Butterfly Effects of SGD Noise: Error Amplification in Behavior Cloning and Autoregression [70.78523583702209]
We study training instabilities of behavior cloning with deep neural networks.
We observe that minibatch SGD updates to the policy network during training result in sharp oscillations in long-horizon rewards.
arXiv Detail & Related papers (2023-10-17T17:39:40Z) - Correlated Noise in Epoch-Based Stochastic Gradient Descent: Implications for Weight Variances [0.0]
Stochastic gradient descent (SGD) has become a cornerstone of neural network optimization.
We investigate the effects of epoch-based noise correlations on the stationary distribution of discrete-time SGD with momentum.
arXiv Detail & Related papers (2023-06-08T15:45:57Z) - Partial Identification with Noisy Covariates: A Robust Optimization Approach [94.10051154390237]
Causal inference from observational datasets often relies on measuring and adjusting for covariates.
We show that this robust optimization approach can extend a wide range of causal adjustment methods to perform partial identification.
Across synthetic and real datasets, we find that this approach provides ATE bounds with a higher coverage probability than existing methods.
arXiv Detail & Related papers (2022-02-22T04:24:26Z) - Optimizing Information-theoretical Generalization Bounds via Anisotropic Noise in SGLD [73.55632827932101]
We optimize the information-theoretical generalization bound by manipulating the noise structure in SGLD.
We prove that, under a constraint that guarantees low empirical risk, the optimal noise covariance is the square root of the expected gradient covariance.
arXiv Detail & Related papers (2021-10-26T15:02:27Z) - Asymmetric Heavy Tails and Implicit Bias in Gaussian Noise Injections [73.95786440318369]
We focus on the so-called 'implicit effect' of GNIs, which is the effect of the injected noise on the dynamics of stochastic gradient descent (SGD).
We show that this effect induces an asymmetric heavy-tailed noise on gradient updates.
We then formally prove that GNIs induce an 'implicit bias', which varies depending on the heaviness of the tails and the level of asymmetry.
arXiv Detail & Related papers (2021-02-13T21:28:09Z) - On Dynamic Noise Influence in Differentially Private Learning [102.6791870228147]
Private Gradient Descent (PGD) is a commonly used private learning framework, which adds noise to gradients based on the differential privacy protocol.
Recent studies show that dynamic privacy schedules can improve the loss at the final iteration, yet theoretical understanding of the effectiveness of such schedules remains limited.
This paper provides comprehensive analysis of noise influence in dynamic privacy schedules to answer these critical questions.
arXiv Detail & Related papers (2021-01-19T02:04:00Z) - Shape Matters: Understanding the Implicit Bias of the Noise Covariance [76.54300276636982]
Noise in gradient descent provides a crucial implicit regularization effect for training overparameterized models.
We show that parameter-dependent noise -- induced by mini-batches or label perturbation -- is far more effective than Gaussian noise.
Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not.
arXiv Detail & Related papers (2020-06-15T18:31:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.