Adaptive Methods through the Lens of SDEs: Theoretical Insights on the Role of Noise
- URL: http://arxiv.org/abs/2411.15958v1
- Date: Sun, 24 Nov 2024 19:07:31 GMT
- Title: Adaptive Methods through the Lens of SDEs: Theoretical Insights on the Role of Noise
- Authors: Enea Monzio Compagnoni, Tianlin Liu, Rustem Islamov, Frank Norbert Proske, Antonio Orvieto, Aurelien Lucchi
- Abstract summary: This work introduces novel SDEs for commonly used adaptive optimizers: SignSGD, RMSprop(W), and Adam(W).
These SDEs offer a quantitatively accurate description of these optimizers and help illuminate an intricate relationship between adaptivity, gradient noise, and curvature.
We believe our approach can provide valuable insights into best training practices and novel scaling rules.
- Score: 15.535139686653611
- License:
- Abstract: Despite the vast empirical evidence supporting the efficacy of adaptive optimization methods in deep learning, their theoretical understanding is far from complete. This work introduces novel SDEs for commonly used adaptive optimizers: SignSGD, RMSprop(W), and Adam(W). These SDEs offer a quantitatively accurate description of these optimizers and help illuminate an intricate relationship between adaptivity, gradient noise, and curvature. Our novel analysis of SignSGD highlights a noteworthy and precise contrast to SGD in terms of convergence speed, stationary distribution, and robustness to heavy-tail noise. We extend this analysis to AdamW and RMSpropW, for which we observe that the role of noise is much more complex. Crucially, we support our theoretical analysis with experimental evidence by verifying our insights: this includes numerically integrating our SDEs using Euler-Maruyama discretization on various neural network architectures such as MLPs, CNNs, ResNets, and Transformers. Our SDEs accurately track the behavior of the respective optimizers, especially when compared to previous SDEs derived for Adam and RMSprop. We believe our approach can provide valuable insights into best training practices and novel scaling rules.
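The abstract reports numerically integrating the derived SDEs with an Euler-Maruyama discretization. Below is a minimal, self-contained sketch of that discretization on a toy one-dimensional quadratic; the drift and diffusion functions are illustrative placeholders and do not reproduce the paper's exact SDE coefficients for SignSGD, RMSprop(W), or Adam(W).

```python
# Minimal sketch of Euler-Maruyama integration of an optimizer SDE on a 1-D
# quadratic loss f(x) = 0.5 * h * x**2.  The drift and diffusion below are
# illustrative placeholders, not the coefficients derived in the paper.
import numpy as np

rng = np.random.default_rng(0)

h = 2.0        # curvature of the toy quadratic
eta = 0.01     # learning rate, reused as the SDE step size
sigma_g = 0.5  # assumed gradient-noise scale
n_steps = 500

def drift(x):
    # Smoothed negative sign of the gradient h * x (assumption).
    return -np.tanh(h * x / sigma_g)

def diffusion(x):
    # Constant diffusion coefficient (assumption).
    return 0.1

x = 1.0
for _ in range(n_steps):
    dW = rng.normal(scale=np.sqrt(eta))          # Brownian increment over dt = eta
    x = x + drift(x) * eta + diffusion(x) * dW   # Euler-Maruyama update

print(f"final iterate: {x:.4f}")
```

The same loop applies to the higher-dimensional settings mentioned in the abstract (MLPs, CNNs, ResNets, Transformers); only the drift and diffusion callables change.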
Related papers
- Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC.
We increase the consistency and informativeness of the pairwise preference signals through targeted modifications.
We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z) - Adaptive Friction in Deep Learning: Enhancing Optimizers with Sigmoid and Tanh Function [0.0]
We introduce sigSignGrad and tanhSignGrad, two novel optimizers that integrate adaptive friction coefficients.
Our theoretical analysis demonstrates the wide-ranging adjustment capability of the friction coefficient S.
Experiments on CIFAR-10 and Mini-ImageNet using ResNet50 and ViT architectures confirm the superior performance of our proposed optimizers.
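The snippet only names sigSignGrad, tanhSignGrad, and a friction coefficient S; the update below is a hypothetical reading of that description (a sign-based step scaled by a sigmoid-shaped friction factor), not the authors' exact algorithm.

```python
# Hypothetical sketch of a sign-based update scaled by a sigmoid "friction"
# factor, loosely following the description of sigSignGrad; the exact update
# rule and the role of the coefficient S are assumptions.
import numpy as np

def sig_sign_grad_step(param, grad, lr=1e-3, S=1.0):
    friction = 1.0 / (1.0 + np.exp(-S * np.abs(grad)))  # sigmoid friction in (0.5, 1)
    return param - lr * friction * np.sign(grad)
```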
arXiv Detail & Related papers (2024-08-07T03:20:46Z) - Adam with model exponential moving average is effective for nonconvex optimization [45.242009309234305]
We offer a theoretical analysis of two modern optimization techniques for training large and complex models: (i) adaptive optimization algorithms such as Adam, and (ii) the exponential moving average (EMA) model.
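The standard recipe referred to here keeps an exponential moving average of the model weights alongside the optimizer's updates and evaluates the averaged copy; a minimal sketch follows, with the decay value being a typical choice rather than one taken from the paper.

```python
# Minimal sketch of a parameter EMA maintained next to Adam's updates; the
# decay of 0.999 is a common default, not a value prescribed by the paper.
def ema_update(ema_params, params, decay=0.999):
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

# After each optimizer step on `params`:
# ema_params = ema_update(ema_params, params)
```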
arXiv Detail & Related papers (2024-05-28T14:08:04Z) - Variational Inference for SDEs Driven by Fractional Noise [16.434973057669676]
We present a novel variational framework for performing inference in (neural) stochastic differential equations (SDEs) driven by Markov-approximate fractional Brownian motion (fBM).
We propose the use of neural networks to learn the drift, diffusion and control terms within our variational posterior leading to the variational training of neural-SDEs.
arXiv Detail & Related papers (2023-10-19T17:59:21Z) - Bidirectional Looking with A Novel Double Exponential Moving Average to
Adaptive and Non-adaptive Momentum Optimizers [109.52244418498974]
We propose a novel Admeta (A Double exponential Moving average Adaptive and non-adaptive momentum) framework.
We provide two implementations, AdmetaR and AdmetaS, the former based on RAdam and the latter based on SGDM.
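As a rough illustration of the "double exponential moving average" ingredient in the framework's name, here is a hypothetical sketch of an EMA of an EMA of the gradient; how Admeta combines the two averages into an actual parameter update is an assumption not specified by the snippet.

```python
# Hypothetical double EMA of the gradient: a slow EMA tracking a fast EMA.
# How the two averages are turned into a parameter update is not shown here.
def double_ema(m_fast, m_slow, grad, beta_fast=0.9, beta_slow=0.99):
    m_fast = beta_fast * m_fast + (1.0 - beta_fast) * grad
    m_slow = beta_slow * m_slow + (1.0 - beta_slow) * m_fast
    return m_fast, m_slow
```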
arXiv Detail & Related papers (2023-07-02T18:16:06Z) - SING: A Plug-and-Play DNN Learning Technique [25.563053353709627]
We propose SING (StabIlized and Normalized Gradient), a plug-and-play technique that improves the stability and robustness of Adam(W).
SING is straightforward to implement and has minimal computational overhead, requiring only a layer-wise standardization of gradients fed to Adam(W).
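The snippet describes SING as a layer-wise standardization of the gradients before they reach Adam(W); the sketch below implements that step under the assumption that "standardization" means centering and rescaling each layer's gradient to unit standard deviation.

```python
# Sketch of layer-wise gradient standardization in the spirit of SING: each
# layer's gradient is centered and rescaled to unit standard deviation before
# being passed to Adam(W).  The epsilon and exact normalization are assumptions.
import numpy as np

def standardize_gradients(layer_grads, eps=1e-8):
    return [(g - g.mean()) / (g.std() + eps) for g in layer_grads]
```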
arXiv Detail & Related papers (2023-05-25T12:39:45Z) - On the SDEs and Scaling Rules for Adaptive Gradient Algorithms [45.007261870784475]
Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory.
This paper derives the SDE approximations for RMSprop and Adam, giving theoretical guarantees of correctness as well as experimental validation of their applicability.
arXiv Detail & Related papers (2022-05-20T16:39:03Z) - Optimizing Information-theoretical Generalization Bounds via Anisotropic
Noise in SGLD [73.55632827932101]
We optimize the information-theoretical generalization bound by manipulating the noise structure in SGLD.
We prove that, under a constraint guaranteeing low empirical risk, the optimal noise covariance is the square root of the expected gradient covariance.
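To make the covariance statement concrete, here is a toy sketch of an SGLD-style step that injects Gaussian noise whose covariance is the matrix square root of an estimated gradient covariance; the per-example estimation and the step sizes are illustrative assumptions, not the paper's construction.

```python
# Toy sketch of an SGLD-style step with anisotropic noise whose covariance is
# the matrix square root of an estimated gradient covariance; the per-example
# estimation and step sizes are illustrative assumptions.
import numpy as np

def sgld_step(theta, grad, per_example_grads, lr=1e-2, rng=None):
    rng = rng or np.random.default_rng()
    cov = np.cov(np.asarray(per_example_grads), rowvar=False)       # (d, d)
    vals, vecs = np.linalg.eigh(cov)
    sqrt_cov = vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T
    noise = sqrt_cov @ rng.standard_normal(theta.shape)
    return theta - lr * grad + np.sqrt(2.0 * lr) * noise
```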
arXiv Detail & Related papers (2021-10-26T15:02:27Z) - Robust Optimal Transport with Applications in Generative Modeling and
Domain Adaptation [120.69747175899421]
Optimal Transport (OT) distances such as Wasserstein have been used in several areas such as GANs and domain adaptation.
We propose a computationally-efficient dual form of the robust OT optimization that is amenable to modern deep learning applications.
Our approach can train state-of-the-art GAN models on noisy datasets corrupted with outlier distributions.
arXiv Detail & Related papers (2020-10-12T17:13:40Z) - Stochastic-Sign SGD for Federated Learning with Theoretical Guarantees [49.91477656517431]
Quantization-based solvers have been widely adopted in Federated Learning (FL).
However, no existing method enjoys all the desired properties.
We propose an intuitively simple yet theoretically sound method based on SIGNSGD to bridge the gap.
arXiv Detail & Related papers (2020-02-25T15:12:15Z)
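As context for the sign-based approach, here is a sketch of the plain SIGNSGD-with-majority-vote aggregation that the proposed method builds on; the stochastic sign quantizer introduced in the paper itself is not reproduced here.

```python
# Sketch of sign-based gradient aggregation with element-wise majority vote,
# the plain SIGNSGD baseline the snippet's method extends; the paper's
# stochastic sign quantizer is not reproduced here.
import numpy as np

def majority_vote_step(theta, client_grads, lr=1e-2):
    signs = np.sign(np.asarray(client_grads))   # (num_clients, d)
    vote = np.sign(signs.sum(axis=0))           # element-wise majority vote
    return theta - lr * vote
```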
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.