Related papers: A Physics-Inspired Optimizer: Velocity Regularized Adam

A Physics-Inspired Optimizer: Velocity Regularized Adam

URL: http://arxiv.org/abs/2505.13196v1
Date: Mon, 19 May 2025 14:51:40 GMT
Title: A Physics-Inspired Optimizer: Velocity Regularized Adam
Authors: Pranav Vaidhyanathan, Lucas Schorling, Natalia Ares, Michael A. Osborne,
Abstract summary: We introduce Velocity-Regularized Adam (VRAdam), a physics-inspired for training deep neural networks that draws on ideas from quartic terms for kinetic energy with its stabilizing effects on system dynamics.<n>We benchmark various tasks such as image classification, language modeling, image generation and generative modeling using diverse architectures and training methodologies including CNNs, Transformers, and GFlowNets.
Score: 16.047084318753377
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: We introduce Velocity-Regularized Adam (VRAdam), a physics-inspired optimizer for training deep neural networks that draws on ideas from quartic terms for kinetic energy with its stabilizing effects on various system dynamics. Previous algorithms, including the ubiquitous Adam, operate at the so called adaptive edge of stability regime during training leading to rapid oscillations and slowed convergence of loss. However, VRAdam adds a higher order penalty on the learning rate based on the velocity such that the algorithm automatically slows down whenever weight updates become large. In practice, we observe that the effective dynamic learning rate shrinks in high-velocity regimes, damping oscillations and allowing for a more aggressive base step size when necessary without divergence. By combining this velocity-based regularizer for global damping with per-parameter scaling of Adam to create a hybrid optimizer, we demonstrate that VRAdam consistently exceeds the performance against standard optimizers including AdamW. We benchmark various tasks such as image classification, language modeling, image generation and generative modeling using diverse architectures and training methodologies including Convolutional Neural Networks (CNNs), Transformers, and GFlowNets.

Related papers

Langevin Flows for Modeling Neural Latent Dynamics [81.81271685018284]
We introduce LangevinFlow, a sequential Variational Auto-Encoder where the time evolution of latent variables is governed by the underdamped Langevin equation.<n>Our approach incorporates physical priors -- such as inertia, damping, a learned potential function, and forces -- to represent both autonomous and non-autonomous processes in neural systems.<n>Our method outperforms state-of-the-art baselines on synthetic neural populations generated by a Lorenz attractor.
arXiv Detail & Related papers (2025-07-15T17:57:48Z)
MesaNet: Sequence Modeling by Locally Optimal Test-Time Training [67.45211108321203]
We introduce a numerically stable, chunkwise parallelizable version of the recently proposed Mesa layer.<n>We show that optimal test-time training enables reaching lower language modeling perplexity and higher downstream benchmark performance than previous RNNs.
arXiv Detail & Related papers (2025-06-05T16:50:23Z)
MARS: Unleashing the Power of Variance Reduction for Training Large Models [56.47014540413659]
We propose a unified training framework for deep neural networks.<n>We introduce three instances of MARS that leverage preconditioned gradient optimization.<n>Results indicate that the implementation of MARS consistently outperforms Adam.
arXiv Detail & Related papers (2024-11-15T18:57:39Z)
Auto-Train-Once: Controller Network Guided Automatic Network Pruning from Scratch [72.26822499434446]
Auto-Train-Once (ATO) is an innovative network pruning algorithm designed to automatically reduce the computational and storage costs of DNNs. We provide a comprehensive convergence analysis as well as extensive experiments, and the results show that our approach achieves state-of-the-art performance across various model architectures.
arXiv Detail & Related papers (2024-03-21T02:33:37Z)
Read the Signs: Towards Invariance to Gradient Descent's Hyperparameter Initialization [3.1153758106426603]
We propose ActiveLR, an optimization meta algorithm that localizes the learning rate, $alpha$, and adapts them at each epoch according to whether the gradient at each epoch changes sign or not. We implement the Active version (ours) of widely used and recently published gradient descents, namely SGD with momentum, AdamW, RAdam, and AdaBelief.
arXiv Detail & Related papers (2023-01-24T16:57:00Z)
The Underlying Correlated Dynamics in Neural Training [6.385006149689549]
Training of neural networks is a computationally intensive task. We propose a model based on the correlation of the parameters' dynamics, which dramatically reduces the dimensionality. This representation enhances the understanding of the underlying training dynamics and can pave the way for designing better acceleration techniques.
arXiv Detail & Related papers (2022-12-18T08:34:11Z)
Flatter, faster: scaling momentum for optimal speedup of SGD [0.0]
We study training dynamics arising from interplay between gradient descent (SGD) and label noise and momentum in the training of neural networks. We find that scaling the momentum hyper parameter $1-NISTbeta$ with the learning rate to the power of $2/3$ maximally accelerates training, without sacrificing generalization.
arXiv Detail & Related papers (2022-10-28T20:41:48Z)
Dynamics-aware Adversarial Attack of Adaptive Neural Networks [75.50214601278455]
We investigate the dynamics-aware adversarial attack problem of adaptive neural networks. We propose a Leaded Gradient Method (LGM) and show the significant effects of the lagged gradient. Our LGM achieves impressive adversarial attack performance compared with the dynamic-unaware attack methods.
arXiv Detail & Related papers (2022-10-15T01:32:08Z)
Joint inference and input optimization in equilibrium networks [68.63726855991052]
deep equilibrium model is a class of models that foregoes traditional network depth and instead computes the output of a network by finding the fixed point of a single nonlinear layer. We show that there is a natural synergy between these two settings. We demonstrate this strategy on various tasks such as training generative models while optimizing over latent codes, training models for inverse problems like denoising and inpainting, adversarial training and gradient based meta-learning.
arXiv Detail & Related papers (2021-11-25T19:59:33Z)
Training Deep Neural Networks with Adaptive Momentum Inspired by the Quadratic Optimization [20.782428252187024]
We propose a new adaptive momentum inspired by the optimal choice of the heavy ball momentum for optimization. Our proposed adaptive heavy ball momentum can improve gradient descent (SGD) and Adam. We verify the efficiency of SGD and Adam with the new adaptive momentum on extensive machine learning benchmarks, including image classification, language modeling, and machine translation.
arXiv Detail & Related papers (2021-10-18T07:03:48Z)
How to train your neural ODE: the world of Jacobian and kinetic regularization [7.83405844354125]
Training neural ODEs on large datasets has not been tractable due to the necessity of allowing the adaptive numerical ODE solver to refine its step size to very small values. We introduce a theoretically-grounded combination of both optimal transport and stability regularizations which encourage neural ODEs to prefer simpler dynamics out of all the dynamics that solve a problem well.
arXiv Detail & Related papers (2020-02-07T14:15:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.