A Physics-Inspired Optimizer: Velocity Regularized Adam
- URL: http://arxiv.org/abs/2505.13196v1
- Date: Mon, 19 May 2025 14:51:40 GMT
- Title: A Physics-Inspired Optimizer: Velocity Regularized Adam
- Authors: Pranav Vaidhyanathan, Lucas Schorling, Natalia Ares, Michael A. Osborne,
- Abstract summary: We introduce Velocity-Regularized Adam (VRAdam), a physics-inspired for training deep neural networks that draws on ideas from quartic terms for kinetic energy with its stabilizing effects on system dynamics.<n>We benchmark various tasks such as image classification, language modeling, image generation and generative modeling using diverse architectures and training methodologies including CNNs, Transformers, and GFlowNets.
- Score: 16.047084318753377
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We introduce Velocity-Regularized Adam (VRAdam), a physics-inspired optimizer for training deep neural networks that draws on ideas from quartic terms for kinetic energy with its stabilizing effects on various system dynamics. Previous algorithms, including the ubiquitous Adam, operate at the so called adaptive edge of stability regime during training leading to rapid oscillations and slowed convergence of loss. However, VRAdam adds a higher order penalty on the learning rate based on the velocity such that the algorithm automatically slows down whenever weight updates become large. In practice, we observe that the effective dynamic learning rate shrinks in high-velocity regimes, damping oscillations and allowing for a more aggressive base step size when necessary without divergence. By combining this velocity-based regularizer for global damping with per-parameter scaling of Adam to create a hybrid optimizer, we demonstrate that VRAdam consistently exceeds the performance against standard optimizers including AdamW. We benchmark various tasks such as image classification, language modeling, image generation and generative modeling using diverse architectures and training methodologies including Convolutional Neural Networks (CNNs), Transformers, and GFlowNets.
Related papers
- Adaptive Surrogate-Based Strategy for Accelerating Convergence Speed when Solving Expensive Unconstrained Multi-Objective Optimisation Problems [41.99844472131922]
We propose an adaptive surrogate modelling approach designed to accelerate the early-stage convergence speed of state-of-the-art MOEAs.<n>This is important because it ensures that a solver can identify optimal or near-optimal solutions with relatively few fitness function evaluations.<n>Our approach was tested on 31 widely known benchmark problems and a real-world North Sea fish abundance modelling case study.
arXiv Detail & Related papers (2026-01-29T15:46:52Z) - Implicitly Restarted Lanczos Enables Chemically-Accurate Shallow Neural Quantum States [4.039934762896615]
We introduce the implicitly restarted Lanczos (IRL) method as the core engine for neural quantum states (NQS) training.<n>Our key innovation is an inherently stable second-order optimization framework that recasts the ill-conditioned parameter update problem into a small, well-posed Hermitian eigenvalue problem.<n>We demonstrate that IRL enables shallow NQS to consistently achieve extreme precision (1e-12 kcal/mol) in just 3 to 5 optimization steps.<n>For the F2 molecule, this translates to an approximate 17,900-fold speed-up in total runtime compared to Adam.
arXiv Detail & Related papers (2026-01-04T08:49:24Z) - Gradient Descent as a Perceptron Algorithm: Understanding Dynamics and Implicit Acceleration [67.12978375116599]
We show that the steps of gradient descent (GD) reduce to those of generalized perceptron algorithms.<n>This helps explain the optimization dynamics and the implicit acceleration phenomenon observed in neural networks.
arXiv Detail & Related papers (2025-12-12T14:16:35Z) - Langevin Flows for Modeling Neural Latent Dynamics [81.81271685018284]
We introduce LangevinFlow, a sequential Variational Auto-Encoder where the time evolution of latent variables is governed by the underdamped Langevin equation.<n>Our approach incorporates physical priors -- such as inertia, damping, a learned potential function, and forces -- to represent both autonomous and non-autonomous processes in neural systems.<n>Our method outperforms state-of-the-art baselines on synthetic neural populations generated by a Lorenz attractor.
arXiv Detail & Related papers (2025-07-15T17:57:48Z) - MesaNet: Sequence Modeling by Locally Optimal Test-Time Training [67.45211108321203]
We introduce a numerically stable, chunkwise parallelizable version of the recently proposed Mesa layer.<n>We show that optimal test-time training enables reaching lower language modeling perplexity and higher downstream benchmark performance than previous RNNs.
arXiv Detail & Related papers (2025-06-05T16:50:23Z) - Conformal Symplectic Optimization for Stable Reinforcement Learning [21.491621524500736]
By utilizing relativistic kinetic energy, RAD incorporates from special relativity and limits parameter updates below a finite speed, effectively mitigating abnormal influences.<n> Notably, RAD achieves up to a 155.1% performance improvement, showcasing its efficacy in training Atari games.
arXiv Detail & Related papers (2024-12-03T09:07:31Z) - MARS: Unleashing the Power of Variance Reduction for Training Large Models [56.47014540413659]
We propose a unified training framework for deep neural networks.<n>We introduce three instances of MARS that leverage preconditioned gradient optimization.<n>Results indicate that the implementation of MARS consistently outperforms Adam.
arXiv Detail & Related papers (2024-11-15T18:57:39Z) - Auto-Train-Once: Controller Network Guided Automatic Network Pruning from Scratch [72.26822499434446]
Auto-Train-Once (ATO) is an innovative network pruning algorithm designed to automatically reduce the computational and storage costs of DNNs.
We provide a comprehensive convergence analysis as well as extensive experiments, and the results show that our approach achieves state-of-the-art performance across various model architectures.
arXiv Detail & Related papers (2024-03-21T02:33:37Z) - NeuralStagger: Accelerating Physics-constrained Neural PDE Solver with
Spatial-temporal Decomposition [67.46012350241969]
This paper proposes a general acceleration methodology called NeuralStagger.
It decomposing the original learning tasks into several coarser-resolution subtasks.
We demonstrate the successful application of NeuralStagger on 2D and 3D fluid dynamics simulations.
arXiv Detail & Related papers (2023-02-20T19:36:52Z) - Read the Signs: Towards Invariance to Gradient Descent's Hyperparameter
Initialization [3.1153758106426603]
We propose ActiveLR, an optimization meta algorithm that localizes the learning rate, $alpha$, and adapts them at each epoch according to whether the gradient at each epoch changes sign or not.
We implement the Active version (ours) of widely used and recently published gradient descents, namely SGD with momentum, AdamW, RAdam, and AdaBelief.
arXiv Detail & Related papers (2023-01-24T16:57:00Z) - The Underlying Correlated Dynamics in Neural Training [6.385006149689549]
Training of neural networks is a computationally intensive task.
We propose a model based on the correlation of the parameters' dynamics, which dramatically reduces the dimensionality.
This representation enhances the understanding of the underlying training dynamics and can pave the way for designing better acceleration techniques.
arXiv Detail & Related papers (2022-12-18T08:34:11Z) - Flatter, faster: scaling momentum for optimal speedup of SGD [0.0]
We study training dynamics arising from interplay between gradient descent (SGD) and label noise and momentum in the training of neural networks.
We find that scaling the momentum hyper parameter $1-NISTbeta$ with the learning rate to the power of $2/3$ maximally accelerates training, without sacrificing generalization.
arXiv Detail & Related papers (2022-10-28T20:41:48Z) - Dynamics-aware Adversarial Attack of Adaptive Neural Networks [75.50214601278455]
We investigate the dynamics-aware adversarial attack problem of adaptive neural networks.
We propose a Leaded Gradient Method (LGM) and show the significant effects of the lagged gradient.
Our LGM achieves impressive adversarial attack performance compared with the dynamic-unaware attack methods.
arXiv Detail & Related papers (2022-10-15T01:32:08Z) - Joint inference and input optimization in equilibrium networks [68.63726855991052]
deep equilibrium model is a class of models that foregoes traditional network depth and instead computes the output of a network by finding the fixed point of a single nonlinear layer.
We show that there is a natural synergy between these two settings.
We demonstrate this strategy on various tasks such as training generative models while optimizing over latent codes, training models for inverse problems like denoising and inpainting, adversarial training and gradient based meta-learning.
arXiv Detail & Related papers (2021-11-25T19:59:33Z) - Training Deep Neural Networks with Adaptive Momentum Inspired by the
Quadratic Optimization [20.782428252187024]
We propose a new adaptive momentum inspired by the optimal choice of the heavy ball momentum for optimization.
Our proposed adaptive heavy ball momentum can improve gradient descent (SGD) and Adam.
We verify the efficiency of SGD and Adam with the new adaptive momentum on extensive machine learning benchmarks, including image classification, language modeling, and machine translation.
arXiv Detail & Related papers (2021-10-18T07:03:48Z) - Robust Value Iteration for Continuous Control Tasks [99.00362538261972]
When transferring a control policy from simulation to a physical system, the policy needs to be robust to variations in the dynamics to perform well.
We present Robust Fitted Value Iteration, which uses dynamic programming to compute the optimal value function on the compact state domain.
We show that robust value is more robust compared to deep reinforcement learning algorithm and the non-robust version of the algorithm.
arXiv Detail & Related papers (2021-05-25T19:48:35Z) - How to train your neural ODE: the world of Jacobian and kinetic
regularization [7.83405844354125]
Training neural ODEs on large datasets has not been tractable due to the necessity of allowing the adaptive numerical ODE solver to refine its step size to very small values.
We introduce a theoretically-grounded combination of both optimal transport and stability regularizations which encourage neural ODEs to prefer simpler dynamics out of all the dynamics that solve a problem well.
arXiv Detail & Related papers (2020-02-07T14:15:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.