Gradient Regularized Natural Gradients
- URL: http://arxiv.org/abs/2601.18420v1
- Date: Mon, 26 Jan 2026 12:25:04 GMT
- Title: Gradient Regularized Natural Gradients
- Authors: Satya Prakash Dash, Hossein Abdi, Wei Pan, Samuel Kaski, Mingfei Sun
- Abstract summary: We propose a family of scalable second-order optimizers that integrate explicit gradient regularization with natural gradient updates. We establish convergence guarantees for GRNG, showing that gradient regularization improves stability and enables convergence to global minima.
- Score: 20.85716825925689
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Gradient regularization (GR) has been shown to improve the generalizability of trained models. While Natural Gradient Descent has been shown to accelerate optimization in the initial phase of training, little attention has been paid to how the training dynamics of second-order optimizers can benefit from GR. In this work, we propose Gradient-Regularized Natural Gradients (GRNG), a family of scalable second-order optimizers that integrate explicit gradient regularization with natural gradient updates. Our framework provides two complementary algorithms: a frequentist variant that avoids explicit inversion of the Fisher Information Matrix (FIM) via structured approximations, and a Bayesian variant based on a Regularized-Kalman formulation that eliminates the need for FIM inversion entirely. We establish convergence guarantees for GRNG, showing that gradient regularization improves stability and enables convergence to global minima. Empirically, we demonstrate that GRNG consistently enhances both optimization speed and generalization compared to first-order methods (SGD, AdamW) and second-order baselines (K-FAC, Sophia), with strong results on vision and language benchmarks. Our findings highlight gradient regularization as a principled and practical tool to unlock the robustness of natural gradient methods for large-scale deep learning.
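To make the update rule concrete, the sketch below combines an explicit gradient-norm penalty, $\tilde{L}(\theta) = L(\theta) + \tfrac{\lambda}{2}\lVert\nabla_\theta L(\theta)\rVert^2$, with a natural-gradient step $\theta \leftarrow \theta - \eta F^{-1}\nabla_\theta \tilde{L}(\theta)$ on a toy linear regression. This is only a minimal sketch under assumed details: the function names, constants, and the damped diagonal Fisher are illustrative choices, and neither the paper's frequentist structured-FIM variant nor its Regularized-Kalman variant is reproduced here.

```python
import numpy as np

# Minimal sketch of a gradient-regularized natural-gradient step on a toy
# linear regression. Names and constants (lam, damping, lr) are illustrative
# assumptions, not the paper's actual algorithm or API.

def grad_fn(w, x, y):
    """Gradient of the least-squares loss 0.5 * mean((x @ w - y)**2)."""
    return x.T @ (x @ w - y) / len(y)

def grng_step(w, x, y, lr=0.5, lam=0.01, eps=1e-3, damping=1e-4):
    g = grad_fn(w, x, y)
    # Gradient of the penalty 0.5 * lam * ||g||^2 is lam * (Hessian @ g);
    # approximate the Hessian-vector product by a finite difference along g.
    hvp = (grad_fn(w + eps * g, x, y) - g) / eps
    total_grad = g + lam * hvp
    # Damped diagonal Fisher of a unit-variance Gaussian regression model:
    # for this toy model it is the per-coordinate mean of squared inputs.
    fisher_diag = (x ** 2).mean(axis=0) + damping
    # Natural-gradient update: precondition by the (diagonal) inverse Fisher.
    return w - lr * total_grad / fisher_diag

# Toy usage on a random linear regression problem.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 5))
w_true = rng.normal(size=5)
y = x @ w_true + 0.1 * rng.normal(size=64)
w = np.zeros(5)
for _ in range(200):
    w = grng_step(w, x, y)
print("parameter error:", float(np.linalg.norm(w - w_true)))
```

The finite-difference Hessian-vector product is a common way to obtain the penalty's gradient without forming second derivatives explicitly; in deep-learning frameworks the same quantity is typically computed with automatic differentiation.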
Related papers
- Provably Convergent Decentralized Optimization over Directed Graphs under Generalized Smoothness [1.5892054128426507]
The Hessian norm is allowed to grow linearly with the gradient norm, thereby accommodating rapidly varying gradients beyond Lipschitz smoothness. We integrate gradient-tracking techniques with gradient clipping to ensure accurate convergence over directed communication graphs (a generic sketch of gradient tracking with clipping appears after this list). Our results remain valid even when the gradient dissimilarity is unbounded, making the proposed framework more applicable to realistic heterogeneous data environments.
arXiv Detail & Related papers (2026-01-07T04:25:33Z) - Gradient-Normalized Smoothness for Optimization with Approximate Hessians [37.1630298053787]
We develop algorithms that use approximate second-order information combined with the gradient regularization technique to achieve fast global convergence rates. We demonstrate direct applications of our results to logistic regression problems with approximate Hessians, as well as to optimization with Fisher and Gauss-Newton approximations.
arXiv Detail & Related papers (2025-06-16T17:19:34Z) - Gradient-Variation Online Learning under Generalized Smoothness [56.38427425920781]
Gradient-variation online learning aims to achieve regret guarantees that scale with variations in gradients of online functions.
Recent efforts in neural network optimization suggest a generalized smoothness condition, allowing smoothness to correlate with gradient norms.
We provide applications to fast-rate convergence in games and to extended adversarial optimization.
arXiv Detail & Related papers (2024-08-17T02:22:08Z) - Signal Processing Meets SGD: From Momentum to Filter [6.751292200515355]
In deep learning, stochastic gradient descent (SGD) and its momentum-based variants are widely used for optimization. In this paper, we analyze gradient behavior through a signal-processing lens, isolating key factors that influence updates. We introduce a novel method, SGDF, based on Wiener filter principles, which derives an optimal time-varying gain to refine updates.
arXiv Detail & Related papers (2023-11-06T01:41:46Z) - Neural Gradient Learning and Optimization for Oriented Point Normal Estimation [53.611206368815125]
We propose a deep learning approach to learn gradient vectors with consistent orientation from 3D point clouds for normal estimation.
We learn an angular distance field based on local plane geometry to refine the coarse gradient vectors.
Our method efficiently conducts global gradient approximation while achieving better accuracy and generalization ability in local feature description.
arXiv Detail & Related papers (2023-09-17T08:35:11Z) - NAG-GS: Semi-Implicit, Accelerated and Robust Stochastic Optimizer [45.47667026025716]
We propose a novel, robust and accelerated iteration that relies on two key elements.
The convergence and stability of the obtained method, referred to as NAG-GS, are first studied extensively.
We show that NAG-GS is competitive with state-of-the-art methods such as momentum SGD with weight decay and AdamW for the training of machine learning models.
arXiv Detail & Related papers (2022-09-29T16:54:53Z) - Adapting Stepsizes by Momentumized Gradients Improves Optimization and Generalization [89.66571637204012]
AdaMomentum performs well on vision tasks and consistently achieves state-of-the-art results on other tasks, including language processing.
arXiv Detail & Related papers (2021-06-22T03:13:23Z) - Leveraging Non-uniformity in First-order Non-convex Optimization [93.6817946818977]
Non-uniform refinement of objective functions leads to Non-uniform Smoothness (NS) and the Non-uniform Łojasiewicz inequality (NL).
These new definitions inspire new geometry-aware first-order methods that converge to global optimality faster than the classical $\Omega(1/t^2)$ lower bounds.
arXiv Detail & Related papers (2021-05-13T04:23:07Z) - AsymptoticNG: A regularized natural gradient optimization algorithm with look-ahead strategy [37.638447128733546]
We present a regularized natural gradient with a look-ahead strategy, named Asymptotic Natural Gradient (ANG).
ANG dynamically assembles the NG and Euclidean gradients, and updates parameters along the new direction using the intensity of NG.
Validation experiments show that ANG can update smoothly and stably at the second-order speed, and achieve better performance.
arXiv Detail & Related papers (2020-12-24T03:03:10Z) - Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem.
We solve one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent.
Our algorithm is applied to solve problems with one variable under the sparsity constraint.
arXiv Detail & Related papers (2020-06-16T13:41:54Z) - Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper, we analyze a variant of the Optimistic Adagrad algorithm for nonconvex-nonconcave min-max problems.
Our experiments show that the advantage of adaptive over non-adaptive gradient algorithms in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
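As a companion to the decentralized-optimization entry above, the following generic sketch shows gradient tracking combined with gradient clipping on a toy consensus problem. The mixing matrix, step size, and clipping threshold are assumptions for illustration; the cited paper's algorithm additionally handles directed communication graphs and generalized smoothness, which this sketch does not attempt to reproduce.

```python
import numpy as np

# Generic sketch of decentralized gradient tracking with gradient clipping.
# W, lr, and tau are illustrative assumptions; this is not the cited paper's
# exact algorithm (which works over directed graphs).

def clip(v, tau):
    """Rescale v so its norm is at most tau."""
    n = np.linalg.norm(v)
    return v if n <= tau else v * (tau / n)

def decentralized_gt(grads, W, x0, steps=200, lr=0.05, tau=1.0):
    """grads: per-node gradient functions; W: doubly stochastic mixing matrix."""
    n_nodes = W.shape[0]
    x = np.tile(x0, (n_nodes, 1))
    y = np.array([grads[i](x[i]) for i in range(n_nodes)])  # gradient trackers
    g_prev = y.copy()
    for _ in range(steps):
        # Mix with neighbors, then step along the clipped tracked direction.
        x = W @ x - lr * np.array([clip(y[i], tau) for i in range(n_nodes)])
        g_new = np.array([grads[i](x[i]) for i in range(n_nodes)])
        y = W @ y + g_new - g_prev  # track the network-average gradient
        g_prev = g_new
    return x.mean(axis=0)

# Toy usage: three nodes, each holding a quadratic 0.5 * ||x - c_i||^2.
centers = [np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([-1.0, 1.0])]
grads = [lambda x, c=c: x - c for c in centers]
W = np.full((3, 3), 1.0 / 3.0)  # fully connected, uniform averaging
x_star = decentralized_gt(grads, W, x0=np.zeros(2))
print("consensus solution:", x_star)  # should approach the mean of the centers
```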
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.