Related papers: FAdam: Adam is a natural gradient optimizer using diagonal empirical Fisher information

URL: http://arxiv.org/abs/2405.12807v9
Date: Sun, 4 Aug 2024 03:55:24 GMT
Title: FAdam: Adam is a natural gradient optimizer using diagonal empirical Fisher information
Authors: Dongseong Hwang
Abstract summary: We provide an accessible and detailed analysis of the diagonal empirical Fisher information matrix (FIM) in Adam. Our analysis uncovers flaws in the original Adam algorithm, leading to proposed corrections. Our modified algorithm, Fisher Adam (FAdam), demonstrates superior performance across diverse domains.
Score: 5.010523239708004
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper establishes a mathematical foundation for the Adam optimizer, elucidating its connection to natural gradient descent through Riemannian and information geometry. We provide an accessible and detailed analysis of the diagonal empirical Fisher information matrix (FIM) in Adam, clarifying all detailed approximations and advocating for the use of log probability functions as loss, which should be based on discrete distributions, due to the limitations of empirical FIM. Our analysis uncovers flaws in the original Adam algorithm, leading to proposed corrections such as enhanced momentum calculations, adjusted bias corrections, adaptive epsilon, and gradient clipping. We refine the weight decay term based on our theoretical framework. Our modified algorithm, Fisher Adam (FAdam), demonstrates superior performance across diverse domains including LLM, ASR, and VQ-VAE, achieving state-of-the-art results in ASR.
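As a rough illustration of the connection the abstract describes, the sketch below implements one step of standard Adam (not the FAdam corrections, which this page does not spell out) in NumPy. The function name `adam_step` and the default hyperparameters are assumptions chosen for illustration. When the loss is a negative log-likelihood, the running average of squared gradients `v` can be read as an estimate of the diagonal of the empirical Fisher information, and dividing by its square root acts as a diagonal preconditioner.

```python
import numpy as np


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update (illustrative sketch, not the FAdam algorithm).

    param, grad, m, v are arrays of the same shape; t is the 1-based step count.
    """
    # First moment: exponential moving average (EMA) of gradients (momentum).
    m = beta1 * m + (1.0 - beta1) * grad
    # Second moment: EMA of squared gradients. With a log-likelihood loss this
    # can be interpreted as a diagonal empirical Fisher information estimate.
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # Bias corrections compensate for the zero initialization of m and v.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Preconditioned step: divide by the square root of the diagonal estimate.
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v


# Toy usage with made-up numbers: a single step from zero-initialized moments.
param = np.array([1.0, -2.0])
grad = np.array([0.5, -4.0])
m = np.zeros_like(param)
v = np.zeros_like(param)
new_param, m, v = adam_step(param, grad, m, v, t=1)
```

On the first step the bias-corrected moments reduce to `grad` and `grad ** 2`, so each coordinate moves by approximately `lr` in the direction opposite to the sign of its gradient, regardless of the gradient's magnitude.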

Related papers

BAPE: Learning an Explicit Bayes Classifier for Long-tailed Visual Recognition [78.70453964041718]
Current deep learning algorithms usually solve for the optimal classifier by implicitly estimating the posterior probabilities. This simple methodology has been proven effective for meticulously balanced academic benchmark datasets. However, it is not applicable to the long-tailed data distributions in the real world. This paper presents a novel approach (BAPE) that provides a more precise theoretical estimation of the data distributions.
arXiv Detail & Related papers (2025-06-29T15:12:50Z)
A Simplified Analysis of SGD for Linear Regression with Weight Averaging [64.2393952273612]
Recent work by Zou et al. (2021) provides sharp rates for SGD optimization in linear regression using a constant learning rate. We provide a simplified analysis recovering the same bias and variance bounds provided in (Zou et al., 2021) based on simple linear algebra tools. We believe our work makes the analysis of gradient descent on linear regression very accessible and will be helpful in further analyzing mini-batching and learning rate scheduling.
arXiv Detail & Related papers (2025-06-18T15:10:38Z)
Covariates-Adjusted Mixed-Membership Estimation: A Novel Network Model with Optimal Guarantees [3.6936359356095454]
This paper addresses the problem of estimation in networks, where the goal is to efficiently estimate the latent mixed-membership structure from the network. We propose a novel model that incorporates both information, and similarities to the node co-membership model. We show that our approach achieves optimal accuracy for both the similarity matrix and the Frobenius norm entry loss.
arXiv Detail & Related papers (2025-02-10T16:56:00Z)
WarpAdam: A new Adam optimizer based on Meta-Learning approach [0.0]
This study introduces an innovative approach that merges the 'warped gradient descent' concept from Meta Learning with the Adam optimizer. By introducing a learnable distortion matrix P within the adaptation matrix P, we aim to enhance the model's capability across diverse data distributions. Our research showcases the potential of this novel approach through theoretical insights and empirical evaluations.
arXiv Detail & Related papers (2024-09-06T12:51:10Z)
Out of the Ordinary: Spectrally Adapting Regression for Covariate Shift [12.770658031721435]
We propose a method for adapting the weights of the last layer of a pre-trained neural regression model to perform better on input data originating from a different distribution. We demonstrate how this lightweight spectral adaptation procedure can improve out-of-distribution performance for synthetic and real-world datasets.
arXiv Detail & Related papers (2023-12-29T04:15:58Z)
Curvature-Independent Last-Iterate Convergence for Games on Riemannian Manifolds [77.4346324549323]
We show that a step size agnostic to the curvature of the manifold achieves a curvature-independent and linear last-iterate convergence rate. To the best of our knowledge, the possibility of curvature-independent rates and/or last-iterate convergence has not been considered before.
arXiv Detail & Related papers (2023-06-29T01:20:44Z)
Convergence of Adam Under Relaxed Assumptions [72.24779199744954]
We show that Adam converges to $\epsilon$-stationary points with $O(\epsilon^{-4})$ gradient complexity under far more realistic conditions. We also propose a variance-reduced version of Adam with an accelerated gradient complexity of $O(\epsilon^{-3})$.
arXiv Detail & Related papers (2023-04-27T06:27:37Z)
An Adam-enhanced Particle Swarm Optimizer for Latent Factor Analysis [6.960453648000231]
We propose an Adam-enhanced Hierarchical PSO-LFA model, which refines the latent factors with a sequential PSO algorithm. The experimental results on four real datasets demonstrate that our proposed model achieves higher prediction accuracy than its peers.
arXiv Detail & Related papers (2023-02-23T12:10:59Z)
Optimizing Information-theoretical Generalization Bounds via Anisotropic Noise in SGLD [73.55632827932101]
We optimize the information-theoretical generalization bound by manipulating the noise structure in SGLD. We prove that, under a constraint that guarantees low empirical risk, the optimal noise covariance is the square root of the expected gradient covariance.
arXiv Detail & Related papers (2021-10-26T15:02:27Z)
Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization. We show that if the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z)
On the Variance of the Fisher Information for Deep Learning [79.71410479830222]
The Fisher information matrix (FIM) has been applied to the realm of deep learning. The exact FIM is either unavailable in closed form or too expensive to compute. We investigate two such estimators based on two equivalent representations of the FIM.
arXiv Detail & Related papers (2021-07-09T04:46:50Z)
Two-Level K-FAC Preconditioning for Deep Learning [7.699428789159717]
In the context of deep learning, many optimization methods use gradient covariance information in order to accelerate the convergence of Gradient Descent. In particular, starting with Adagrad, a seemingly endless line of research advocates the use of diagonal approximations of the so-called empirical Fisher matrix. One particularly successful variant of such methods is the so-called K-FAC, which uses a Kronecker-factored block-diagonal preconditioner.
arXiv Detail & Related papers (2020-11-01T17:54:21Z)
MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of Gradients [112.00379151834242]
We propose an adaptive learning rate principle, in which the running mean of squared gradients in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate. This results in faster adaptation, which leads to more desirable empirical convergence behaviors.
arXiv Detail & Related papers (2020-06-21T21:47:43Z)

This list is automatically generated from the titles and abstracts of the papers on this site.