AdaFisher: Adaptive Second Order Optimization via Fisher Information
- URL: http://arxiv.org/abs/2405.16397v2
- Date: Thu, 17 Oct 2024 23:51:23 GMT
- Title: AdaFisher: Adaptive Second Order Optimization via Fisher Information
- Authors: Damien Martins Gomes, Yanlei Zhang, Eugene Belilovsky, Guy Wolf, Mahdi S. Hosseini,
- Abstract summary: We present AdaFisher, an adaptive second-order that leverages a block-diagonal approximation to the Fisher information matrix for adaptive preconditioning gradient.
We demonstrate that AdaFisher outperforms the SOTAs in terms of both accuracy and convergence speed.
- Score: 22.851200800265914
- License:
- Abstract: First-order optimization methods are currently the mainstream in training deep neural networks (DNNs). Optimizers like Adam incorporate limited curvature information by employing the diagonal matrix preconditioning of the stochastic gradient during the training. Despite their widespread, second-order optimization algorithms exhibit superior convergence properties compared to their first-order counterparts e.g. Adam and SGD. However, their practicality in training DNNs are still limited due to increased per-iteration computations and suboptimal accuracy compared to the first order methods. We present AdaFisher--an adaptive second-order optimizer that leverages a block-diagonal approximation to the Fisher information matrix for adaptive gradient preconditioning. AdaFisher aims to bridge the gap between enhanced convergence capabilities and computational efficiency in second-order optimization framework for training DNNs. Despite the slow pace of second-order optimizers, we showcase that AdaFisher can be reliably adopted for image classification, language modelling and stand out for its stability and robustness in hyperparameter tuning. We demonstrate that AdaFisher outperforms the SOTA optimizers in terms of both accuracy and convergence speed. Code is available from https://github.com/AtlasAnalyticsLab/AdaFisher.
Related papers
- MARS: Unleashing the Power of Variance Reduction for Training Large Models [56.47014540413659]
Large gradient algorithms like Adam, Adam, and their variants have been central to the development of this type of training.
We propose a framework that reconciles preconditioned gradient optimization methods with variance reduction via a scaled momentum technique.
arXiv Detail & Related papers (2024-11-15T18:57:39Z) - Understanding Optimization in Deep Learning with Central Flows [53.66160508990508]
We show that an RMS's implicit behavior can be explicitly captured by a "central flow:" a differential equation.
We show that these flows can empirically predict long-term optimization trajectories of generic neural networks.
arXiv Detail & Related papers (2024-10-31T17:58:13Z) - Efficient Second-Order Neural Network Optimization via Adaptive Trust Region Methods [0.0]
SecondOrderAdaptive (SOAA) is a novel optimization algorithm designed to overcome limitations of traditional second-order techniques.
We empirically demonstrate that SOAA achieves faster and more stable convergence compared to first-order approximations.
arXiv Detail & Related papers (2024-10-03T08:23:06Z) - SGD with Partial Hessian for Deep Neural Networks Optimization [18.78728272603732]
We propose a compound, which is a combination of a second-order with a precise partial Hessian matrix for updating channel-wise parameters and the first-order gradient descent (SGD) algorithms for updating the other parameters.
Compared with first-orders, it adopts a certain amount of information from the Hessian matrix to assist optimization, while compared with the existing second-order generalizations, it keeps the good performance of first-order generalizations imprecise.
arXiv Detail & Related papers (2024-03-05T06:10:21Z) - Information-Theoretic Trust Regions for Stochastic Gradient-Based
Optimization [17.79206971486723]
We show that arTuRO combines the fast convergence of adaptive moment-based optimization with the capabilities of SGD.
We approximate the diagonal elements of the Hessian from gradients, constructing a model of the expected Hessian over time using only first-order information.
We show that arTuRO combines the fast convergence of adaptive moment-based optimization with the capabilities of SGD.
arXiv Detail & Related papers (2023-10-31T16:08:38Z) - AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z) - ELRA: Exponential learning rate adaption gradient descent optimization
method [83.88591755871734]
We present a novel, fast (exponential rate), ab initio (hyper-free) gradient based adaption.
The main idea of the method is to adapt the $alpha by situational awareness.
It can be applied to problems of any dimensions n and scales only linearly.
arXiv Detail & Related papers (2023-09-12T14:36:13Z) - Eva: A General Vectorized Approximation Framework for Second-order
Optimization [16.647611352181574]
We present a memory- and time-efficient second-order algorithm named Eva with two novel techniques.
We derive an efficient update formula without explicitly computing the inverse of using the Sherman-Morrison formula.
Experiments show that Eva reduces the end-to-end training time up to 2.05x and 2.42x compared to first-order SGD and second-order algorithms.
arXiv Detail & Related papers (2023-08-04T03:51:38Z) - A Stable, Fast, and Fully Automatic Learning Algorithm for Predictive
Coding Networks [65.34977803841007]
Predictive coding networks are neuroscience-inspired models with roots in both Bayesian statistics and neuroscience.
We show how by simply changing the temporal scheduling of the update rule for the synaptic weights leads to an algorithm that is much more efficient and stable than the original one.
arXiv Detail & Related papers (2022-11-16T00:11:04Z) - Learning to Optimize Quasi-Newton Methods [22.504971951262004]
This paper introduces a novel machine learning called LODO, which tries to online meta-learn the best preconditioner during optimization.
Unlike other L2O methods, LODO does not require any meta-training on a training task distribution.
We show that our gradient approximates the inverse Hessian in noisy loss landscapes and is capable of representing a wide range of inverse Hessians.
arXiv Detail & Related papers (2022-10-11T03:47:14Z) - Self-Directed Online Machine Learning for Topology Optimization [58.920693413667216]
Self-directed Online Learning Optimization integrates Deep Neural Network (DNN) with Finite Element Method (FEM) calculations.
Our algorithm was tested by four types of problems including compliance minimization, fluid-structure optimization, heat transfer enhancement and truss optimization.
It reduced the computational time by 2 5 orders of magnitude compared with directly using methods, and outperformed all state-of-the-art algorithms tested in our experiments.
arXiv Detail & Related papers (2020-02-04T20:00:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.