AGD: an Auto-switchable Optimizer using Stepwise Gradient Difference for
Preconditioning Matrix
- URL: http://arxiv.org/abs/2312.01658v1
- Date: Mon, 4 Dec 2023 06:20:14 GMT
- Title: AGD: an Auto-switchable Optimizer using Stepwise Gradient Difference for
Preconditioning Matrix
- Authors: Yun Yue, Zhiling Ye, Jiadi Jiang, Yongchao Liu, Ke Zhang
- Abstract summary: We propose a novel approach to designing the preconditioning matrix by utilizing the gradient difference between two successive steps as the diagonal elements.
We evaluate AGD on public datasets of Natural Language Processing (NLP), Computer Vision (CV), and Recommendation Systems (RecSys).
Our experimental results demonstrate that AGD outperforms the state-of-the-art (SOTA) techniques, achieving highly competitive or significantly better predictive performance.
- Score: 9.629238108795013
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adaptive optimizers, such as Adam, have achieved remarkable success in deep
learning. A key component of these optimizers is the so-called preconditioning
matrix, providing enhanced gradient information and regulating the step size of
each gradient direction. In this paper, we propose a novel approach to
designing the preconditioning matrix by utilizing the gradient difference
between two successive steps as the diagonal elements. These diagonal elements
are closely related to the Hessian and can be perceived as an approximation of
the inner product between the Hessian row vectors and difference of the
adjacent parameter vectors. Additionally, we introduce an auto-switching
function that enables the preconditioning matrix to switch dynamically between
Stochastic Gradient Descent (SGD) and the adaptive optimizer. Based on these
two techniques, we develop a new optimizer named AGD that enhances the
generalization performance. We evaluate AGD on public datasets of Natural
Language Processing (NLP), Computer Vision (CV), and Recommendation Systems
(RecSys). Our experimental results demonstrate that AGD outperforms the
state-of-the-art (SOTA) optimizers, achieving highly competitive or
significantly better predictive performance. Furthermore, we analyze how AGD is
able to switch automatically between SGD and the adaptive optimizer and its
actual effects on various scenarios. The code is available at
https://github.com/intelligent-machine-learning/dlrover/tree/master/atorch/atorch/optimizers.
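The two ingredients described in the abstract (a diagonal preconditioner built from the stepwise gradient difference, and an auto-switch between SGD-like and adaptive updates) can be illustrated with a short sketch. The NumPy code below is only a plausible reading of that description, not the authors' algorithm: the exponential moving averages, the bias correction, and the max(·, δ) switching rule are Adam-style assumptions, and all names (agd_sketch, delta, and so on) are illustrative; the released code linked above is the authoritative implementation.

```python
import numpy as np

def agd_sketch(grad_fn, theta, steps=100, lr=1e-3,
               beta1=0.9, beta2=0.999, delta=1e-5, eps=1e-8):
    """Illustrative AGD-style update (a sketch, not the paper's exact algorithm)."""
    m = np.zeros_like(theta)       # EMA of the gradient (first moment)
    v = np.zeros_like(theta)       # EMA of the squared gradient difference
    g_prev = np.zeros_like(theta)  # gradient from the previous step
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        # Stepwise gradient difference: roughly Hessian @ (theta_t - theta_{t-1}),
        # i.e. the inner product of Hessian rows with the parameter difference.
        # On the first step this falls back to the raw gradient.
        s = g - g_prev
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * s * s
        m_hat = m / (1 - beta1 ** t)   # Adam-style bias correction (assumed)
        v_hat = v / (1 - beta2 ** t)
        # Auto-switch (assumed rule): coordinates whose statistic stays below
        # delta get a constant, SGD-like scaling; the rest are scaled adaptively.
        precond = np.maximum(np.sqrt(v_hat), delta)
        theta = theta - lr * m_hat / (precond + eps)
        g_prev = g
    return theta

# Toy usage: a few hundred steps on a badly scaled quadratic f(x) = 0.5 * x^T A x.
A = np.diag([1.0, 10.0, 100.0])
theta_out = agd_sketch(lambda x: A @ x, np.ones(3), steps=500, lr=0.1)
```

Under this reading, the threshold δ plays the switching role: a small accumulated curvature signal leaves the step effectively SGD, while a large one yields an Adam-like adaptive rescaling.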
Related papers
- Two Optimizers Are Better Than One: LLM Catalyst Empowers Gradient-Based Optimization for Prompt Tuning [69.95292905263393]
We show that gradient-based optimization and large language models (LLMs) are complementary to each other, suggesting a collaborative optimization approach.
Our code is released at https://www.guozix.com/guozix/LLM-catalyst.
arXiv Detail & Related papers (2024-05-30T06:24:14Z) - SGD with Partial Hessian for Deep Neural Networks Optimization [18.78728272603732]
We propose a compound optimizer, which is a combination of a second-order optimizer with a precise partial Hessian matrix for updating channel-wise parameters and the first-order stochastic gradient descent (SGD) optimizer for updating the other parameters.
Compared with first-order optimizers, it adopts a certain amount of information from the Hessian matrix to assist optimization, while compared with existing second-order optimizers, it keeps the good generalization performance of first-order optimizers.
arXiv Detail & Related papers (2024-03-05T06:10:21Z) - Curvature-Informed SGD via General Purpose Lie-Group Preconditioners [6.760212042305871]
We present a novel approach to accelerate stochastic gradient descent (SGD) by utilizing curvature information.
Our approach involves two preconditioners: a matrix-free preconditioner and a low-rank approximation preconditioner.
We demonstrate that Preconditioned SGD (PSGD) outperforms SoTA on Vision, NLP, and RL tasks.
arXiv Detail & Related papers (2024-02-07T03:18:00Z) - ELRA: Exponential learning rate adaption gradient descent optimization
method [83.88591755871734]
We present a novel, fast (exponential rate adaption), ab initio (hyper-parameter-free) gradient-based optimization method.
The main idea of the method is to adapt the learning rate $\alpha$ by situational awareness.
It can be applied to problems of any dimension $n$ and scales only linearly with $n$.
arXiv Detail & Related papers (2023-09-12T14:36:13Z) - Performance Embeddings: A Similarity-based Approach to Automatic
Performance Optimization [71.69092462147292]
Performance embeddings enable knowledge transfer of performance tuning between applications.
We demonstrate this transfer tuning approach on case studies in deep neural networks, dense and sparse linear algebra compositions, and numerical weather prediction stencils.
arXiv Detail & Related papers (2023-03-14T15:51:35Z) - Learning to Optimize Quasi-Newton Methods [22.504971951262004]
This paper introduces a novel machine learning optimizer called LODO, which tries to online meta-learn the best preconditioner during optimization.
Unlike other L2O methods, LODO does not require any meta-training on a training task distribution.
We show that the meta-learned preconditioner approximates the inverse Hessian in noisy loss landscapes and is capable of representing a wide range of inverse Hessians.
arXiv Detail & Related papers (2022-10-11T03:47:14Z) - Self-Tuning Stochastic Optimization with Curvature-Aware Gradient
Filtering [53.523517926927894]
We explore the use of exact per-sample Hessian-vector products and gradients to construct self-tuning optimizers.
We prove that our model-based procedure converges in the noisy quadratic setting.
This is an interesting step for constructing self-tuning optimizers.
arXiv Detail & Related papers (2020-11-09T22:07:30Z) - Bilevel Optimization: Convergence Analysis and Enhanced Design [63.64636047748605]
Bilevel optimization is a tool for many machine learning problems.
We propose a novel stochastic bilevel optimizer named stocBiO, which features a sample-efficient hypergradient estimator.
arXiv Detail & Related papers (2020-10-15T18:09:48Z) - A Primer on Zeroth-Order Optimization in Signal Processing and Machine
Learning [95.85269649177336]
ZO optimization iteratively performs three major steps: gradient estimation, descent direction computation, and solution update.
We demonstrate promising applications of ZO optimization, such as evaluating and generating explanations from black-box deep learning models, and efficient online sensor management.
arXiv Detail & Related papers (2020-06-11T06:50:35Z)
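The last entry spells out the generic ZO loop: estimate a gradient from function values only, form a descent direction, and update the iterate. Below is a minimal sketch of one standard variant, a two-point Gaussian-smoothing estimator driving plain descent; the smoothing radius mu, the number of probe directions, and the step size are generic illustrative choices, not taken from any particular paper in this list.

```python
import numpy as np

def zo_gradient_estimate(f, x, mu=1e-3, num_dirs=10, rng=None):
    """Gradient estimation: two-point Gaussian-smoothing estimator."""
    rng = np.random.default_rng() if rng is None else rng
    fx = f(x)
    g_hat = np.zeros_like(x)
    for _ in range(num_dirs):
        u = rng.standard_normal(x.shape)          # random probe direction
        g_hat += (f(x + mu * u) - fx) / mu * u    # finite-difference along u
    return g_hat / num_dirs

def zo_descent(f, x, lr=0.05, steps=200, **est_kwargs):
    """Descent direction and solution update using only function evaluations."""
    for _ in range(steps):
        x = x - lr * zo_gradient_estimate(f, x, **est_kwargs)
    return x

# Usage: minimize a black-box quadratic without ever computing its gradient.
f = lambda x: float(np.sum((x - 1.0) ** 2))
x_min = zo_descent(f, np.zeros(5))
```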
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information on this site and is not responsible for any consequences of its use.