Learning to Optimize Quasi-Newton Methods
- URL: http://arxiv.org/abs/2210.06171v2
- Date: Mon, 11 Sep 2023 07:27:05 GMT
- Title: Learning to Optimize Quasi-Newton Methods
- Authors: Isaac Liao, Rumen R. Dangovski, Jakob N. Foerster, Marin Soljačić
- Abstract summary: This paper introduces a novel machine learning optimizer called LODO, which tries to meta-learn the best preconditioner online during optimization.
Unlike other L2O methods, LODO does not require any meta-training on a training task distribution.
We show that our optimizer approximates the inverse Hessian in noisy loss landscapes and is capable of representing a wide range of inverse Hessians.
- Score: 22.504971951262004
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fast gradient-based optimization algorithms have become increasingly
essential for the computationally efficient training of machine learning
models. One technique is to multiply the gradient by a preconditioner matrix to
produce a step, but it is unclear what the best preconditioner matrix is. This
paper introduces a novel machine learning optimizer called LODO, which tries to
meta-learn the best preconditioner online during optimization. Specifically,
our optimizer merges Learning to Optimize (L2O) techniques with quasi-Newton
methods to learn preconditioners parameterized as neural networks; they are
more flexible than preconditioners in other quasi-Newton methods. Unlike other
L2O methods, LODO does not require any meta-training on a training task
distribution, and instead learns to optimize on the fly while optimizing on the
test task, adapting to the local characteristics of the loss landscape while
traversing it. Theoretically, we show that our optimizer approximates the
inverse Hessian in noisy loss landscapes and is capable of representing a wide
range of inverse Hessians. We experimentally verify that our algorithm can
optimize in noisy settings, and show that simpler alternatives for representing
the inverse Hessians worsen performance. Lastly, we use our optimizer to train
a semi-realistic deep neural network with 95k parameters at speeds comparable
to those of standard neural network optimizers.
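As a rough illustration of the idea in the abstract above (a gradient step through a preconditioner that is itself meta-learned online), the sketch below learns a dense preconditioner by a one-step hypergradient update on a noisy quadratic. This is only a toy stand-in: LODO parameterizes the preconditioner as a neural network, and the loss and all hyperparameters here are invented for the example.
```python
# Toy sketch of online meta-learning a preconditioner while optimizing.
# This is NOT the LODO algorithm itself: LODO parameterizes the
# preconditioner as a neural network; here we use a plain dense matrix
# and a one-step hypergradient update, purely to illustrate the idea.
import numpy as np

rng = np.random.default_rng(0)
d = 10
H = np.diag(np.linspace(1.0, 30.0, d))   # quadratic loss 0.5 * x^T H x, condition number 30
theta = rng.normal(size=d)               # parameters being optimized
P = np.eye(d)                            # preconditioner, learned on the fly
lr, meta_lr, noise = 1e-2, 1e-5, 0.1     # made-up hyperparameters

def noisy_grad(x):
    """Gradient of the quadratic plus Gaussian noise (noisy loss landscape)."""
    return H @ x + noise * rng.normal(size=d)

for step in range(2000):
    g = noisy_grad(theta)
    theta_next = theta - lr * (P @ g)    # preconditioned gradient step
    # One-step hypergradient of the post-step loss w.r.t. P:
    # d loss(theta - lr * P @ g) / dP = -lr * outer(grad_at_next_point, g)
    g_next = noisy_grad(theta_next)
    P -= meta_lr * (-lr) * np.outer(g_next, g)
    theta = theta_next

print("final loss:", 0.5 * theta @ H @ theta)
```
The learned matrix plays the role that the inverse-Hessian approximation plays in quasi-Newton methods; the paper's contribution is a neural-network parameterization of this object that is more flexible and is trained on the fly.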
Related papers
- ELRA: Exponential learning rate adaption gradient descent optimization
method [83.88591755871734]
We present a novel, fast (exponential rate), ab initio (hyperparameter-free) gradient-based adaptation method.
The main idea of the method is to adapt the learning rate $\alpha$ by situational awareness.
It can be applied to problems of any dimension $n$ and scales only linearly with $n$.
arXiv Detail & Related papers (2023-09-12T14:36:13Z)
- Transformer-Based Learned Optimization [37.84626515073609]
We propose a new approach to learned optimization where we represent the computation of the optimizer's update step using a neural network.
Our innovation is a new neural network architecture inspired by the classic BFGS algorithm.
We demonstrate the advantages of our approach on a benchmark composed of objective functions traditionally used for the evaluation of optimization algorithms.
arXiv Detail & Related papers (2022-12-02T09:47:08Z)
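Since the entry above describes an architecture inspired by the classic BFGS algorithm, here is the textbook BFGS update of the inverse-Hessian estimate for reference; this is the standard formula, not code from that paper.
```python
# Classic BFGS update of the inverse-Hessian estimate H_k, given the step
# s_k = x_{k+1} - x_k and the gradient difference y_k = g_{k+1} - g_k.
# Textbook formula shown for reference only.
import numpy as np

def bfgs_inverse_hessian_update(H, s, y):
    rho = 1.0 / (y @ s)                       # requires the curvature condition y @ s > 0
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

# The quasi-Newton step is then -H @ g; learned optimizers such as the one
# above replace this hand-derived update with a trained neural network.
```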
- VeLO: Training Versatile Learned Optimizers by Scaling Up [67.90237498659397]
We leverage the same scaling approach behind the success of deep learning to learn versatile optimizers.
We train an optimizer for deep learning which is itself a small neural network that ingests gradients and outputs parameter updates.
We open source our learned optimizers, meta-training code, the associated train and test data, and an extensive benchmark suite with baselines at velo-code.io.
arXiv Detail & Related papers (2022-11-17T18:39:07Z)
- Faster Optimization on Sparse Graphs via Neural Reparametrization [15.275428333269453]
We show that a graph neural network can implement an efficient Quasi-Newton method that can speed up optimization by a factor of 10-100x.
We show the application of our method on scientifically relevant problems including heat diffusion, synchronization and persistent homology.
arXiv Detail & Related papers (2022-05-26T20:52:18Z)
- Gradient Descent, Stochastic Optimization, and Other Tales [8.034728173797953]
This tutorial doesn't shy away from addressing both the formal and informal aspects of gradient descent and optimization methods.
Gradient descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize machine learning tasks.
In deep neural networks, the gradient computed on a single sample or a batch of samples is employed to save computational resources and to escape saddle points.
arXiv Detail & Related papers (2022-05-02T12:06:53Z)
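The tutorial entry above refers to following the gradient of a single sample or a minibatch. A minimal, generic minibatch SGD step is sketched below; the linear-regression loss and all names are placeholders, not material from the tutorial.
```python
# Generic minibatch SGD step: estimate the full-batch gradient from a
# random subset of samples, then take a plain gradient step.
# Placeholder linear-regression loss; not taken from the tutorial.
import numpy as np

def sgd_step(w, X, y, lr=1e-2, batch_size=32, rng=np.random.default_rng()):
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = 2.0 * Xb.T @ (Xb @ w - yb) / batch_size   # gradient of the mean squared error
    return w - lr * grad

# Tiny usage example with synthetic data:
X = np.random.default_rng(0).normal(size=(256, 4))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true
w = np.zeros(4)
for _ in range(500):
    w = sgd_step(w, X, y)
```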
- Joint inference and input optimization in equilibrium networks [68.63726855991052]
A deep equilibrium model is a class of models that forgoes traditional network depth and instead computes the output of a network by finding the fixed point of a single nonlinear layer.
We show that there is a natural synergy between these two settings: computing the network's fixed point and performing gradient-based optimization over its inputs.
We demonstrate this strategy on various tasks such as training generative models while optimizing over latent codes, training models for inverse problems like denoising and inpainting, adversarial training and gradient based meta-learning.
arXiv Detail & Related papers (2021-11-25T19:59:33Z)
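The equilibrium-network entry above computes a network's output as the fixed point of a single nonlinear layer. Below is a minimal sketch of that forward pass using naive fixed-point iteration; real deep equilibrium models use faster root-finders and implicit differentiation, and the layer here is an arbitrary contractive example.
```python
# Minimal deep-equilibrium-style forward pass: iterate a single nonlinear
# layer z <- f(z, x) until it (approximately) stops changing.
# The layer below is an arbitrary small example, scaled to be contractive.
import numpy as np

rng = np.random.default_rng(0)
d = 16
W = 0.3 * rng.normal(size=(d, d)) / np.sqrt(d)   # small weights -> contraction
U = rng.normal(size=(d, d)) / np.sqrt(d)

def layer(z, x):
    return np.tanh(W @ z + U @ x)

def deq_forward(x, tol=1e-6, max_iter=200):
    z = np.zeros_like(x)
    for _ in range(max_iter):
        z_new = layer(z, x)
        if np.linalg.norm(z_new - z) < tol:
            break
        z = z_new
    return z_new

z_star = deq_forward(rng.normal(size=d))   # the "output" of the implicit network
```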
- SHINE: SHaring the INverse Estimate from the forward pass for bi-level optimization and implicit models [15.541264326378366]
In recent years, implicit deep learning has emerged as a method to increase the depth of deep neural networks.
The training is performed as a bi-level problem, and its computational complexity is partially driven by the iterative inversion of a huge Jacobian matrix.
We propose a novel strategy to tackle this computational bottleneck from which many bi-level problems suffer.
arXiv Detail & Related papers (2021-06-01T15:07:34Z)
- Reverse engineering learned optimizers reveals known and novel mechanisms [50.50540910474342]
Learned optimizers are algorithms that can themselves be trained to solve optimization problems.
Our results help elucidate the previously murky understanding of how learned optimizers work, and establish tools for interpreting future learned optimizers.
arXiv Detail & Related papers (2020-11-04T07:12:43Z)
- Tasks, stability, architecture, and compute: Training more effective learned optimizers, and using them to train themselves [53.37905268850274]
We introduce a new, hierarchical, neural-network-parameterized optimizer with access to additional features such as validation loss to enable automatic regularization.
Most learned optimizers have been trained on only a single task, or a small number of tasks.
We train ours on thousands of tasks, making use of orders of magnitude more compute, resulting in optimizers that generalize better to unseen tasks.
arXiv Detail & Related papers (2020-09-23T16:35:09Z)
- A Primer on Zeroth-Order Optimization in Signal Processing and Machine Learning [95.85269649177336]
ZO optimization iteratively performs three major steps: gradient estimation, descent direction computation, and solution update.
We demonstrate promising applications of ZO optimization, such as evaluating and generating explanations from black-box deep learning models, and efficient online sensor management.
arXiv Detail & Related papers (2020-06-11T06:50:35Z)
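The primer entry above lists gradient estimation, descent direction computation, and solution update as the three steps of ZO optimization. The sketch below uses a standard two-point random-direction gradient estimator on a toy black-box loss; it is a generic illustration, not taken from the primer.
```python
# Standard two-point zeroth-order gradient estimate: probe the black-box
# loss along random directions and use finite differences as
# directional-derivative estimates. Generic example only.
import numpy as np

def zo_gradient_estimate(loss, x, mu=1e-3, n_dirs=20, rng=np.random.default_rng()):
    d = x.shape[0]
    grad = np.zeros(d)
    for _ in range(n_dirs):
        u = rng.normal(size=d)
        grad += (loss(x + mu * u) - loss(x - mu * u)) / (2.0 * mu) * u
    return grad / n_dirs

# Example: the three ZO steps on a simple quadratic treated as a black box.
f = lambda x: float(np.sum(x ** 2))
x = np.ones(5)
for _ in range(100):
    g = zo_gradient_estimate(f, x)   # 1) gradient estimation
    direction = -g                   # 2) descent direction computation
    x = x + 0.05 * direction         # 3) solution update
```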