Memory Augmented Optimizers for Deep Learning
- URL: http://arxiv.org/abs/2106.10708v1
- Date: Sun, 20 Jun 2021 14:58:08 GMT
- Title: Memory Augmented Optimizers for Deep Learning
- Authors: Paul-Aymeric McRae, Prasanna Parthasarathi, Mahmoud Assran, Sarath
Chandar
- Abstract summary: We propose a framework of memory-augmented gradient descent optimizers that retain a limited view of their gradient history in their internal memory.
We show that the proposed class of optimizers with fixed-size memory converges under assumptions of strong convexity.
- Score: 10.541705775336657
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Popular approaches for minimizing loss in data-driven learning often involve
an abstraction or an explicit retention of the history of gradients for
efficient parameter updates. The aggregated history of gradients nudges the
parameter updates in the right direction even when the gradients at any given
step are not informative. Although the history of gradients summarized in
meta-parameters or explicitly stored in memory has been shown effective in
theory and practice, the question of whether all or only a subset of the
gradients in the history is sufficient for deciding the parameter updates
remains unanswered. In this paper, we propose a framework of memory-augmented
gradient descent optimizers that retain a limited view of their gradient
history in their internal memory. Such optimizers scale well to large real-life
datasets, and our experiments show that the memory augmented extensions of
standard optimizers enjoy accelerated convergence and improved performance on a
majority of computer vision and language tasks that we considered.
Additionally, we prove that the proposed class of optimizers with fixed-size
memory converges under assumptions of strong convexity, regardless of which
gradients are selected or how they are linearly combined to form the update
step.
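To make the framework's core idea concrete, below is a minimal sketch of a memory-augmented SGD variant in Python. It assumes a fixed-size buffer of past gradients that is averaged into each update; the class name MemorySGD and the parameters memory_size and mix are illustrative choices, not the paper's exact algorithm or naming.
```python
from collections import deque
import numpy as np

class MemorySGD:
    """Illustrative memory-augmented SGD: keeps a fixed-size buffer of past
    gradients and mixes them into each update (a sketch, not the paper's
    exact method)."""

    def __init__(self, lr=0.01, memory_size=5, mix=0.5):
        self.lr = lr                              # learning rate
        self.mix = mix                            # weight given to the memory term
        self.memory = deque(maxlen=memory_size)   # fixed-size gradient history

    def step(self, params, grad):
        if self.memory:
            # Linearly combine the current gradient with the stored gradients.
            # Here we simply average the buffer; the framework described in the
            # abstract allows other selections and linear combinations.
            mem_grad = np.mean(np.stack(list(self.memory)), axis=0)
            update = (1.0 - self.mix) * grad + self.mix * mem_grad
        else:
            update = grad
        self.memory.append(np.copy(grad))
        return params - self.lr * update

# Toy usage on f(w) = ||w||^2, whose gradient is 2w.
opt = MemorySGD(lr=0.1, memory_size=5, mix=0.5)
w = np.array([1.0, -2.0])
for _ in range(100):
    w = opt.step(w, 2.0 * w)
print(w)  # approaches the minimizer at the origin
```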
Related papers
- Stable Hadamard Memory: Revitalizing Memory-Augmented Agents for Reinforcement Learning [64.93848182403116]
Current deep-learning memory models struggle in reinforcement learning environments that are partially observable and require long-term memory.
We introduce the Stable Hadamard Memory, a novel memory model for reinforcement learning agents.
Our approach significantly outperforms state-of-the-art memory-based methods on challenging partially observable benchmarks.
arXiv Detail & Related papers (2024-10-14T03:50:17Z)
- An Effective Dynamic Gradient Calibration Method for Continual Learning [11.555822066922508]
Continual learning (CL) is a fundamental topic in machine learning, where the goal is to train a model with continuously incoming data and tasks.
Due to the memory limit, we cannot store all the historical data, and therefore confront the "catastrophic forgetting" problem.
We develop an effective algorithm to calibrate the gradient in each updating step of the model.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, using only a minimal number of late pre-trained layers alleviates the peak memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
- AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z)
- EMO: Episodic Memory Optimization for Few-Shot Meta-Learning [69.50380510879697]
Episodic memory optimization for meta-learning, which we call EMO, is inspired by the human ability to recall past learning experiences from the brain's memory.
EMO nudges parameter updates in the right direction, even when the gradients provided by a limited number of examples are uninformative.
EMO scales well to most few-shot classification benchmarks and improves the performance of optimization-based meta-learning methods.
arXiv Detail & Related papers (2023-06-08T13:39:08Z)
- Tom: Leveraging trend of the observed gradients for faster convergence [0.0]
Tom is a novel variant of Adam that takes into account the trend observed in the gradients of the loss landscape traversed by the neural network.
Tom outperforms Adagrad, Adadelta, RMSProp, and Adam in terms of both accuracy and convergence speed.
arXiv Detail & Related papers (2021-09-07T20:19:40Z)
- Adapting Stepsizes by Momentumized Gradients Improves Optimization and Generalization [89.66571637204012]
AdaMomentum performs well on vision tasks and consistently achieves state-of-the-art results on other tasks, including language processing.
arXiv Detail & Related papers (2021-06-22T03:13:23Z)
- Self-Tuning Stochastic Optimization with Curvature-Aware Gradient Filtering [53.523517926927894]
We explore the use of exact per-sample Hessian-vector products and gradients to construct self-tuning quadratics.
We prove that our model-based procedure converges in the noisy gradient setting.
This is an interesting step toward constructing self-tuning quadratics.
arXiv Detail & Related papers (2020-11-09T22:07:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.