Memory Augmented Optimizers for Deep Learning
- URL: http://arxiv.org/abs/2106.10708v1
- Date: Sun, 20 Jun 2021 14:58:08 GMT
- Title: Memory Augmented Optimizers for Deep Learning
- Authors: Paul-Aymeric McRae, Prasanna Parthasarathi, Mahmoud Assran, Sarath Chandar
- Abstract summary: We propose a framework of memory-augmented gradient descents that retain a limited view of their gradient history in their internal memory.
We show that the proposed class of gradient descents with fixed-size memory converges under assumptions of strong convexity.
- Score: 10.541705775336657
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Popular approaches for minimizing loss in data-driven learning often involve
an abstraction or an explicit retention of the history of gradients for
efficient parameter updates. The aggregated history of gradients nudges the
parameter updates in the right direction even when the gradients at any given
step are not informative. Although the history of gradients summarized in
meta-parameters or explicitly stored in memory has been shown effective in
theory and practice, the question of whether $all$ or only a subset of the
gradients in the history are sufficient in deciding the parameter updates
remains unanswered. In this paper, we propose a framework of memory-augmented
gradient descent optimizers that retain a limited view of their gradient
history in their internal memory. Such optimizers scale well to large real-life
datasets, and our experiments show that the memory augmented extensions of
standard optimizers enjoy accelerated convergence and improved performance on a
majority of computer vision and language tasks that we considered.
Additionally, we prove that the proposed class of optimizers with fixed-size
memory converge under assumptions of strong convexity, regardless of which
gradients are selected or how they are linearly combined to form the update
step.
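The idea in the abstract, an optimizer that keeps only a fixed-size view of its gradient history and steps along a linear combination of the stored gradients, can be sketched on a toy problem. This is a minimal illustration, not the paper's method: the class name `WindowMemorySGD`, the choice to keep the most recent `capacity` gradients, the uniform averaging, and the quadratic test objective are all assumptions made for this example.

```python
from collections import deque


class WindowMemorySGD:
    """Gradient descent that steps along the mean of the last `capacity` gradients.

    Illustrative only: the recency-based selection and the uniform mean are
    assumptions made for this sketch, not the paper's specific rules.
    """

    def __init__(self, lr=0.05, capacity=3):
        self.lr = lr
        # deque(maxlen=...) drops the oldest gradient once full,
        # so the memory footprint stays fixed.
        self.memory = deque(maxlen=capacity)

    def step(self, params, grad):
        self.memory.append(list(grad))
        n = len(self.memory)
        # Linear combination of the stored gradients (uniform weights 1/n).
        update = [sum(g[i] for g in self.memory) / n for i in range(len(params))]
        return [p - self.lr * u for p, u in zip(params, update)]


def quadratic_grad(x, curvatures=(1.0, 10.0)):
    """Gradient of the strongly convex f(x) = 0.5 * sum(c_i * x_i**2)."""
    return [c * xi for c, xi in zip(curvatures, x)]


def minimize(steps=500, start=(5.0, -3.0)):
    """Run the toy optimizer on the quadratic and return (final_x, optimizer)."""
    opt = WindowMemorySGD()
    x = list(start)
    for _ in range(steps):
        x = opt.step(x, quadratic_grad(x))
    return x, opt
```

Calling `minimize()` drives both coordinates toward the minimizer at the origin while the buffer never holds more than three gradients. Other selection rules or weightings could be swapped into `step`, but the step size would then need to be re-tuned for this toy problem.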
Related papers
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
- AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z)
- EMO: Episodic Memory Optimization for Few-Shot Meta-Learning [69.50380510879697]
Episodic memory optimization for meta-learning, which we call EMO, is inspired by the human ability to recall past learning experiences from the brain's memory.
EMO nudges parameter updates in the right direction, even when the gradients provided by a limited number of examples are uninformative.
EMO scales well with most few-shot classification benchmarks and improves the performance of optimization-based meta-learning methods.
arXiv Detail & Related papers (2023-06-08T13:39:08Z)
- Continuous-Time Meta-Learning with Forward Mode Differentiation [65.26189016950343]
We introduce Continuous Meta-Learning (COMLN), a meta-learning algorithm where adaptation follows the dynamics of a gradient vector field.
Treating the learning process as an ODE offers the notable advantage that the length of the trajectory is now continuous.
We show empirically its efficiency in terms of runtime and memory usage, and we illustrate its effectiveness on a range of few-shot image classification problems.
arXiv Detail & Related papers (2022-03-02T22:35:58Z)
- Dataset Knowledge Transfer for Class-Incremental Learning without Memory [12.569286058146343]
We tackle class-incremental learning without memory by adapting prediction bias correction.
Bias correction was proposed for settings where a memory is allowed and cannot be used directly without one, since it requires samples of past classes.
We introduce a two-step learning process which allows the transfer of bias correction parameters between reference and target datasets.
arXiv Detail & Related papers (2021-10-16T00:33:33Z)
- Tom: Leveraging trend of the observed gradients for faster convergence [0.0]
Tom is a novel variant of Adam that takes into account the trend of the gradients observed in the loss landscape traversed by the neural network.
Tom outperforms Adagrad, Adadelta, RMSProp and Adam in terms of both accuracy and convergence speed.
arXiv Detail & Related papers (2021-09-07T20:19:40Z)
- Adapting Stepsizes by Momentumized Gradients Improves Optimization and Generalization [89.66571637204012]
AdaMomentum on vision, and achieves state-of-the-art results consistently on other tasks including language processing.
arXiv Detail & Related papers (2021-06-22T03:13:23Z)
- Self-Tuning Stochastic Optimization with Curvature-Aware Gradient Filtering [53.523517926927894]
We explore the use of exact per-sample Hessian-vector products and gradients to construct self-tuning quadratics.
We prove that our model-based procedure converges in the noisy gradient setting.
This is an interesting step for constructing self-tuning quadratics.
arXiv Detail & Related papers (2020-11-09T22:07:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.