Practical tradeoffs between memory, compute, and performance in learned
optimizers
- URL: http://arxiv.org/abs/2203.11860v1
- Date: Tue, 22 Mar 2022 16:36:36 GMT
- Title: Practical tradeoffs between memory, compute, and performance in learned
optimizers
- Authors: Luke Metz, C. Daniel Freeman, James Harrison, Niru Maheswaranathan,
Jascha Sohl-Dickstein
- Abstract summary: We identify and quantify the design features governing the memory, compute, and performance trade-offs for many learned and hand-designed optimizers.
We leverage our analysis to construct a learned optimizer that is both faster and more memory efficient than previous work.
- Score: 46.04132441790654
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Optimization plays a costly and crucial role in developing machine learning
systems. In learned optimizers, the few hyperparameters of commonly used
hand-designed optimizers, e.g. Adam or SGD, are replaced with flexible
parametric functions. The parameters of these functions are then optimized so
that the resulting learned optimizer minimizes a target loss on a chosen class
of models. Learned optimizers can both reduce the number of required training
steps and improve the final test loss. However, they can be expensive to train,
and once trained can be expensive to use due to computational and memory
overhead for the optimizer itself. In this work, we identify and quantify the
design features governing the memory, compute, and performance trade-offs for
many learned and hand-designed optimizers. We further leverage our analysis to
construct a learned optimizer that is both faster and more memory efficient
than previous work.
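To make the setup above concrete, here is a minimal sketch in JAX of what replacing a hand-designed update rule with a flexible parametric function can look like: a small MLP applied independently to every parameter. The two-layer network, its feature set (gradient, momentum, parameter value), and all sizes are illustrative assumptions, not the architecture studied in the paper.

```python
# Illustrative per-parameter learned optimizer (JAX). The MLP, features, and
# sizes are assumptions for exposition, not the paper's architecture.
import jax
import jax.numpy as jnp

def init_lopt(key, hidden=32, n_features=3):
    k1, k2 = jax.random.split(key)
    return {
        "w1": jax.random.normal(k1, (n_features, hidden)) * 0.1,
        "b1": jnp.zeros(hidden),
        "w2": jax.random.normal(k2, (hidden, 1)) * 0.1,
        "b2": jnp.zeros(1),
    }

def lopt_update(theta, params, grads, momentum, beta=0.9):
    """Apply the learned update rule elementwise to every parameter tensor."""
    new_momentum = jax.tree_util.tree_map(
        lambda m, g: beta * m + (1.0 - beta) * g, momentum, grads)

    def per_tensor(p, g, m):
        # Features fed to the learned rule: gradient, momentum, parameter value.
        feats = jnp.stack([g, m, p], axis=-1)            # [..., 3]
        h = jnp.tanh(feats @ theta["w1"] + theta["b1"])  # [..., hidden]
        step = (h @ theta["w2"] + theta["b2"])[..., 0]   # [...]
        return p - 1e-3 * step                           # small output scale for stability

    new_params = jax.tree_util.tree_map(per_tensor, params, grads, new_momentum)
    return new_params, new_momentum
```

Compared with Adam, which keeps two accumulators per parameter, a rule like this keeps one momentum accumulator per parameter plus a parameter-count-independent set of MLP weights, and spends extra compute evaluating the MLP elementwise. Which features, accumulators, and network sizes to include is precisely the memory/compute/performance trade-off the paper quantifies.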
Related papers
- AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z)
- CoRe Optimizer: An All-in-One Solution for Machine Learning [0.0]
The continual resilient (CoRe) optimizer has shown superior performance compared to other state-of-the-art first-order gradient-based optimization algorithms.
CoRe yields the best or competitive performance in every investigated application.
arXiv Detail & Related papers (2023-07-28T16:48:42Z)
- Judging Adam: Studying the Performance of Optimization Methods on ML4SE Tasks [2.8961929092154697]
We test the performance of various optimizers on deep learning models for source code.
We find that the choice of optimizer can have a significant impact on model quality.
We suggest that the ML4SE community should consider using RAdam instead of Adam as the default optimizer for code-related deep learning tasks.
arXiv Detail & Related papers (2023-03-06T22:49:20Z)
- VeLO: Training Versatile Learned Optimizers by Scaling Up [67.90237498659397]
We leverage the same scaling approach behind the success of deep learning to learn versatile optimizers.
We train an optimizer for deep learning which is itself a small neural network that ingests gradients and outputs parameter updates.
We open source our learned optimizers, meta-training code, the associated train and test data, and an extensive benchmark suite with baselines at velo-code.io.
arXiv Detail & Related papers (2022-11-17T18:39:07Z)
- Training Learned Optimizers with Randomly Initialized Learned Optimizers [49.67678615506608]
We show that a population of randomly initialized learned optimizers can be used to train themselves from scratch in an online fashion.
A form of population based training is used to orchestrate this self-training.
We believe feedback loops of this type will be important and powerful in the future of machine learning.
arXiv Detail & Related papers (2021-01-14T19:07:17Z)
- Reverse engineering learned optimizers reveals known and novel mechanisms [50.50540910474342]
Learned optimizers are algorithms that can themselves be trained to solve optimization problems.
Our results help elucidate the previously murky understanding of how learned optimizers work, and establish tools for interpreting future learned optimizers.
arXiv Detail & Related papers (2020-11-04T07:12:43Z)
- Tasks, stability, architecture, and compute: Training more effective learned optimizers, and using them to train themselves [53.37905268850274]
We introduce a new, neural-network-parameterized, hierarchical optimizer with access to additional features such as validation loss to enable automatic regularization.
Most learned optimizers have been trained on only a single task, or a small number of tasks.
We train ours on thousands of tasks, making use of orders of magnitude more compute, resulting in optimizers that generalize better to unseen tasks (a minimal meta-training loop in this spirit is sketched after this list).
arXiv Detail & Related papers (2020-09-23T16:35:09Z)
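As a companion to the update-rule sketch after the abstract, the snippet below (also illustrative, in JAX) shows the outer meta-training loop described by the abstract and by the "Tasks, stability, architecture, and compute" entry: the optimizer's weights theta are adjusted so that, after a short unrolled inner training run on a sampled task, the task loss is low. Differentiating through the unroll is shown only for brevity; at scale, truncated unrolls or evolution strategies are typically used instead. sample_task and its quadratic loss are hypothetical stand-ins for "a chosen class of models", and init_lopt / lopt_update refer to the earlier sketch.

```python
# Illustrative meta-training loop (JAX). Assumes init_lopt/lopt_update from
# the sketch following the abstract; the quadratic sample_task is a
# hypothetical stand-in for "a chosen class of models".
import jax
import jax.numpy as jnp

def sample_task(key, dim=10):
    # Toy inner problem: drive params["x"] toward a random target vector.
    target = jax.random.normal(key, (dim,))
    inner_loss = lambda params: jnp.sum((params["x"] - target) ** 2)
    init_params = {"x": jnp.zeros(dim)}
    return inner_loss, init_params

def meta_loss(theta, key, unroll_steps=10):
    # Run the learned optimizer for a few inner steps, then score the result.
    inner_loss, params = sample_task(key)
    momentum = jax.tree_util.tree_map(jnp.zeros_like, params)
    for _ in range(unroll_steps):
        grads = jax.grad(inner_loss)(params)
        params, momentum = lopt_update(theta, params, grads, momentum)
    return inner_loss(params)  # the target loss the optimizer is trained to minimize

@jax.jit
def meta_step(theta, key, meta_lr=1e-2):
    # Plain gradient descent on the optimizer's own weights theta.
    loss, g = jax.value_and_grad(meta_loss)(theta, key)
    theta = jax.tree_util.tree_map(lambda t, gt: t - meta_lr * gt, theta, g)
    return theta, loss

key = jax.random.PRNGKey(0)
theta = init_lopt(key)
for step in range(200):
    key, task_key = jax.random.split(key)
    theta, loss = meta_step(theta, task_key)
```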
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.