Learning Versatile Optimizers on a Compute Diet
- URL: http://arxiv.org/abs/2501.12670v1
- Date: Wed, 22 Jan 2025 06:10:27 GMT
- Title: Learning Versatile Optimizers on a Compute Diet
- Authors: Abhinav Moudgil, Boris Knyazev, Guillaume Lajoie, Eugene Belilovsky
- Abstract summary: Key elements in learned optimizer architectures and meta-training procedures can lead to strong meta-generalization.
We propose evaluation metrics to reliably assess the quantitative performance of an optimizer at scale on a set of evaluation tasks.
Our proposed approach, Celo, makes a significant leap in improving the meta-generalization performance of learned optimizers.
- Score: 20.69804303768643
- License:
- Abstract: Learned optimization has emerged as a promising alternative to hand-crafted optimizers, with the potential to discover stronger learned update rules that enable faster, hyperparameter-free training of neural networks. A critical element for practically useful learned optimizers, that can be used off-the-shelf after meta-training, is strong meta-generalization: the ability to apply the optimizers to new tasks. Recent state-of-the-art work in learned optimizers, VeLO (Metz et al., 2022), requires a large number of highly diverse meta-training tasks along with massive computational resources, 4000 TPU months, to achieve meta-generalization. This makes further improvements to such learned optimizers impractical. In this work, we identify several key elements in learned optimizer architectures and meta-training procedures that can lead to strong meta-generalization. We also propose evaluation metrics to reliably assess quantitative performance of an optimizer at scale on a set of evaluation tasks. Our proposed approach, Celo, makes a significant leap in improving the meta-generalization performance of learned optimizers and also outperforms tuned state-of-the-art optimizers on a diverse set of out-of-distribution tasks, despite being meta-trained for just 24 GPU hours.
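To make the core idea concrete, the sketch below shows a minimal learned update rule: a tiny MLP that ingests per-parameter features (the gradient and a momentum accumulator) and outputs a parameter update. This is only an illustration of the general mechanism; the feature set, network size, and update scaling are assumptions, not Celo's or VeLO's actual architecture.

```python
# Minimal sketch of a learned update rule (illustrative only; NOT Celo's or
# VeLO's actual architecture). A tiny MLP ingests per-parameter features
# (gradient and a momentum accumulator) and outputs a parameter update.
import torch
import torch.nn as nn

class TinyLearnedOptimizer(nn.Module):
    def __init__(self, hidden: int = 32):
        super().__init__()
        # Applied element-wise: 2 input features -> 1 update per parameter.
        self.net = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def step(self, params, grads, momenta, beta: float = 0.9, scale: float = 0.01):
        new_params, new_momenta = [], []
        for p, g, m in zip(params, grads, momenta):
            m = beta * m + (1 - beta) * g                      # momentum feature
            feats = torch.stack([g.flatten(), m.flatten()], dim=-1)
            update = self.net(feats).view_as(p)                # learned update direction
            new_params.append(p - scale * update)              # small scale for stability
            new_momenta.append(m)
        return new_params, new_momenta
```

In practice the update network's weights are meta-trained by unrolling inner training over a distribution of tasks, which is where the meta-training compute quoted above (GPU or TPU hours) is spent.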
Related papers
- Narrowing the Focus: Learned Optimizers for Pretrained Models [24.685918556547055]
We propose a novel technique that learns a layer-specific linear combination of update directions provided by a set of base optimizers.
When evaluated on image classification tasks, this specialized optimizer significantly outperforms both traditional off-the-shelf methods such as Adam and existing general-purpose learned optimizers.
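A hedged sketch of the layer-wise combination idea described above follows; the particular base directions (gradient, momentum, sign) and the softmax-mixed per-layer weights are assumptions for illustration, not the paper's exact method.

```python
# Hedged sketch of a layer-wise linear combination of base update directions
# (not the paper's exact method; the choice of base directions and the softmax
# mixing weights below are assumptions for illustration).
import torch
import torch.nn as nn

class LayerwiseCombiner(nn.Module):
    def __init__(self, num_layers: int, num_bases: int = 3):
        super().__init__()
        # One meta-learned mixing weight per (layer, base direction).
        self.mix = nn.Parameter(torch.zeros(num_layers, num_bases))

    def step(self, params, grads, momenta, lr: float = 1e-3, beta: float = 0.9):
        new_params, new_momenta = [], []
        for i, (p, g, m) in enumerate(zip(params, grads, momenta)):
            m = beta * m + g
            # Candidate directions from simple "base optimizers": SGD, momentum, sign.
            directions = torch.stack([g, m, g.sign()])
            w = torch.softmax(self.mix[i], dim=0)              # per-layer weights
            update = (w.view(-1, *([1] * g.dim())) * directions).sum(dim=0)
            new_params.append(p - lr * update)
            new_momenta.append(m)
        return new_params, new_momenta
```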
arXiv Detail & Related papers (2024-08-17T23:55:19Z) - HUB: Guiding Learned Optimizers with Continuous Prompt Tuning [45.662334160254176]
Learned optimizers are a crucial component of meta-learning.
Recent advancements in scalable learned optimizers have demonstrated their superior performance over hand-designed optimizers in various tasks.
We propose a hybrid-update-based (HUB) optimization strategy to tackle the issue of generalization in scalable learned optimizers.
arXiv Detail & Related papers (2023-05-26T11:08:20Z) - Learning to Generalize Provably in Learning to Optimize [185.71326306329678]
Learning to optimize (L2O) has gained increasing popularity; it automates the design of optimizers through data-driven approaches.
Current L2O methods often suffer from poor generalization performance in at least two respects.
We propose to incorporate two flatness metrics as flatness-aware regularizers into the L2O framework.
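As a rough illustration of a flatness-aware penalty that could be added to a meta-training objective, the sketch below uses a generic SAM-style sharpness estimate; the paper's actual regularizers may be defined differently.

```python
# Hedged sketch of a flatness-aware penalty (a generic, SAM-style sharpness
# estimate; not the paper's exact regularizers).
import torch

def sharpness_penalty(loss_fn, params, rho: float = 0.05):
    """Estimated loss increase under a small gradient-aligned perturbation.

    `params` must be tensors with requires_grad=True; `loss_fn` maps a list of
    parameter tensors to a scalar loss.
    """
    loss = loss_fn(params)
    grads = torch.autograd.grad(loss, params)
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    # Perturb parameters in the ascent direction, scaled to a radius of rho.
    perturbed = [p + rho * g / grad_norm for p, g in zip(params, grads)]
    return loss_fn(perturbed) - loss.detach()
```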
arXiv Detail & Related papers (2023-02-22T01:17:31Z) - Learning to Optimize for Reinforcement Learning [58.01132862590378]
Reinforcement learning (RL) is essentially different from supervised learning, and in practice these learned optimizers do not work well even in simple RL tasks.
The agent-gradient distribution is not independent and identically distributed, which leads to inefficient meta-training.
We show that, although trained only on toy tasks, our learned optimizer can generalize to unseen complex tasks in Brax.
arXiv Detail & Related papers (2023-02-03T00:11:02Z) - VeLO: Training Versatile Learned Optimizers by Scaling Up [67.90237498659397]
We leverage the same scaling approach behind the success of deep learning to learn versatile optimizers.
We train an optimizer for deep learning which is itself a small neural network that ingests gradients and outputs parameter updates.
We open source our learned optimizer, meta-training code, the associated train and test data, and an extensive benchmark suite with baselines at velo-code.io.
arXiv Detail & Related papers (2022-11-17T18:39:07Z) - A Closer Look at Learned Optimization: Stability, Robustness, and Inductive Biases [44.01339030872185]
Blackbox learned optimizers often struggle with stability and generalization when applied to tasks unlike those in their meta-training set.
We investigate the inductive biases and stability properties of optimization algorithms, and apply the resulting insights to designing inductive biases for blackbox learned optimizers.
We apply the resulting learned optimizer to a variety of neural network training tasks, where it outperforms the current state-of-the-art learned optimizer.
arXiv Detail & Related papers (2022-09-22T17:47:21Z) - Training Learned Optimizers with Randomly Initialized Learned Optimizers [49.67678615506608]
We show that a population of randomly initialized learned optimizers can be used to train themselves from scratch in an online fashion.
A form of population-based training is used to orchestrate this self-training.
We believe feedback loops of this type will be important and powerful in the future of machine learning.
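A hedged sketch of how population-based training might orchestrate such self-training is shown below; the meta_loss stand-in, population size, and perturbation scheme are assumptions rather than the paper's setup.

```python
# Hedged sketch of population-based self-training (illustrative only; `meta_loss`
# is a stand-in for "train a task with this learned-optimizer parameter vector
# and report how well it did", and the perturbation scheme is an assumption).
import torch

def meta_loss(theta: torch.Tensor) -> float:
    # Stand-in objective; in the self-training setting, evaluating theta would
    # itself involve training with optimizers drawn from the population.
    return float((theta ** 2).sum())

def pbt_self_train(pop_size: int = 8, dim: int = 16, steps: int = 50):
    population = [torch.randn(dim) for _ in range(pop_size)]
    for _ in range(steps):
        ranked = sorted(range(pop_size), key=lambda i: meta_loss(population[i]))
        # Exploit: the bottom half copies the top half. Explore: perturb the copies.
        for loser in ranked[pop_size // 2:]:
            winner = ranked[int(torch.randint(pop_size // 2, (1,)))]
            population[loser] = population[winner].clone() + 0.1 * torch.randn(dim)
    return min(population, key=meta_loss)
```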
arXiv Detail & Related papers (2021-01-14T19:07:17Z) - Reverse engineering learned optimizers reveals known and novel mechanisms [50.50540910474342]
Learned optimizers are algorithms that can themselves be trained to solve optimization problems.
Our results help elucidate the previously murky understanding of how learned optimizers work, and establish tools for interpreting future learned optimizers.
arXiv Detail & Related papers (2020-11-04T07:12:43Z) - Tasks, stability, architecture, and compute: Training more effective learned optimizers, and using them to train themselves [53.37905268850274]
We introduce a new, neural network parameterized, hierarchical optimizer with access to additional features such as validation loss to enable automatic regularization.
Most learned optimizers have been trained on only a single task, or a small number of tasks.
We train ours on thousands of tasks, making use of orders of magnitude more compute, resulting in optimizers that generalize better to unseen tasks.
arXiv Detail & Related papers (2020-09-23T16:35:09Z)