Related papers: Narrowing the Focus: Learned Optimizers for Pretrained Models

URL: http://arxiv.org/abs/2408.09310v3
Date: Sat, 5 Oct 2024 01:26:40 GMT
Title: Narrowing the Focus: Learned Optimizers for Pretrained Models
Authors: Gus Kristiansen, Mark Sandler, Andrey Zhmoginov, Nolan Miller, Anirudh Goyal, Jihwan Lee, Max Vladymyrov
Abstract summary: We propose a novel optimizer technique that learns a layer-specific linear combination of update directions provided by a set of base optimizers. When evaluated on image classification tasks, this specialized optimizer significantly outperforms both traditional off-the-shelf methods such as Adam and existing general learned optimizers.
Score: 24.685918556547055
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In modern deep learning, the models are learned by applying gradient updates using an optimizer, which transforms the updates based on various statistics. Optimizers are often hand-designed and tuning their hyperparameters is a big part of the training process. Learned optimizers have shown some initial promise, but are generally unsuccessful as a general optimization mechanism applicable to every problem. In this work we explore a different direction: instead of learning general optimizers, we instead specialize them to a specific training environment. We propose a novel optimizer technique that learns a layer-specific linear combination of update directions provided by a set of base optimizers, effectively adapting its strategy to the specific model and dataset. When evaluated on image classification tasks, this specialized optimizer significantly outperforms both traditional off-the-shelf methods such as Adam, as well as existing general learned optimizers. Moreover, it demonstrates robust generalization with respect to model initialization, evaluating on unseen datasets, and training durations beyond its meta-training horizon.
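
The core idea in the abstract, a layer-specific linear combination of update directions from a set of base optimizers, can be illustrated with a minimal sketch. This is not the authors' implementation: the choice of base optimizers, the function names (`combined_step`, `sgd_direction`, and so on), and the fixed coefficient values are all assumptions made for illustration. In the paper the combination weights are meta-learned; here they are simply hard-coded constants.

```python
# Illustrative sketch only. All names and coefficient values are hypothetical;
# in the paper the per-layer coefficients are meta-learned, not hand-set.

def sgd_direction(grad, state):
    # Plain negative gradient.
    return [-g for g in grad]

def momentum_direction(grad, state, beta=0.9):
    # Negative of a running (heavy-ball) accumulation of gradients.
    m = state.setdefault("m", [0.0] * len(grad))
    for i, g in enumerate(grad):
        m[i] = beta * m[i] + g
    return [-v for v in m]

def sign_direction(grad, state):
    # Negative sign of the gradient (a crude stand-in for a normalized update).
    return [-1.0 if g > 0 else (1.0 if g < 0 else 0.0) for g in grad]

BASE_OPTIMIZERS = (sgd_direction, momentum_direction, sign_direction)

def combined_step(params, grads, coeffs, states):
    """Apply one update: each layer mixes the base directions with its own weights.

    params, grads: dict mapping layer name -> list of floats
    coeffs:        dict mapping layer name -> one weight per base optimizer
    states:        dict mapping layer name -> one state dict per base optimizer
    """
    new_params = {}
    for layer, weights in params.items():
        directions = [
            opt(grads[layer], st)
            for opt, st in zip(BASE_OPTIMIZERS, states[layer])
        ]
        new_params[layer] = [
            w + sum(c * d[i] for c, d in zip(coeffs[layer], directions))
            for i, w in enumerate(weights)
        ]
    return new_params

# Toy usage: two "layers", each with its own mixing coefficients.
params = {"conv": [1.0, -2.0], "head": [0.5]}
grads = {"conv": [0.5, -1.0], "head": [2.0]}
coeffs = {"conv": [0.1, 0.0, 0.0], "head": [0.0, 0.0, 0.01]}
states = {layer: [{} for _ in BASE_OPTIMIZERS] for layer in params}

new_params = combined_step(params, grads, coeffs, states)
```

In this toy setup the "conv" layer follows only the SGD direction and the "head" layer follows only the sign direction, which shows how different layers can end up with different effective optimizers under a single update rule.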

Related papers

Learning Versatile Optimizers on a Compute Diet [20.69804303768643]
Key elements in learned optimizer architectures and meta-training procedures can lead to strong meta-generalization. We propose evaluation metrics to reliably assess the quantitative performance of an optimizer at scale on a set of evaluation tasks. Our proposed approach, Celo, makes a significant leap in improving the meta-generalization performance of learned optimizers.
arXiv Detail & Related papers (2025-01-22T06:10:27Z)
PROFIT: A Specialized Optimizer for Deep Fine Tuning [9.082267858686933]
We present PROFIT (Proximally Restricted Optimizer For Iterative Training), one of the first optimizers specifically designed for incrementally fine-tuning converged models on new tasks or datasets. By employing a simple temporalization process, PROFIT outperforms traditional fine-tuning methods across various tasks. PROFIT is encapsulated within the optimizer logic, making it easy to integrate into any training pipeline with minimal engineering effort.
arXiv Detail & Related papers (2024-12-02T19:37:34Z)
Unlearning as multi-task optimization: A normalized gradient difference approach with an adaptive learning rate [105.86576388991713]
We introduce a normalized gradient difference (NGDiff) algorithm, enabling us to have better control over the trade-off between the objectives. We provide a theoretical analysis and empirically demonstrate the superior performance of NGDiff among state-of-the-art unlearning methods on the TOFU and MUSE datasets.
arXiv Detail & Related papers (2024-10-29T14:41:44Z)
Multiplicative update rules for accelerating deep learning training and increasing robustness [69.90473612073767]
We propose an optimization framework that fits a wide range of machine learning algorithms and enables one to apply alternative update rules. We claim that the proposed framework accelerates training while leading to more robust models, in contrast to the traditionally used additive update rule.
arXiv Detail & Related papers (2023-07-14T06:44:43Z)
Meta-Learning Parameterized First-Order Optimizers using Differentiable Convex Optimization [13.043909705693249]
We propose a meta-learning framework in which the inner loop optimization step involves solving a differentiable convex optimization (DCO). We illustrate the theoretical appeal of this approach by showing that it enables one-step optimization of a family of linear least squares problems.
arXiv Detail & Related papers (2023-03-29T18:17:41Z)
Learning to Optimize with Dynamic Mode Decomposition [0.0]
We show how to utilize the dynamic mode decomposition method for extracting informative features about optimization dynamics. We show that our learned optimizer generalizes much better to unseen optimization problems in short.
arXiv Detail & Related papers (2022-11-29T14:55:59Z)
VeLO: Training Versatile Learned Optimizers by Scaling Up [67.90237498659397]
We leverage the same scaling approach behind the success of deep learning to learn versatile optimizers. We train an optimizer for deep learning which is itself a small neural network that ingests gradients and outputs parameter updates. We open source our learned optimizer, meta-training code, the associated train and test data, and an extensive benchmark suite with baselines at velo-code.io.
arXiv Detail & Related papers (2022-11-17T18:39:07Z)
A Closer Look at Learned Optimization: Stability, Robustness, and Inductive Biases [44.01339030872185]
Blackbox learned optimizers often struggle with stability and generalization when applied to tasks unlike those in their meta-training set. We investigate the inductive biases and stability properties of optimization algorithms, and apply the resulting insights to designing inductive biases for blackbox optimizers. We apply the resulting learned optimizer to a variety of neural network training tasks, where it outperforms the current state of the art learned optimizer.
arXiv Detail & Related papers (2022-09-22T17:47:21Z)
Pre-training helps Bayesian optimization too [49.28382118032923]
We seek an alternative practice for setting functional priors. In particular, we consider the scenario where we have data from similar functions that allow us to pre-train a tighter distribution a priori. Our results show that our method is able to locate good hyperparameters at least 3 times more efficiently than the best competing methods.
arXiv Detail & Related papers (2022-07-07T04:42:54Z)
Adaptive Optimization with Examplewise Gradients [23.504973357538418]
We propose a new, more general approach to the design of gradient-based optimization methods for machine learning. In this new framework, iterations assume access to a batch of estimates per parameter, rather than a single estimate. This better reflects the information that is actually available in typical machine learning setups.
arXiv Detail & Related papers (2021-11-30T23:37:01Z)
Training Learned Optimizers with Randomly Initialized Learned Optimizers [49.67678615506608]
We show that a population of randomly initialized learned optimizers can be used to train themselves from scratch in an online fashion. A form of population based training is used to orchestrate this self-training. We believe feedback loops of this type will be important and powerful in the future of machine learning.
arXiv Detail & Related papers (2021-01-14T19:07:17Z)
Reverse engineering learned optimizers reveals known and novel mechanisms [50.50540910474342]
Learned optimizers are algorithms that can themselves be trained to solve optimization problems. Our results help elucidate the previously murky understanding of how learned optimizers work, and establish tools for interpreting future learned optimizers.
arXiv Detail & Related papers (2020-11-04T07:12:43Z)

This list is automatically generated from the titles and abstracts of the papers on this site.