MADA: Meta-Adaptive Optimizers through hyper-gradient Descent
- URL: http://arxiv.org/abs/2401.08893v3
- Date: Mon, 17 Jun 2024 12:52:24 GMT
- Title: MADA: Meta-Adaptive Optimizers through hyper-gradient Descent
- Authors: Kaan Ozkara, Can Karakus, Parameswaran Raman, Mingyi Hong, Shoham Sabach, Branislav Kveton, Volkan Cevher
- Abstract summary: We introduce Meta-Adaptive Optimizers (MADA), a unified optimizer framework that can generalize several known optimizers and dynamically learn the most suitable one during training.
We empirically compare MADA to other popular optimizers on vision and language tasks, and find that MADA consistently outperforms Adam and other popular optimizers.
We also propose AVGrad, a modification of AMSGrad that replaces the maximum operator with averaging, which is more suitable for hyper-gradient optimization.
- Score: 73.1383658672682
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Following the introduction of Adam, several novel adaptive optimizers for deep learning have been proposed. These optimizers typically excel in some tasks but may not outperform Adam uniformly across all tasks. In this work, we introduce Meta-Adaptive Optimizers (MADA), a unified optimizer framework that can generalize several known optimizers and dynamically learn the most suitable one during training. The key idea in MADA is to parameterize the space of optimizers and dynamically search through it using hyper-gradient descent during training. We empirically compare MADA to other popular optimizers on vision and language tasks, and find that MADA consistently outperforms Adam and other popular optimizers, and is robust against sub-optimally tuned hyper-parameters. MADA achieves a greater validation performance improvement over Adam compared to other popular optimizers during GPT-2 training and fine-tuning. We also propose AVGrad, a modification of AMSGrad that replaces the maximum operator with averaging, which is more suitable for hyper-gradient optimization. Finally, we provide a convergence analysis to show that parameterized interpolations of optimizers can improve their error bounds (up to constants), hinting at an advantage for meta-optimizers.
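To make the key idea concrete, here is a minimal sketch of a MADA-style step in Python. It is an illustration under stated assumptions, not the paper's reference implementation: the parameterized optimizer space is reduced to a single learnable coefficient `c` interpolating between a momentum-SGD update and an Adam-style update, the second moment uses AVGrad's running average in place of AMSGrad's running maximum, and `c` is adapted by a one-step hyper-gradient estimate.

```python
import numpy as np

def mada_sketch(grad_fn, theta, steps=100, lr=1e-2, hyper_lr=1e-3,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """Sketch of MADA-style training for a 1-D parameter vector `theta`.

    `c` interpolates between momentum-SGD (c=0) and an Adam-style update
    (c=1) and is itself learned online by hyper-gradient descent.
    """
    m = np.zeros_like(theta)           # first moment
    v = np.zeros_like(theta)           # second moment (Adam-style EMA)
    v_avg = np.zeros_like(theta)       # AVGrad: running average of v
    c = 0.5                            # learnable interpolation coefficient
    prev_diff = np.zeros_like(theta)   # d(update)/dc from the previous step

    for t in range(1, steps + 1):
        g = grad_fn(theta)

        # One-step hyper-gradient: theta_t depends on c via the previous
        # update, so dL/dc ~= g . d(theta_t)/dc = -lr * (g . prev_diff).
        c -= hyper_lr * (-lr * np.dot(g, prev_diff))
        c = float(np.clip(c, 0.0, 1.0))

        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        v_avg = ((t - 1) * v_avg + v) / t       # averaging, not AMSGrad max

        u_adam = m / (np.sqrt(v_avg) + eps)     # Adam-style direction
        u_mom = m                               # momentum-SGD direction
        update = c * u_adam + (1 - c) * u_mom   # interpolated optimizer
        prev_diff = u_adam - u_mom

        theta = theta - lr * update
    return theta, c
```

For instance, `mada_sketch(lambda x: 2 * x, np.ones(5))` minimizes a quadratic while adapting `c` online; the full MADA framework parameterizes many more optimizer choices than this single coefficient.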
Related papers
- Should I try multiple optimizers when fine-tuning pre-trained Transformers for NLP tasks? Should I tune their hyperparameters? [14.349943044268471]
Some variant of Stochastic Gradient Descent (SGD) is typically employed to train neural models.
Tuning just the learning rate is in most cases as good as tuning all the hyperparameters.
We recommend picking any of the best-behaved adaptive optimizers (e.g., Adam) and tuning its learning rate.
arXiv Detail & Related papers (2024-02-10T13:26:14Z)
- AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z)
- CoRe Optimizer: An All-in-One Solution for Machine Learning [0.0]
The continual resilient (CoRe) optimizer has shown superior performance compared to other state-of-the-art first-order gradient-based optimizers.
CoRe yields the best or competitive performance in every investigated application.
arXiv Detail & Related papers (2023-07-28T16:48:42Z)
- Bidirectional Looking with A Novel Double Exponential Moving Average to Adaptive and Non-adaptive Momentum Optimizers [109.52244418498974]
We propose a novel Admeta (A Double exponential Moving averagE To Adaptive and non-adaptive momentum) framework.
We provide two implementations, AdmetaR and AdmetaS, the former based on RAdam and the latter based on SGDM.
arXiv Detail & Related papers (2023-07-02T18:16:06Z)
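The double exponential moving average behind Admeta can be illustrated with the standard DEMA smoother; this is a generic sketch of the smoothing idea, not the paper's actual update rules:

```python
import numpy as np

def dema(values, beta=0.9):
    """Double exponential moving average: 2*EMA - EMA(EMA).

    The second-level EMA cancels most of the lag of a single EMA;
    Admeta-style optimizers apply this kind of smoothing to gradient
    statistics (a sketch of the smoother, not the paper's rule).
    """
    ema = ema2 = 0.0
    out = []
    for x in values:
        ema = beta * ema + (1 - beta) * x       # first-level EMA
        ema2 = beta * ema2 + (1 - beta) * ema   # EMA of the EMA
        out.append(2 * ema - ema2)              # de-lagged estimate
    return np.array(out)
```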
- Judging Adam: Studying the Performance of Optimization Methods on ML4SE Tasks [2.8961929092154697]
We test the performance of various optimizers on deep learning models for source code.
We find that the choice of an optimizer can have a significant impact on the model quality.
We suggest that the ML4SE community should consider using RAdam instead of Adam as the default optimizer for code-related deep learning tasks.
arXiv Detail & Related papers (2023-03-06T22:49:20Z)
- VeLO: Training Versatile Learned Optimizers by Scaling Up [67.90237498659397]
We leverage the same scaling approach behind the success of deep learning to learn versatile optimizers.
We train an optimizer for deep learning which is itself a small neural network that ingests gradients and outputs parameter updates.
We open source our learned optimizer, meta-training code, the associated train and test data, and an extensive benchmark suite with baselines at velo-code.io.
arXiv Detail & Related papers (2022-11-17T18:39:07Z)
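The general shape of such a learned optimizer can be sketched as below; this toy network is an assumption for illustration (VeLO's actual architecture is a meta-trained per-tensor hypernetwork, not a random MLP):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy learned optimizer: a tiny MLP maps per-parameter features
# (gradient, momentum) to an update.  In a system like VeLO these
# weights come from large-scale meta-training; here they are random
# placeholders just to show the data flow.
W1 = rng.normal(scale=0.1, size=(2, 8))
W2 = rng.normal(scale=0.1, size=(8, 1))

def learned_update(grad, momentum):
    """Return one update vector given per-parameter gradient features."""
    feats = np.stack([grad, momentum], axis=-1)  # (n_params, 2)
    hidden = np.tanh(feats @ W1)                 # (n_params, 8)
    return (hidden @ W2)[..., 0]                 # (n_params,)
```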
- Optimizer Amalgamation [124.33523126363728]
We are motivated to study a new problem named optimizer amalgamation: how can we best combine a pool of "teacher" optimizers into a single "student" optimizer that can have stronger problem-specific performance?
First, we define three differentiable mechanisms to amalgamate a pool of analytical optimizers by gradient descent.
To reduce variance, we also explore methods to stabilize the amalgamation process by perturbing the amalgamation target.
arXiv Detail & Related papers (2022-03-12T16:07:57Z)
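One natural differentiable amalgamation mechanism is a softmax-weighted mixture of teacher updates; the weighting below is an illustrative assumption, not necessarily one of the paper's three mechanisms:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

def amalgamated_update(teacher_updates, logits):
    """Mix per-teacher update vectors with learnable softmax weights.

    `teacher_updates`: one update vector per teacher optimizer (e.g. the
    steps SGD, Adam, and RMSProp would each take); `logits`: the student's
    mixing parameters, trainable by gradient descent because the
    combination is differentiable in them.
    """
    w = softmax(logits)
    return sum(wi * u for wi, u in zip(w, teacher_updates))
```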
- Optimizing Large-Scale Hyperparameters via Automated Learning Algorithm [97.66038345864095]
We propose a new hyperparameter optimization method with zeroth-order hyper-gradients (HOZOG).
Specifically, we first formulate hyperparameter optimization as an A-based constrained optimization problem, where A is a black-box optimization algorithm.
Then, we use the average zeroth-order hyper-gradients to update hyperparameters.
arXiv Detail & Related papers (2021-02-17T21:03:05Z)
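Zeroth-order hyper-gradients can be estimated by averaging random finite-difference probes of the black-box training pipeline, as in the generic sketch below (the exact HOZOG estimator is defined in the paper):

```python
import numpy as np

def zeroth_order_hypergrad(loss_fn, hparams, mu=1e-2, n_samples=8, rng=None):
    """Estimate d(loss)/d(hparams) from function evaluations only.

    `loss_fn(hparams)` runs the black-box training pipeline and returns
    a validation loss; no gradient through training is required.
    """
    rng = rng or np.random.default_rng()
    base = loss_fn(hparams)
    est = np.zeros_like(hparams)
    for _ in range(n_samples):
        u = rng.normal(size=hparams.shape)         # random probe direction
        delta = loss_fn(hparams + mu * u) - base   # directional change
        est += (delta / mu) * u
    return est / n_samples                         # averaged estimate
```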
- Mixing ADAM and SGD: a Combined Optimization Method [0.9569316316728905]
We propose a new type of optimization method called MAS (Mixing ADAM and SGD).
Rather than trying to improve SGD or ADAM, we exploit both at the same time by taking the best of both.
We have conducted several experiments on image and text document classification, using various CNNs, and we demonstrate that the proposed MAS produces better performance than SGD or ADAM alone.
arXiv Detail & Related papers (2020-11-16T15:48:38Z)
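The mixing idea can be sketched as a weighted sum of an SGD displacement and an Adam displacement; the equal fixed weights here are an illustrative assumption, while the paper derives its own combination:

```python
import numpy as np

def mas_step(theta, g, state, lr=1e-3, w_sgd=0.5, w_adam=0.5,
             beta1=0.9, beta2=0.999, eps=1e-8):
    """One MAS-style step: a weighted sum of an SGD and an Adam step."""
    state["t"] += 1
    t = state["t"]
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    state["v"] = beta2 * state["v"] + (1 - beta2) * g * g
    m_hat = state["m"] / (1 - beta1 ** t)    # bias-corrected moments
    v_hat = state["v"] / (1 - beta2 ** t)
    adam_dir = m_hat / (np.sqrt(v_hat) + eps)
    sgd_dir = g
    return theta - lr * (w_sgd * sgd_dir + w_adam * adam_dir)
```

Initialize with `state = {"t": 0, "m": np.zeros_like(theta), "v": np.zeros_like(theta)}` and call `mas_step` once per minibatch.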