Should I try multiple optimizers when fine-tuning pre-trained
Transformers for NLP tasks? Should I tune their hyperparameters?
- URL: http://arxiv.org/abs/2402.06948v1
- Date: Sat, 10 Feb 2024 13:26:14 GMT
- Title: Should I try multiple optimizers when fine-tuning pre-trained
Transformers for NLP tasks? Should I tune their hyperparameters?
- Authors: Nefeli Gkouti, Prodromos Malakasiotis, Stavros Toumpis, Ion
Androutsopoulos
- Abstract summary: Some variant of Stochastic Gradient Descent (SGD) is typically employed to train neural models, selected among numerous variants with minimal tuning.
Tuning just the learning rate is in most cases as good as tuning all the hyperparameters.
We recommend picking any of the best-behaved adaptive optimizers (e.g., Adam) and tuning only its learning rate.
- Score: 14.349943044268471
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: NLP research has explored different neural model architectures and sizes,
datasets, training objectives, and transfer learning techniques. However, the
choice of optimizer during training has not been explored as extensively.
Typically, some variant of Stochastic Gradient Descent (SGD) is employed,
selected among numerous variants, using unclear criteria, often with minimal or
no tuning of the optimizer's hyperparameters. Experimenting with five GLUE
datasets, two models (DistilBERT and DistilRoBERTa), and seven popular
optimizers (SGD, SGD with Momentum, Adam, AdaMax, Nadam, AdamW, and AdaBound),
we find that when the hyperparameters of the optimizers are tuned, there is no
substantial difference in test performance across the five more elaborate
(adaptive) optimizers, despite differences in training loss. Furthermore,
tuning just the learning rate is in most cases as good as tuning all the
hyperparameters. Hence, we recommend picking any of the best-behaved adaptive
optimizers (e.g., Adam) and tuning only its learning rate. When no
hyperparameter can be tuned, SGD with Momentum is the best choice.
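As a rough illustration of this recommendation, the sketch below runs a learning-rate-only grid search for Adam and falls back to SGD with Momentum when no tuning budget is available. The tiny linear model, random data, and grid values are assumptions standing in for the paper's actual DistilBERT/GLUE setup, not a reproduction of it.

```python
# Minimal sketch of the recommended protocol: fix the optimizer to Adam and
# tune only its learning rate on a dev set. The linear model and random data
# below are stand-ins for a pre-trained Transformer fine-tuned on a GLUE task.
import torch
from torch import nn

torch.manual_seed(0)
X_train, y_train = torch.randn(256, 32), torch.randint(0, 2, (256,))
X_dev, y_dev = torch.randn(64, 32), torch.randint(0, 2, (64,))

def train_and_eval(make_optimizer, epochs=20):
    model = nn.Linear(32, 2)            # stand-in for the pre-trained model
    opt = make_optimizer(model.parameters())
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(X_train), y_train).backward()
        opt.step()
    with torch.no_grad():
        return (model(X_dev).argmax(dim=1) == y_dev).float().mean().item()

# Tune only Adam's learning rate; all other hyperparameters stay at defaults.
# The grid values below are an assumed example, not the paper's search space.
grid = [1e-5, 3e-5, 1e-4, 3e-4, 1e-3]
scores = {lr: train_and_eval(lambda p, lr=lr: torch.optim.Adam(p, lr=lr)) for lr in grid}
best_lr = max(scores, key=scores.get)
print(f"best Adam learning rate on dev: {best_lr} (acc={scores[best_lr]:.3f})")

# With no tuning budget at all, the paper suggests SGD with Momentum instead.
untuned_acc = train_and_eval(lambda p: torch.optim.SGD(p, lr=1e-2, momentum=0.9))
print(f"untuned SGD+Momentum dev accuracy: {untuned_acc:.3f}")
```

The model is re-initialized for every learning rate so the runs are directly comparable on the dev set.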
Related papers
- Adaptive Preference Scaling for Reinforcement Learning with Human Feedback [103.36048042664768]
Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values.
We propose a novel adaptive preference loss, underpinned by distributionally robust optimization (DRO).
Our method is versatile and can be readily adapted to various preference optimization frameworks.
arXiv Detail & Related papers (2024-06-04T20:33:22Z) - Dynamic Adapter Meets Prompt Tuning: Parameter-Efficient Transfer Learning for Point Cloud Analysis [51.14136878142034]
Point cloud analysis has achieved outstanding performance by transferring point cloud pre-trained models.
Existing methods for model adaptation usually update all model parameters, which is inefficient as it incurs high computational costs.
In this paper, we aim to study parameter-efficient transfer learning for point cloud analysis with an ideal trade-off between task performance and parameter efficiency.
arXiv Detail & Related papers (2024-03-03T08:25:04Z) - MADA: Meta-Adaptive Optimizers through hyper-gradient Descent [73.1383658672682]
We introduce Meta-Adaptive Optimizers (MADA), a unified framework that can generalize several known optimizers and dynamically learn the most suitable one during training (a generic hyper-gradient sketch appears after this list).
We empirically compare MADA to other popular optimizers on vision and language tasks, and find that MADA consistently outperforms Adam and other popular optimizers.
We also propose AVGrad, a modification of AMSGrad that replaces the maximum operator with averaging, which is more suitable for hyper-gradient optimization.
arXiv Detail & Related papers (2024-01-17T00:16:46Z) - AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z) - VeLO: Training Versatile Learned Optimizers by Scaling Up [67.90237498659397]
We leverage the same scaling approach behind the success of deep learning to learn versatile optimizers.
We train an optimizer for deep learning which is itself a small neural network that ingests gradients and outputs parameter updates.
We open source our learned optimizers, meta-training code, the associated train and test data, and an extensive benchmark suite with baselines at velo-code.io.
arXiv Detail & Related papers (2022-11-17T18:39:07Z) - The Role of Adaptive Optimizers for Honest Private Hyperparameter
Selection [12.38071940409141]
We show that standard composition tools outperform more advanced techniques in many settings.
We draw upon the limiting behaviour of Adam in the DP setting to design a new and more efficient tool.
arXiv Detail & Related papers (2021-11-09T01:56:56Z) - Gravity Optimizer: a Kinematic Approach on Optimization in Deep Learning [0.0]
We introduce Gravity, another algorithm for gradient-based optimization.
In this paper, we explain how our novel idea changes parameters to reduce the deep learning model's loss.
Also, we propose an alternative to moving average.
arXiv Detail & Related papers (2021-01-22T16:27:34Z) - How much progress have we made in neural network training? A New
Evaluation Protocol for Benchmarking Optimizers [86.36020260204302]
We propose a new benchmarking protocol to evaluate both end-to-end efficiency and data-addition training efficiency.
A human study is conducted to show that our evaluation protocol matches human tuning behavior better than the random search.
We then apply the proposed benchmarking framework to 7 optimizers and various tasks, including computer vision, natural language processing, reinforcement learning, and graph mining.
arXiv Detail & Related papers (2020-10-19T21:46:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.