Should I try multiple optimizers when fine-tuning pre-trained
Transformers for NLP tasks? Should I tune their hyperparameters?
- URL: http://arxiv.org/abs/2402.06948v1
- Date: Sat, 10 Feb 2024 13:26:14 GMT
- Title: Should I try multiple optimizers when fine-tuning pre-trained
Transformers for NLP tasks? Should I tune their hyperparameters?
- Authors: Nefeli Gkouti, Prodromos Malakasiotis, Stavros Toumpis, Ion
Androutsopoulos
- Abstract summary: Some variant of Stochastic Gradient Descent (SGD) is typically employed to train neural models, selected among numerous variants with minimal tuning.
Tuning just the learning rate is in most cases as good as tuning all the hyperparameters.
We recommend picking any of the best-behaved adaptive optimizers (e.g., Adam) and tuning only its learning rate.
- Score: 14.349943044268471
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: NLP research has explored different neural model architectures and sizes,
datasets, training objectives, and transfer learning techniques. However, the
choice of optimizer during training has not been explored as extensively.
Typically, some variant of Stochastic Gradient Descent (SGD) is employed,
selected among numerous variants, using unclear criteria, often with minimal or
no tuning of the optimizer's hyperparameters. Experimenting with five GLUE
datasets, two models (DistilBERT and DistilRoBERTa), and seven popular
optimizers (SGD, SGD with Momentum, Adam, AdaMax, Nadam, AdamW, and AdaBound),
we find that when the hyperparameters of the optimizers are tuned, there is no
substantial difference in test performance across the five more elaborate
(adaptive) optimizers, despite differences in training loss. Furthermore,
tuning just the learning rate is in most cases as good as tuning all the
hyperparameters. Hence, we recommend picking any of the best-behaved adaptive
optimizers (e.g., Adam) and tuning only its learning rate. When no
hyperparameter can be tuned, SGD with Momentum is the best choice.
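As a rough illustration of this recommendation, the sketch below runs a learning-rate-only grid search for Adam and falls back to SGD with Momentum when no tuning budget is available. The tiny linear model, random data, and grid values are assumptions standing in for the paper's actual DistilBERT/GLUE setup, not a reproduction of it.

```python
# Minimal sketch of the recommended protocol: fix the optimizer to Adam and
# tune only its learning rate on a dev set. The linear model and random data
# below are stand-ins for a pre-trained Transformer fine-tuned on a GLUE task.
import torch
from torch import nn

torch.manual_seed(0)
X_train, y_train = torch.randn(256, 32), torch.randint(0, 2, (256,))
X_dev, y_dev = torch.randn(64, 32), torch.randint(0, 2, (64,))

def train_and_eval(make_optimizer, epochs=20):
    model = nn.Linear(32, 2)            # stand-in for the pre-trained model
    opt = make_optimizer(model.parameters())
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(X_train), y_train).backward()
        opt.step()
    with torch.no_grad():
        return (model(X_dev).argmax(dim=1) == y_dev).float().mean().item()

# Tune only Adam's learning rate; all other hyperparameters stay at defaults.
# The grid values below are an assumed example, not the paper's search space.
grid = [1e-5, 3e-5, 1e-4, 3e-4, 1e-3]
scores = {lr: train_and_eval(lambda p, lr=lr: torch.optim.Adam(p, lr=lr)) for lr in grid}
best_lr = max(scores, key=scores.get)
print(f"best Adam learning rate on dev: {best_lr} (acc={scores[best_lr]:.3f})")

# With no tuning budget at all, the paper suggests SGD with Momentum instead.
untuned_acc = train_and_eval(lambda p: torch.optim.SGD(p, lr=1e-2, momentum=0.9))
print(f"untuned SGD+Momentum dev accuracy: {untuned_acc:.3f}")
```

The model is re-initialized for every learning rate so the runs are directly comparable on the dev set.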
Related papers
- Adaptive Preference Scaling for Reinforcement Learning with Human Feedback [103.36048042664768]
Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values.
We propose a novel adaptive preference loss, underpinned by distributionally robust optimization (DRO).
Our method is versatile and can be readily adapted to various preference optimization frameworks.
arXiv Detail & Related papers (2024-06-04T20:33:22Z) - Dynamic Adapter Meets Prompt Tuning: Parameter-Efficient Transfer Learning for Point Cloud Analysis [51.14136878142034]
Point cloud analysis has achieved outstanding performance by transferring point cloud pre-trained models.
Existing methods for model adaptation usually update all model parameters, which is inefficient as it incurs high computational costs.
In this paper, we aim to study parameter-efficient transfer learning for point cloud analysis with an ideal trade-off between task performance and parameter efficiency.
arXiv Detail & Related papers (2024-03-03T08:25:04Z) - MADA: Meta-Adaptive Optimizers through hyper-gradient Descent [73.1383658672682]
We introduce Meta-Adaptive Optimizers (MADA), a unified framework that can generalize several known optimizers and dynamically learn the most suitable one during training (a generic hyper-gradient sketch appears after this list).
We empirically compare MADA to other popular optimizers on vision and language tasks, and find that MADA consistently outperforms Adam and other popular optimizers.
We also propose AVGrad, a modification of AMSGrad that replaces the maximum operator with averaging, which is more suitable for hyper-gradient optimization.
arXiv Detail & Related papers (2024-01-17T00:16:46Z) - AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z) - VeLO: Training Versatile Learned Optimizers by Scaling Up [67.90237498659397]
We leverage the same scaling approach behind the success of deep learning to learn versatile optimizers.
We train an optimizer for deep learning which is itself a small neural network that ingests gradients and outputs parameter updates.
We open source our learned optimizers, meta-training code, the associated train and test data, and an extensive benchmark suite with baselines at velo-code.io.
arXiv Detail & Related papers (2022-11-17T18:39:07Z) - The Role of Adaptive Optimizers for Honest Private Hyperparameter
Selection [12.38071940409141]
We show that standard composition tools outperform more advanced techniques in many settings.
We draw upon the limiting behaviour of Adam in the DP setting to design a new and more efficient tool.
arXiv Detail & Related papers (2021-11-09T01:56:56Z) - Gravity Optimizer: a Kinematic Approach on Optimization in Deep Learning [0.0]
We introduce Gravity, another algorithm for gradient-based optimization.
In this paper, we explain how our novel idea changes parameters to reduce the deep learning model's loss.
Also, we propose an alternative to moving average.
arXiv Detail & Related papers (2021-01-22T16:27:34Z) - How much progress have we made in neural network training? A New
Evaluation Protocol for Benchmarking Optimizers [86.36020260204302]
We propose a new benchmarking protocol to evaluate both end-to-end efficiency and data-addition training efficiency.
A human study is conducted to show that our evaluation protocol matches human tuning behavior better than the random search.
We then apply the proposed benchmarking framework to 7 optimizers and various tasks, including computer vision, natural language processing, reinforcement learning, and graph mining.
arXiv Detail & Related papers (2020-10-19T21:46:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.