Multi-armed bandits for resource efficient, online optimization of
language model pre-training: the use case of dynamic masking
- URL: http://arxiv.org/abs/2203.13151v2
- Date: Tue, 30 May 2023 11:23:41 GMT
- Title: Multi-armed bandits for resource efficient, online optimization of
language model pre-training: the use case of dynamic masking
- Authors: Iñigo Urteaga, Moulay-Zaïdane Draïdia, Tomer Lancewicki and Shahram Khadivi
- Abstract summary: We evaluate a framework for resource-efficient pre-training of Transformer-based language models (TLMs).
We propose a multi-armed bandit framework for the sequential selection of TLM pre-training hyperparameters.
GP-TS provides an interactive framework for efficient and optimized TLM pre-training.
- Score: 7.3618738570222915
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We design and evaluate a Bayesian optimization framework for resource-efficient pre-training of Transformer-based language models (TLMs). TLM
pre-training requires high computational resources and introduces many
unresolved design choices, such as selecting its pre-training hyperparameters.
We propose a multi-armed bandit framework for the sequential selection of TLM
pre-training hyperparameters, aimed at optimizing language model performance in a resource-efficient manner. We design a Thompson sampling algorithm, with a
surrogate Gaussian process reward model of the Masked Language Model (MLM)
pre-training objective, for its sequential minimization. Instead of MLM
pre-training with fixed masking probabilities, the proposed Gaussian
process-based Thompson sampling (GP-TS) accelerates pre-training by
sequentially selecting masking hyperparameters that improve performance. We
empirically demonstrate how GP-TS pre-trains language models efficiently, i.e.,
it achieves lower MLM loss in fewer epochs, across a variety of settings. In
addition, GP-TS pre-trained TLMs attain competitive downstream performance,
while avoiding expensive hyperparameter grid search. GP-TS provides an
interactive framework for efficient and optimized TLM pre-training that, by
circumventing costly hyperparameter selection, enables substantial
computational savings.
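For intuition, below is a minimal sketch of the GP-TS loop described in the abstract: fit a Gaussian process surrogate to the MLM losses observed so far, draw one posterior sample of the loss surface, and run the next pre-training interval with the masking probability that minimizes that draw. The `pretrain_interval` callback, the candidate grid, and the kernel choice are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of GP-based Thompson sampling (GP-TS) over the masking
# probability, assuming an illustrative pretrain_interval(p) callback that
# runs one pre-training interval with masking probability p and returns
# the resulting validation MLM loss (lower is better).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def gp_thompson_sampling(pretrain_interval, n_rounds=20, seed=0):
    rng = np.random.default_rng(seed)
    candidates = np.linspace(0.05, 0.40, 36).reshape(-1, 1)  # masking probs
    X, y = [], []
    for _ in range(n_rounds):
        if not X:
            # No observations yet: pick an arm uniformly at random.
            p = candidates[rng.integers(len(candidates))]
        else:
            gp = GaussianProcessRegressor(
                kernel=RBF(length_scale=0.1) + WhiteKernel(noise_level=1e-3),
                normalize_y=True,
            )
            gp.fit(np.array(X), np.array(y))
            # Thompson sampling: draw one posterior sample of the loss
            # surface and pick the arm that minimizes it.
            sample = gp.sample_y(candidates,
                                 random_state=int(rng.integers(1 << 31)))
            p = candidates[np.argmin(sample)]
        loss = pretrain_interval(float(p[0]))  # observe noisy MLM loss
        X.append(p)
        y.append(loss)
    return X, y
```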
Related papers
- Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization [65.64108848398696]
We introduce a preference optimization process to enhance the multimodal reasoning capabilities of MLLMs.
We develop a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance.
Our model, InternVL2-8B-MPO, achieves an accuracy of 67.0 on MathVista, outperforming InternVL2-8B by 8.7 points and achieving performance comparable to the 10x larger InternVL2-76B.
arXiv Detail & Related papers (2024-11-15T18:59:27Z)
- In-the-loop Hyper-Parameter Optimization for LLM-Based Automated Design of Heuristics [0.020482269513546456]
Large Language Models (LLMs) have shown great potential in automatically generating and optimizing (meta)heuristics.
This paper presents a novel hybrid approach, LLaMEA-HPO, which integrates the open-source LLaMEA framework with a Hyper-Parameter Optimization (HPO) procedure in the loop.
arXiv Detail & Related papers (2024-10-07T14:04:31Z)
- LoRTA: Low Rank Tensor Adaptation of Large Language Models [70.32218116940393]
Low Rank Adaptation (LoRA) is a popular Parameter-Efficient Fine-Tuning (PEFT) method that effectively adapts large pre-trained models for downstream tasks.
We propose a novel approach that employs a low rank tensor parametrization for model updates.
Our method is both efficient and effective for fine-tuning large language models, achieving a substantial reduction in the number of parameters while maintaining comparable performance.
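As a rough illustration of the low-rank idea, here is a toy CP-style tensor parametrization of per-layer weight updates; the shapes, names, and factorization below are assumptions for exposition, not the paper's exact method.

```python
# Toy sketch: a shared low-rank tensor factorization of per-layer updates.
# Two projection factors are shared across layers, plus a per-layer mixing
# vector, so trainable parameters scale with (d_out + d_in + n_layers) * rank.
import numpy as np

d_out, d_in, n_layers, rank = 64, 64, 12, 4
rng = np.random.default_rng(0)

A = rng.normal(size=(d_out, rank)) * 0.01     # shared output factor
B = rng.normal(size=(d_in, rank)) * 0.01      # shared input factor
C = rng.normal(size=(n_layers, rank)) * 0.01  # per-layer mixing weights

def delta_w(layer):
    # Update for one layer: sum_r C[layer, r] * outer(A[:, r], B[:, r])
    return (A * C[layer]) @ B.T

print(delta_w(0).shape)  # (64, 64), built from far fewer trainable params
```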
arXiv Detail & Related papers (2024-10-05T06:59:50Z)
- CodePMP: Scalable Preference Model Pretraining for Large Language Model Reasoning [23.21367081440852]
Large language models (LLMs) have made significant progress in natural language understanding and generation, driven by scalable pretraining and advanced finetuning.
We introduce CodePMP, a scalable preference model pretraining (PMP) pipeline that utilizes a large corpus of synthesized code-preference pairs.
CodePMP improves RM finetuning efficiency by pretraining preference models on large-scale synthesized code-preference pairs.
arXiv Detail & Related papers (2024-10-03T05:51:26Z)
- Optimization Hyper-parameter Laws for Large Language Models [56.322914260197734]
We present Opt-Laws, a framework that captures the relationship between hyperparameters and training outcomes.
Our validation across diverse model sizes and data scales demonstrates Opt-Laws' ability to accurately predict training loss.
This approach significantly reduces computational costs while enhancing overall model performance.
arXiv Detail & Related papers (2024-09-07T09:37:19Z)
- Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models.
The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models.
Our experiments demonstrate that LLMs finetuned with MRPO generalize better across diverse preference data, regardless of data scarcity or abundance.
arXiv Detail & Related papers (2024-05-26T00:29:04Z)
- Value Augmented Sampling for Language Model Alignment and Personalization [39.070662999014836]
We present a new framework for reward optimization, Value Augmented Sampling (VAS).
VAS solves for the optimal reward-maximizing policy without co-training the policy and the value function.
Our algorithm unlocks the new capability of composing several rewards and controlling the extent of each one at deployment time.
arXiv Detail & Related papers (2024-05-10T17:59:04Z)
- Hyperparameter Optimization for Large Language Model Instruction-Tuning [6.743825167463901]
We study the whole pipeline of performing fine-tuning and validation on a pre-trained LLM as a black box.
We efficiently explore the space of hyperparameters with the NOMAD algorithm, achieving a boost in performance and human alignment of the tuned model.
arXiv Detail & Related papers (2023-12-01T22:03:12Z)
- PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models [52.09865918265002]
We propose a novel "quantize before fine-tuning" framework, PreQuant.
PreQuant is compatible with various quantization strategies, with outlier-aware fine-tuning incorporated to correct the induced quantization error.
We demonstrate the effectiveness of PreQuant on the GLUE benchmark using BERT, RoBERTa, and T5.
arXiv Detail & Related papers (2023-05-30T08:41:33Z)
- Multi-level Training and Bayesian Optimization for Economical Hyperparameter Optimization [12.92634461859467]
In this paper, we develop an effective approach to reducing the total training time required for hyperparameter optimization.
We propose a truncated additive Gaussian process model to calibrate approximate performance measurements generated by light training.
Based on the model, a sequential model-based algorithm is developed to generate the performance profile of the configuration space as well as find optimal ones.
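A hedged sketch of the general recipe follows, under the assumption that light training gives cheap but biased scores that a GP fit on the residual against a few full-training runs can calibrate; the function names and the one-shot search loop are illustrative, not the paper's algorithm.

```python
# Sketch: calibrate cheap "light training" scores with a GP on the residual
# against a handful of full-training runs, then pick the best calibrated
# configuration. `light_eval`/`full_eval` are assumed callbacks returning a
# validation loss; `configs` is an (n, d) array of hyperparameter vectors.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def calibrated_search(light_eval, full_eval, configs, n_full=5, seed=0):
    rng = np.random.default_rng(seed)
    cheap = np.array([light_eval(c) for c in configs])  # cheap, biased scores
    probe = rng.choice(len(configs), size=n_full, replace=False)
    # Fit the gap between full and light training at a few probed configs.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(configs[probe], [full_eval(configs[i]) - cheap[i] for i in probe])
    calibrated = cheap + gp.predict(configs)            # corrected scores
    return configs[int(np.argmin(calibrated))]
```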
arXiv Detail & Related papers (2020-07-20T09:03:02Z)
- Semi-Autoregressive Training Improves Mask-Predict Decoding [119.8412758943192]
We introduce a new training method for conditional masked language models, SMART, which mimics the semi-autoregressive behavior of mask-predict.
Models trained with SMART produce higher-quality translations when using mask-predict decoding, effectively closing the remaining performance gap with fully autoregressive models.
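For context, here is a toy sketch of the mask-predict decoding loop that SMART's training targets: start from a fully masked sequence, then repeatedly re-mask the lowest-confidence tokens and re-predict them. The `model` callable and the linear re-masking schedule are illustrative assumptions.

```python
# Toy mask-predict decoding. `model(tokens)` is an assumed stand-in for a
# conditional masked LM returning per-position argmax predictions and their
# probabilities as two arrays of shape (length,).
import numpy as np

def mask_predict(model, length, mask_id, n_iters=10):
    tokens = np.full(length, mask_id)
    probs = np.zeros(length)
    for t in range(n_iters):
        preds, confs = model(tokens)
        masked = tokens == mask_id
        tokens[masked] = preds[masked]   # fill every masked slot
        probs[masked] = confs[masked]
        # Linearly decaying number of positions to re-mask each iteration.
        n_mask = int(length * (n_iters - 1 - t) / n_iters)
        if n_mask == 0:
            break
        worst = np.argsort(probs)[:n_mask]  # least confident positions
        tokens[worst] = mask_id
    return tokens
```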
arXiv Detail & Related papers (2020-01-23T19:56:35Z)