LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits
- URL: http://arxiv.org/abs/2410.01735v1
- Date: Wed, 2 Oct 2024 16:46:38 GMT
- Title: LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits
- Authors: Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal
- Abstract summary: We introduce LASeR (Learning to Adaptively Select Rewards), which iteratively trains LLMs using multiple Reward Models (RMs).
Our results demonstrate that LASeR can boost iterative LLM optimization by optimizing for multiple RMs.
We also verify the presence of conflicting preferences from multiple RMs that can be mitigated using LASeR.
- Score: 56.93583799109029
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reward Models (RMs) play a crucial role in aligning LLMs with human preferences, enhancing their performance by ranking outputs during inference or iterative training. However, the degree to which an RM generalizes to new tasks is often not known a priori (e.g., some RMs may excel at scoring creative writing vs. math reasoning). Therefore, using only one fixed RM while training LLMs can be suboptimal. Moreover, optimizing LLMs with multiple RMs simultaneously can be prohibitively computationally intensive and challenging due to conflicting signals from different RMs, potentially degrading performance. To address these challenges, we introduce LASeR (Learning to Adaptively Select Rewards), which iteratively trains LLMs using multiple RMs, selecting and utilizing the best-suited RM for each instance to rank outputs and generate preference data, framed as a multi-armed bandit problem. Our results on commonsense and math reasoning tasks demonstrate that LASeR can boost iterative LLM optimization by optimizing for multiple RMs, improving the absolute average accuracy of Llama-3-8B over three datasets by 2.67% over training with ensemble RM scores while also showing superior training efficiency (e.g., a 2x speedup). Moreover, on WildChat, a benchmark of instruction-following prompts, we find that training Llama-3-8B with LASeR leads to a 71.45% AlpacaEval win rate over sequentially optimizing multiple RMs. Extending to long-context generation tasks, we find that, with best-of-n sampling on Llama-3-8B, LASeR achieves average improvements of 2.64 F1 and 2.42 F1 on single- and multi-document QA over random RM selection. LASeR is robust to noisy rewards and generalizes to multiple settings. Finally, LASeR's RM selection changes depending on the underlying task or instance, and we verify the presence of conflicting preferences from multiple RMs that can be mitigated using LASeR.
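The abstract frames per-instance RM selection as a multi-armed bandit problem. As a rough illustration of that framing (a minimal sketch, not the authors' implementation), the Python example below treats each RM as a bandit arm, uses a UCB1 rule to choose one RM per step, ranks candidate generations with the chosen RM to form a preference pair, and updates the bandit with a stand-in downstream reward. The RM names, toy scoring functions, and the random reward signal are all illustrative assumptions.

```python
# Illustrative sketch of bandit-based reward-model (RM) selection.
# Assumptions (not from the paper): UCB1 as the selection rule, toy RM
# scoring functions, and a random number standing in for the downstream
# reward observed after a training step.
import math
import random


class RMBandit:
    """UCB1 bandit over a set of reward models."""

    def __init__(self, rm_names):
        self.rm_names = list(rm_names)
        self.counts = {n: 0 for n in self.rm_names}    # times each RM was chosen
        self.values = {n: 0.0 for n in self.rm_names}  # running mean of observed reward

    def select(self):
        # Play each arm once before applying the UCB rule.
        for name in self.rm_names:
            if self.counts[name] == 0:
                return name
        total = sum(self.counts.values())
        # UCB1: empirical mean plus an exploration bonus.
        return max(
            self.rm_names,
            key=lambda n: self.values[n] + math.sqrt(2 * math.log(total) / self.counts[n]),
        )

    def update(self, name, reward):
        self.counts[name] += 1
        self.values[name] += (reward - self.values[name]) / self.counts[name]


def rank_with_rm(rm_score_fn, candidates):
    """Rank candidate generations with the chosen RM (best first)."""
    return sorted(candidates, key=rm_score_fn, reverse=True)


if __name__ == "__main__":
    # Hypothetical RMs: here each is just a scoring function over text.
    rms = {
        "rm_reasoning": lambda text: len(text.split()),       # toy proxy score
        "rm_helpfulness": lambda text: -abs(len(text) - 40),   # toy proxy score
    }
    bandit = RMBandit(rms.keys())

    for step in range(10):
        candidates = [f"candidate answer {i} " * random.randint(1, 5) for i in range(4)]
        chosen_rm = bandit.select()
        ranked = rank_with_rm(rms[chosen_rm], candidates)
        preferred, rejected = ranked[0], ranked[-1]
        # In LASeR the (preferred, rejected) pair would feed an iterative
        # preference-optimization update of the LLM; the reward observed for
        # the bandit is faked here with a random number.
        downstream_reward = random.random()
        bandit.update(chosen_rm, downstream_reward)
        print(step, chosen_rm, round(downstream_reward, 2))
```

In this sketch the bandit's feedback signal is arbitrary; in the paper's setting it would presumably reflect how much the chosen RM's preference data improves the LLM, which is what lets the selection adapt per task or instance.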
Related papers
- MALoRA: Mixture of Asymmetric Low-Rank Adaptation for Enhanced Multi-Task Learning [29.957620178740186]
In multi-task scenarios, challenges such as training imbalance and the seesaw effect frequently emerge.
We propose Mixture of Asymmetric Low-Rank Adaptation (MALoRA) as a flexible fine-tuning framework.
MALoRA reduces the number of trainable parameters by 30% to 48%, increases training speed by 1.2x, and matches the computational efficiency of single-task LoRA models.
arXiv Detail & Related papers (2024-10-30T07:53:52Z) - RRM: Robust Reward Model Training Mitigates Reward Hacking [51.12341734942797]
Reward models (RMs) play a pivotal role in aligning large language models with human preferences.
We introduce a causal framework that learns preferences independent of these artifacts.
Experiments show that our approach successfully filters out undesirable artifacts, yielding a more robust reward model.
arXiv Detail & Related papers (2024-09-20T01:46:07Z) - Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts [23.27203570485055]
Reinforcement learning from human feedback (RLHF) has emerged as the primary method for aligning large language models with human preferences.
We propose a two-stage approach to train a reward model (RM) with multi-dimensional absolute-rating data.
We efficiently trained an ArmoRM with Llama-3 8B and a gating network consisting of a shallow MLP on top of the ArmoRM.
arXiv Detail & Related papers (2024-06-18T17:58:28Z) - DMoERM: Recipes of Mixture-of-Experts for Effective Reward Modeling [0.0]
We introduce the idea of Mixture-of-Experts (MoE) into the field of reward model (RM) training.
We decompose the specific task into multiple capability dimensions and individually fine-tune a LoRA expert on each one.
Our model attains superior consistency with human preference and outstrips advanced generative approaches.
arXiv Detail & Related papers (2024-03-02T12:31:22Z) - LoraRetriever: Input-Aware LoRA Retrieval and Composition for Mixed Tasks in the Wild [76.67343971195267]
Low-Rank Adaptation (LoRA) provides an efficient solution for fine-tuning large language models (LLMs).
LoraRetriever is a retrieve-then-compose framework that adaptively retrieves and composes multiple LoRAs according to the input prompts.
Experimental results indicate that LoraRetriever consistently outperforms the baselines.
arXiv Detail & Related papers (2024-02-15T15:02:46Z) - Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves [57.974103113675795]
We present a method named 'Rephrase and Respond' (RaR), which allows Large Language Models to rephrase and expand questions posed by humans.
RaR serves as a simple yet effective prompting method for improving performance.
We show that RaR is complementary to the popular Chain-of-Thought (CoT) methods, both theoretically and empirically.
arXiv Detail & Related papers (2023-11-07T18:43:34Z) - Confronting Reward Model Overoptimization with Constrained RLHF [114.71591361764547]
We show that the correlation between component RMs has a significant effect on the locations of the overoptimization points.
Our method addresses the problem of weighting component RMs by learning dynamic weights, naturally expressed by Lagrange multipliers.
arXiv Detail & Related papers (2023-10-06T16:59:17Z) - The Trickle-down Impact of Reward (In-)consistency on RLHF [71.37987812944971]
We show that reward inconsistency exhibits a trickle-down effect on the downstream Reinforcement Learning from Human Feedback process.
We propose Contrast Instructions -- a benchmarking strategy for the consistency of an RM.
We show that RLHF models trained with a more consistent RM yield more useful responses.
arXiv Detail & Related papers (2023-09-28T04:05:13Z)