ARGS: Alignment as Reward-Guided Search
- URL: http://arxiv.org/abs/2402.01694v1
- Date: Tue, 23 Jan 2024 23:42:41 GMT
- Title: ARGS: Alignment as Reward-Guided Search
- Authors: Maxim Khanov, Jirayu Burapacheep, Yixuan Li
- Abstract summary: We introduce ARGS, Alignment as Reward-Guided Search, a novel framework that integrates alignment into the decoding process.
By adjusting the model's probabilistic predictions using a reward signal, ARGS generates texts with semantic diversity while being aligned with human preferences.
We believe that our framework, emphasizing decoding-time alignment, paves the way for more responsive language models in the future.
- Score: 17.420727709895736
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Aligning large language models with human objectives is paramount, yet common
approaches including RLHF suffer from unstable and resource-intensive training.
In response to this challenge, we introduce ARGS, Alignment as Reward-Guided
Search, a novel framework that integrates alignment into the decoding process,
eliminating the need for expensive RL training. By adjusting the model's
probabilistic predictions using a reward signal, ARGS generates texts with
semantic diversity while being aligned with human preferences, offering a
promising and flexible solution for aligning language models. Notably, ARGS
demonstrates consistent enhancements in average reward compared to baselines
across diverse alignment tasks and various model dimensions. For example, under
the same greedy-based decoding strategy, our method improves the average reward
by 19.56% relative to the baseline and secures a preference or tie score of
64.33% in GPT-4 evaluation. We believe that our framework, emphasizing
decoding-time alignment, paves the way for more responsive language models in
the future. Code is publicly available at:
https://github.com/deeplearning-wisc/args.
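The decoding-time mechanism summarized above can be pictured with a short sketch: at each generation step the language model's top-k candidate tokens are rescored with a reward model, and the candidate with the best combined score is kept. The snippet below is a minimal illustration only; `lm_logprobs`, `reward_fn`, and the weighting term are assumed placeholder interfaces, not the authors' implementation.

```python
import torch

def reward_guided_greedy_decode(lm_logprobs, reward_fn, prompt_ids,
                                max_new_tokens=64, top_k=10, reward_weight=1.0):
    """Greedy reward-guided decoding sketch: rescore the LM's top-k candidate
    tokens with a reward model at every step and keep the best-scoring token."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logprobs = lm_logprobs(ids)                 # 1-D tensor over the vocabulary
        cand_lps, cand_ids = torch.topk(logprobs, top_k)
        best_tok, best_score = None, float("-inf")
        for lp, tok in zip(cand_lps.tolist(), cand_ids.tolist()):
            r = reward_fn(ids + [tok])              # scalar reward for the partial continuation
            score = lp + reward_weight * r          # combined decoding score
            if score > best_score:
                best_tok, best_score = tok, score
        ids.append(best_tok)
    return ids
```

A sampling variant would draw from a softmax over the combined scores instead of taking the argmax.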
Related papers
- Optimal Design for Reward Modeling in RLHF [83.3614658277817]
We formalize reward model training in Reinforcement Learning from Human Feedback.
We frame the selection of an effective dataset as a simple regret minimization task.
We derive bounds on the simple regret under appropriate assumptions.
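As a rough illustration of the simple-regret framing mentioned above (notation assumed here, not taken from the paper), a candidate preference dataset can be scored by how far the policy induced by the reward model trained on it falls short of the best achievable policy:

```latex
% Illustrative only: \hat{\pi}_{\mathcal{D}} is the policy obtained from the
% reward model trained on dataset \mathcal{D}, and J is expected true reward.
\mathrm{SimpleRegret}(\mathcal{D}) \;=\; \max_{\pi} J(\pi) \;-\; J\!\big(\hat{\pi}_{\mathcal{D}}\big)
```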
arXiv Detail & Related papers (2024-10-22T14:36:44Z)
- Multi-turn Reinforcement Learning from Preference Human Feedback [41.327438095745315]
Reinforcement Learning from Human Feedback (RLHF) has become the standard approach for aligning Large Language Models with human preferences.
Existing methods emulate preferences at the single decision (turn) level.
We develop novel methods for Reinforcement Learning from preference feedback between two full multi-turn conversations.
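One way to picture the training data in this setting: each preference example compares two complete multi-turn conversations rather than two single responses. The layout below is purely hypothetical, not the paper's format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    role: str   # "user" or "assistant"
    text: str

@dataclass
class ConversationPreference:
    """A single preference label over two full conversations (hypothetical layout)."""
    preferred: List[Turn]   # the conversation the annotator preferred
    rejected: List[Turn]    # the alternative conversation
```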
arXiv Detail & Related papers (2024-05-23T14:53:54Z)
- Fine-Tuning Language Models with Reward Learning on Policy [68.70065254564642]
Reinforcement learning from human feedback (RLHF) has emerged as an effective approach to aligning large language models (LLMs) to human preferences.
Despite its popularity, a (fixed) reward model may become inaccurate off-distribution, since policy optimization continually shifts the LLM's data distribution.
We propose reward learning on policy (RLP), an unsupervised framework that refines a reward model using policy samples to keep it on-distribution.
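A very loose sketch of the keep-it-on-distribution idea follows; `update_reward_model` and `optimize_policy` are placeholders for whatever refinement and RL steps are actually used, so this is an assumed training-loop shape rather than RLP itself.

```python
def rlhf_with_reward_refresh(policy, reward_model, prompts,
                             update_reward_model, optimize_policy, rounds=3):
    """Alternate policy optimization with reward-model refinement on fresh
    policy samples, so the reward model tracks the shifting policy distribution."""
    for _ in range(rounds):
        samples = [policy.generate(p) for p in prompts]   # current policy outputs
        reward_model = update_reward_model(reward_model, samples)
        policy = optimize_policy(policy, reward_model, prompts)
    return policy, reward_model
```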
arXiv Detail & Related papers (2024-03-28T10:02:10Z)
- ALaRM: Align Language Models via Hierarchical Rewards Modeling [41.79125107279527]
We introduce ALaRM, the first framework modeling hierarchical rewards in reinforcement learning from human feedback.
The framework addresses the limitations of current alignment approaches by integrating holistic rewards with aspect-specific rewards.
We validate our approach through applications in long-form question answering and machine translation tasks.
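A toy illustration of combining a holistic reward with aspect-specific rewards is shown below; the linear combination and the weights are assumptions for illustration, not ALaRM's actual hierarchical scheme.

```python
def hierarchical_reward(holistic_reward, aspect_rewards, aspect_weights,
                        holistic_weight=1.0):
    """Fold one sequence-level (holistic) reward and several aspect-specific
    rewards into a single scalar training signal."""
    aspect_term = sum(w * r for w, r in zip(aspect_weights, aspect_rewards))
    return holistic_weight * holistic_reward + aspect_term

# Example: a long-form QA reward built from factuality and readability aspects.
total = hierarchical_reward(0.7, aspect_rewards=[0.9, 0.4], aspect_weights=[0.5, 0.2])
```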
arXiv Detail & Related papers (2024-03-11T14:28:40Z)
- Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble [67.4269821365504]
Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values.
However, RLHF relies on a reward model that is trained with a limited amount of human preference data.
We contribute a reward ensemble method that allows the reward model to make more accurate predictions.
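One common way to use a reward ensemble is to aggregate the member scores conservatively; the snippet below shows mean and mean-minus-standard-deviation aggregation as illustrative choices, not the paper's specific ensemble construction.

```python
import statistics

def ensemble_reward(reward_fns, text, conservative=True, penalty=1.0):
    """Score `text` with each reward model and aggregate; subtracting the
    standard deviation penalizes outputs the ensemble disagrees on."""
    scores = [rm(text) for rm in reward_fns]
    mean = statistics.fmean(scores)
    if conservative and len(scores) > 1:
        return mean - penalty * statistics.stdev(scores)
    return mean
```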
arXiv Detail & Related papers (2024-01-30T00:17:37Z)
- Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback [70.32795295142648]
Linear alignment is a novel algorithm that aligns language models with human preferences in a single inference step.
Experiments on both general and personalized preference datasets demonstrate that linear alignment significantly enhances the performance and efficiency of LLM alignment.
arXiv Detail & Related papers (2024-01-21T10:46:23Z)
- The Wisdom of Hindsight Makes Language Models Better Instruction Followers [84.9120606803906]
Reinforcement learning has seen wide success in finetuning large language models to better align with instructions via human feedback.
In this paper, we consider an alternative approach: converting feedback to instruction by relabeling the original one and training the model for better alignment in a supervised manner.
We propose Hindsight Instruction Relabeling (HIR), a novel algorithm for aligning language models with instructions.
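A rough sketch of the hindsight relabeling idea: rather than scoring an output against the original instruction, relabel the instruction to one the output actually satisfies and fine-tune on that pair in a supervised way. The helper names here (`describe_what_it_did`, `supervised_update`) are placeholders, not HIR's actual components.

```python
def hindsight_relabel_round(model, instructions, describe_what_it_did,
                            supervised_update):
    """One illustrative round of hindsight instruction relabeling."""
    relabeled_pairs = []
    for instr in instructions:
        output = model.generate(instr)
        # In hindsight, write an instruction that the output *does* follow,
        # turning an imperfect rollout into a valid supervised example.
        new_instr = describe_what_it_did(instr, output)
        relabeled_pairs.append((new_instr, output))
    return supervised_update(model, relabeled_pairs)
```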
arXiv Detail & Related papers (2023-02-10T12:16:38Z)
- Text Generation with Efficient (Soft) Q-Learning [91.47743595382758]
Reinforcement learning (RL) offers a more flexible solution by allowing users to plug in arbitrary task metrics as rewards.
We introduce a new RL formulation for text generation from the soft Q-learning perspective.
We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation.
arXiv Detail & Related papers (2021-06-14T18:48:40Z)
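For orientation, the soft Q-learning view treats each generated token as an action, and token-level Q-values satisfy an entropy-regularized (soft) Bellman equation of the generic form below (standard soft Q-learning notation, not copied from the paper):

```latex
% Soft Bellman backup with temperature \tau; s_t is the partial sequence,
% a_t the next token, \mathcal{V} the vocabulary, and r_t the (possibly
% sparse, sequence-level) reward signal.
Q^{*}(s_t, a_t) \;=\; r_t \;+\; \gamma\, \tau \log \sum_{a' \in \mathcal{V}} \exp\!\big( Q^{*}(s_{t+1}, a') / \tau \big)
```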
This list is automatically generated from the titles and abstracts of the papers on this site.