Related papers: Sample Efficient Preference Alignment in LLMs via Active Exploration

Sample Efficient Preference Alignment in LLMs via Active Exploration

URL: http://arxiv.org/abs/2312.00267v3
Date: Thu, 20 Mar 2025 14:23:17 GMT
Title: Sample Efficient Preference Alignment in LLMs via Active Exploration
Authors: Viraj Mehta, Syrine Belakaria, Vikramjeet Das, Ojash Neopane, Yijia Dai, Ilija Bogunovic, Barbara Engelhardt, Stefano Ermon, Jeff Schneider, Willie Neiswanger,
Abstract summary: We take advantage of the fact that one can often choose contexts at which to obtain human feedback to most efficiently identify a good policy.<n>We propose an active exploration algorithm to efficiently select the data and provide theoretical proof that it has a worst-case regret bound.<n>Our method outperforms the baselines with limited samples of human preferences on several language models and four real-world datasets.
Score: 63.84454768573154
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Preference-based feedback is important for many applications in machine learning where evaluation of a reward function is not feasible. Notable recent examples arise in preference alignment for large language models, including in reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO). For many applications of preference alignment, the cost of acquiring human feedback can be substantial. In this work, we take advantage of the fact that one can often choose contexts at which to obtain human feedback to most efficiently identify a good policy, and formalize the setting as an active contextual dueling bandit problem. We propose an active exploration algorithm to efficiently select the data and provide theoretical proof that it has a polynomial worst-case regret bound. We extend the setting and methodology for practical use in preference alignment of large language models. We provide two extensions, an online and an offline approach. Our method outperforms the baselines with limited samples of human preferences on several language models and four real-world datasets including two new datasets that we contribute to the literature.

Related papers

Active Learning for Direct Preference Optimization [59.84525302418018]
Direct preference optimization (DPO) is a form of reinforcement learning from human feedback. We propose an active learning framework for DPO, which can be applied to collect human feedback online or to choose the most informative subset of already collected feedback offline.
arXiv Detail & Related papers (2025-03-03T00:36:31Z)
Few-shot Steerable Alignment: Adapting Rewards and LLM Policies with Neural Processes [50.544186914115045]
Large language models (LLMs) are increasingly embedded in everyday applications. Ensuring their alignment with the diverse preferences of individual users has become a critical challenge. We present a novel framework for few-shot steerable alignment.
arXiv Detail & Related papers (2024-12-18T16:14:59Z)
Active Preference-based Learning for Multi-dimensional Personalization [7.349038301460469]
Large language models (LLMs) have shown remarkable versatility across tasks, but aligning them with individual human preferences remains challenging. We propose an active preference learning framework that uses binary feedback to estimate user preferences across multiple objectives. We validate our approach through theoretical analysis and experiments on language generation tasks, demonstrating its feedback efficiency and effectiveness in personalizing model responses.
arXiv Detail & Related papers (2024-11-01T11:49:33Z)
Self-Evolved Reward Learning for LLMs [45.6910747154447]
Reinforcement Learning from Human Feedback (RLHF) is a crucial technique for aligning language models with human preferences. We propose Self-Evolved Reward Learning (SER), a novel approach where the RM generates additional training data to iteratively improve itself. Our results demonstrate that even with limited human-annotated data, learning from self-feedback can robustly enhance RM performance.
arXiv Detail & Related papers (2024-11-01T07:29:03Z)
Navigating Noisy Feedback: Enhancing Reinforcement Learning with Error-Prone Language Models [8.025808955214957]
This paper studies the advantages and limitations of reinforcement learning from large language model feedback. We propose a simple yet effective method for soliciting and applying feedback as a potential-based shaping function.
arXiv Detail & Related papers (2024-10-22T19:52:08Z)
Optimal Design for Reward Modeling in RLHF [83.3614658277817]
We formalize the reward training model in Reinforcement Learning from Human Feedback. We frame the selection of an effective dataset as a simple regret minimization task. We derive bounds on the simple regret under appropriate assumptions.
arXiv Detail & Related papers (2024-10-22T14:36:44Z)
MetaAlign: Align Large Language Models with Diverse Preferences during Inference Time [50.41806216615488]
Large Language Models (LLMs) acquire extensive knowledge and remarkable abilities from extensive text corpora. To make LLMs more usable, aligning them with human preferences is essential. We propose an effective method, textbf MetaAlign, which aims to help LLMs dynamically align with various explicit or implicit preferences specified at inference time.
arXiv Detail & Related papers (2024-10-18T05:31:13Z)
Reinforcement Learning from Human Feedback with Active Queries [59.855433734053555]
Current reinforcement learning approaches often require a large amount of human-labelled preference data.<n>We propose query-efficient RLHF methods inspired by the success of active learning.<n>Our experiments show that ADPO, while only making about half of queries for human preference, matches the performance of the state-of-the-art DPO method.
arXiv Detail & Related papers (2024-02-14T18:58:40Z)
MaxMin-RLHF: Alignment with Diverse Human Preferences [101.57443597426374]
Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data. We learn a mixture of preference distributions via an expectation-maximization algorithm to better represent diverse human preferences. Our algorithm achieves an average improvement of more than 16% in win-rates over conventional RLHF algorithms.
arXiv Detail & Related papers (2024-02-14T03:56:27Z)
Active Preference Learning for Large Language Models [12.093302163058436]
We develop an active learning strategy for DPO to make better use of preference labels. We propose a practical acquisition function for prompt/completion pairs based on the predictive entropy of the language model. We demonstrate how our approach improves both the rate of learning and final performance of fine-tuning on pairwise preference data.
arXiv Detail & Related papers (2024-02-12T23:09:00Z)
Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble [67.4269821365504]
Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values. However, RLHF relies on a reward model that is trained with a limited amount of human preference data. We contribute a reward ensemble method that allows the reward model to make more accurate predictions.
arXiv Detail & Related papers (2024-01-30T00:17:37Z)
ULMA: Unified Language Model Alignment with Human Demonstration and Point-wise Preference [16.73260713938154]
A typical alignment procedure consists of supervised fine-tuning and preference learning. We introduce Point-wise Direct Preference Optimization, a novel preference learning method designed to harness point-wise feedback effectively. Our work also uncovers a novel connection between supervised fine-tuning and point-wise preference learning, culminating in Unified Language Model Alignment.
arXiv Detail & Related papers (2023-12-05T07:52:12Z)
Constructive Large Language Models Alignment with Diverse Feedback [76.9578950893839]
We introduce Constructive and Diverse Feedback (CDF) as a novel method to enhance large language models alignment. We exploit critique feedback for easy problems, refinement feedback for medium problems, and preference feedback for hard problems. By training our model with this diversified feedback, we achieve enhanced alignment performance while using less training data.
arXiv Detail & Related papers (2023-10-10T09:20:14Z)
UltraFeedback: Boosting Language Models with Scaled AI Feedback [99.4633351133207]
We present textscUltraFeedback, a large-scale, high-quality, and diversified AI feedback dataset. Our work validates the effectiveness of scaled AI feedback data in constructing strong open-source chat language models.
arXiv Detail & Related papers (2023-10-02T17:40:01Z)
Provable Benefits of Policy Learning from Human Preferences in Contextual Bandit Problems [82.92678837778358]
preference-based methods have demonstrated substantial success in empirical applications such as InstructGPT. We show how human bias and uncertainty in feedback modelings can affect the theoretical guarantees of these approaches.
arXiv Detail & Related papers (2023-07-24T17:50:24Z)
Kernelized Offline Contextual Dueling Bandits [15.646879026749168]
In this work, we take advantage of the fact that often the agent can choose contexts at which to obtain human feedback. We give an upper-confidence-bound style algorithm for this setting and prove a regret bound.
arXiv Detail & Related papers (2023-07-21T01:17:31Z)
SLiC-HF: Sequence Likelihood Calibration with Human Feedback [35.74135968442311]
We show how the recently introduced Sequence Likelihood (SLiC) can also be used to effectively learn from human preferences. Experiments on the TL;DR summarization task show that SLiC-HF significantly improves supervised fine-tuning baselines.
arXiv Detail & Related papers (2023-05-17T17:57:10Z)
Chain of Hindsight Aligns Language Models with Feedback [62.68665658130472]
We propose a novel technique, Chain of Hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity. We convert all types of feedback into sequences of sentences, which are then used to fine-tune the model. By doing so, the model is trained to generate outputs based on feedback, while learning to identify and correct negative attributes or errors.
arXiv Detail & Related papers (2023-02-06T10:28:16Z)
Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore. We show how to achieve up to a 6x speed-up in inference speed while retaining comparable performance.
arXiv Detail & Related papers (2021-09-09T12:32:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.