Enhancing Preference-based Linear Bandits via Human Response Time
- URL: http://arxiv.org/abs/2409.05798v4
- Date: Thu, 02 Jan 2025 12:00:28 GMT
- Title: Enhancing Preference-based Linear Bandits via Human Response Time
- Authors: Shen Li, Yuyang Zhang, Zhaolin Ren, Claire Liang, Na Li, Julie A. Shah
- Abstract summary: Interactive preference learning systems infer human preferences by presenting queries as pairs of options and collecting binary choices. We propose a method that combines choices and response times to estimate human utility functions. We incorporate this estimator into preference-based linear bandits for fixed-budget best-arm identification.
- Score: 25.92686846689662
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Interactive preference learning systems infer human preferences by presenting queries as pairs of options and collecting binary choices. Although binary choices are simple and widely used, they provide limited information about preference strength. To address this, we leverage human response times, which are inversely related to preference strength, as an additional signal. We propose a computationally efficient method that combines choices and response times to estimate human utility functions, grounded in the EZ diffusion model from psychology. Theoretical and empirical analyses show that for queries with strong preferences, response times complement choices by providing extra information about preference strength, leading to significantly improved utility estimation. We incorporate this estimator into preference-based linear bandits for fixed-budget best-arm identification. Simulations on three real-world datasets demonstrate that using response times significantly accelerates preference learning compared to choice-only approaches. Additional materials, such as code, slides, and talk video, are available at https://shenlirobot.github.io/pages/NeurIPS24.html
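The abstract states the idea but not the estimator's closed form, so the following is a minimal, illustrative Python sketch of the underlying intuition rather than the authors' implementation: under a symmetric drift-diffusion (EZ-style) model with drift x^T theta and barrier a, the expected signed choice is tanh(a x^T theta) and the expected decision time is (a / x^T theta) tanh(a x^T theta), so their ratio equals x^T theta / a. That identity suggests estimating theta (up to the unknown scale 1/a) by regressing choice/response-time ratios on the query features. The function name, the ridge regularizer, and the crude synthetic simulator below are assumptions made purely for illustration.

```python
import numpy as np

def estimate_utility_from_choices_and_times(X, choices, times, ridge=1e-6):
    """Least-squares estimate of a linear utility parameter theta.

    X       : (n, d) array; row i is the feature difference x_i between the
              two options shown in query i.
    choices : (n,) array in {+1, -1}; +1 means the first option was chosen.
    times   : (n,) array of (decision) response times in seconds.

    Under a symmetric DDM with drift x_i^T theta and barrier a,
    E[choice] / E[time] = x_i^T theta / a, so regressing choices/times on X
    is a crude way to exploit that identity; it recovers theta up to 1/a.
    """
    y = choices / times                         # per-query signal with mean ~ x_i^T theta / a
    A = X.T @ X + ridge * np.eye(X.shape[1])    # regularized normal equations
    return np.linalg.solve(A, X.T @ y)

# Toy check on a crude synthetic DDM: choices follow the DDM choice
# probability and times are set to the DDM mean decision time plus jitter.
rng = np.random.default_rng(0)
theta_true, barrier, n = np.array([1.0, -0.5]), 2.0, 2000
X = rng.normal(size=(n, 2))
drift = X @ theta_true
p_first = 1.0 / (1.0 + np.exp(-2.0 * barrier * drift))      # P(first option chosen)
choices = np.where(rng.random(n) < p_first, 1.0, -1.0)
times = (barrier / np.maximum(np.abs(drift), 1e-3)) * np.tanh(barrier * np.abs(drift))
times = times + 0.1 * rng.random(n)                          # small jitter, keeps times > 0
theta_hat = estimate_utility_from_choices_and_times(X, choices, times)
print("estimated direction:", theta_hat / np.linalg.norm(theta_hat))
print("true direction:     ", theta_true / np.linalg.norm(theta_true))
```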
Related papers
- Preference Learning with Response Time [18.659347526840822]
We propose novel methodologies to incorporate response time information alongside binary choice data. We develop Neyman-orthogonal loss functions that achieve oracle convergence rates for reward model learning. Our experiments validate our theoretical findings in the context of preference learning over images.
arXiv Detail & Related papers (2025-05-28T19:55:54Z)
- Rethinking Diverse Human Preference Learning through Principal Component Analysis [22.123631189289963]
We introduce Decomposed Reward Models (DRMs), a novel approach that extracts diverse human preferences from binary comparisons.
Our key insight is to represent human preferences as vectors and analyze them using Principal Component Analysis (PCA).
DRMs effectively extract meaningful preference dimensions (e.g., helpfulness, safety, humor) and adapt to new users without additional training.
arXiv Detail & Related papers (2025-02-18T18:55:26Z)
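The DRM construction itself is only summarized above; as a hypothetical illustration of the stated insight (preferences as vectors analyzed with PCA), the sketch below runs PCA on chosen-minus-rejected embedding differences and returns the leading directions as candidate preference axes. The embedding inputs, shapes, and function name are assumptions for illustration, not the paper's method.

```python
import numpy as np

def preference_directions(chosen_emb, rejected_emb, k=3):
    """Run PCA on chosen-minus-rejected embedding differences.

    chosen_emb, rejected_emb : (n, d) arrays of response embeddings for the
    preferred and dispreferred responses of n binary comparisons.
    Returns the top-k principal directions, each a candidate preference axis
    (e.g., helpfulness, safety, humor).
    """
    diffs = chosen_emb - rejected_emb        # one preference vector per comparison
    diffs = diffs - diffs.mean(axis=0)       # center before PCA
    # SVD of the centered matrix: rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(diffs, full_matrices=False)
    return Vt[:k]                            # (k, d) preference directions

# A new user or candidate response can then be scored against these axes,
# e.g. np.dot(candidate_embedding, preference_directions(C, R)[0]) ranks
# candidates along the dominant extracted dimension, without retraining.
```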
- Beyond the Binary: Capturing Diverse Preferences With Reward Regularization [15.518838657050173]
We argue that this reliance on binary choices does not capture the broader, aggregate preferences of the target user in real-world tasks.
We introduce a simple yet effective method that augments existing binary preference datasets with synthetic preference judgments to estimate potential user disagreement.
arXiv Detail & Related papers (2024-12-05T02:35:46Z)
- Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback [87.37721254914476]
We introduce a routing framework that combines inputs from humans and LMs to achieve better annotation quality.
We train a performance prediction model to predict a reward model's performance on an arbitrary combination of human and LM annotations.
We show that the selected hybrid mixture achieves better reward model performance compared to using either one exclusively.
arXiv Detail & Related papers (2024-10-24T20:04:15Z)
- Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [63.32585910975191]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset. We show that our approach consistently boosts DPO by a considerable margin. Our method not only maximizes the utility of preference data but also mitigates the issue of unlearning, demonstrating its broad effectiveness beyond mere data expansion.
arXiv Detail & Related papers (2024-10-10T16:01:51Z)
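The summary gives the reward-conditioning idea only at a high level; the snippet below is a purely illustrative sketch of one common way to realize it, prefixing each prompt with a coarse quality tag derived from the response's reward so that responses across the whole quality spectrum remain usable as training data. The tag format, data schema, and field names are assumptions rather than the paper's construction.

```python
from typing import Dict, List

def reward_condition(examples: List[Dict]) -> List[Dict]:
    """Turn scored (prompt, response, reward) triples into reward-conditioned
    supervised examples, so the model learns from the full quality spectrum.

    Each input dict is assumed to look like:
        {"prompt": str, "response": str, "reward": float in [0, 1]}
    """
    conditioned = []
    for ex in examples:
        # Discretize the reward into a coarse quality tag and prepend it to the
        # prompt; at inference time one conditions on the highest tag.
        tag = f"<reward:{round(ex['reward'] * 10)}/10>"
        conditioned.append({
            "prompt": f"{tag} {ex['prompt']}",
            "response": ex["response"],
        })
    return conditioned

# Example: both the good and the bad response are kept, each under its own tag.
data = [
    {"prompt": "Summarize the article.", "response": "A concise summary...", "reward": 0.9},
    {"prompt": "Summarize the article.", "response": "Unrelated rambling...", "reward": 0.2},
]
print(reward_condition(data))
```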
- LRHP: Learning Representations for Human Preferences via Preference Pairs [45.056558199304554]
We introduce a preference representation learning task that aims to construct a richer and more structured representation of human preferences.
We verify the utility of preference representations in two downstream tasks: preference data selection and preference margin prediction.
arXiv Detail & Related papers (2024-10-06T14:48:28Z)
- General Preference Modeling with Preference Representations for Aligning Language Models [51.14207112118503]
We introduce preference representation learning, an approach that embeds responses into a latent space to capture intricate preference structures efficiently.
We also propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback.
Our method may enhance the alignment of foundation models with nuanced human values.
arXiv Detail & Related papers (2024-10-03T04:22:55Z)
- Data-Centric Human Preference Optimization with Rationales [23.243583332894737]
Reinforcement learning from human feedback plays a crucial role in aligning language models towards human preferences.
This work shifts focus to improving preference learning through a data-centric approach.
We propose enriching existing preference datasets with machine-generated rationales that explain the reasons behind choices.
arXiv Detail & Related papers (2024-07-19T17:27:52Z)
- Cost-Effective Proxy Reward Model Construction with On-Policy and Active Learning [70.22819290458581]
Reinforcement learning with human feedback (RLHF) is a widely adopted approach in current large language model pipelines.
Our approach introduces two key innovations: (1) on-policy querying to avoid OOD and imbalance issues in seed data, and (2) active learning to select the most informative data for preference queries.
arXiv Detail & Related papers (2024-07-02T10:09:19Z)
- Aligning Large Language Models from Self-Reference AI Feedback with one General Principle [61.105703857868775]
We propose a self-reference-based AI feedback framework that enables a 13B Llama2-Chat to provide high-quality feedback.
Specifically, we allow the AI to first respond to the user's instructions, then generate criticism of other answers based on its own response as a reference.
Finally, we determine which answer better fits human preferences according to the criticism.
arXiv Detail & Related papers (2024-06-17T03:51:46Z)
- Online Self-Preferring Language Models [34.22412851864247]
Online Self-Preferring (OSP) language models learn from self-generated response pairs and self-judged preference strengths.
OSP achieves state-of-the-art alignment performance across various metrics in two widely used human preference datasets.
arXiv Detail & Related papers (2024-05-23T02:13:34Z)
- Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization [105.3612692153615]
A common technique for aligning large language models (LLMs) relies on acquiring human preferences.
We propose a new axis based on eliciting preferences jointly over instruction-response pairs.
We find that joint preferences over instruction and response pairs can significantly enhance the alignment of LLMs.
arXiv Detail & Related papers (2024-03-31T02:05:40Z)
- Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts [95.09994361995389]
Relative Preference Optimization (RPO) is designed to discern between more and less preferred responses derived from both identical and related prompts.
RPO has demonstrated a superior ability to align large language models with user preferences and to improve their adaptability during the training process.
arXiv Detail & Related papers (2024-02-12T22:47:57Z)
- Sample Efficient Preference Alignment in LLMs via Active Exploration [63.84454768573154]
We take advantage of the fact that one can often choose contexts at which to obtain human feedback to most efficiently identify a good policy.
We propose an active exploration algorithm that efficiently selects the data, and we prove a worst-case regret bound for it.
Our method outperforms the baselines with limited samples of human preferences on several language models and four real-world datasets.
arXiv Detail & Related papers (2023-12-01T00:54:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.