Reflective Verbal Reward Design for Pluralistic Alignment
- URL: http://arxiv.org/abs/2506.17834v1
- Date: Sat, 21 Jun 2025 22:04:11 GMT
- Title: Reflective Verbal Reward Design for Pluralistic Alignment
- Authors: Carter Blair, Kate Larson, Edith Law
- Abstract summary: We present a novel reward modeling approach for learning individualized reward models. Our approach uses a language model to guide users through reflective dialogues where they critique agent behavior and construct their preferences. In studies with 30 participants, our method achieved a 9-12% improvement in accuracy over non-reflective verbal reward models.
- Score: 10.1630183955549
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: AI agents are commonly aligned with "human values" through reinforcement learning from human feedback (RLHF), where a single reward model is learned from aggregated human feedback and used to align an agent's behavior. However, human values are not homogeneous--different people hold distinct and sometimes conflicting values. Aggregating feedback into a single reward model risks disproportionately suppressing minority preferences. To address this, we present a novel reward modeling approach for learning individualized reward models. Our approach uses a language model to guide users through reflective dialogues where they critique agent behavior and construct their preferences. This personalized dialogue history, containing the user's reflections and critiqued examples, is then used as context for another language model that serves as an individualized reward function (what we call a "verbal reward model") for evaluating new trajectories. In studies with 30 participants, our method achieved a 9-12% improvement in accuracy over non-reflective verbal reward models while being more sample efficient than traditional supervised learning methods.
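A minimal sketch of how such a "verbal reward model" could be queried follows. This is not the authors' implementation; `query_llm` is a hypothetical stand-in for any chat-completion call, and the prompt wording is illustrative only.

```python
# Sketch: an LLM conditioned on a user's reflective dialogue history acts as
# that user's individualized reward function for new agent trajectories.
from typing import Callable, List

def verbal_reward(
    dialogue_history: List[str],      # user's reflections and critiqued examples
    trajectory: str,                  # textual description of the new agent behavior
    query_llm: Callable[[str], str],  # hypothetical wrapper around a chat-completion API
) -> float:
    """Return a scalar reward in [0, 1] for the trajectory."""
    prompt = (
        "You are acting as this user's personalized reward function.\n"
        "Here is the user's reflective dialogue about their preferences:\n"
        + "\n".join(dialogue_history)
        + "\n\nRate how well the following agent trajectory matches these "
          "preferences on a scale from 0 (bad) to 1 (ideal). Reply with a number.\n"
        + trajectory
    )
    reply = query_llm(prompt)
    try:
        return max(0.0, min(1.0, float(reply.strip())))
    except ValueError:
        return 0.5  # fall back to a neutral score if the reply is not numeric
```

In a preference-learning loop, this scalar could stand in for a trained reward head when ranking or selecting candidate trajectories.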
Related papers
- Capturing Individual Human Preferences with Reward Features [47.43999785878563]
We show that individual preferences can be captured as a linear combination of a set of general reward features. We show how to learn such features and subsequently use them to quickly adapt the reward model to a specific individual. We present experiments with large language models comparing the proposed architecture with a non-adaptive reward model and also adaptive counterparts.
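As an illustration of the general idea (an assumption about the setup, not the paper's code), an individualized reward can be written as a weighted combination of shared reward features, with only the per-user weights fit from that user's pairwise comparisons:

```python
# Sketch: individual reward as a linear combination of shared reward features.
# The feature extractor is fixed; only the per-user weight vector w is fit.
import numpy as np

def fit_user_weights(features_a: np.ndarray,   # (n, d) features of preferred items
                     features_b: np.ndarray,   # (n, d) features of rejected items
                     lr: float = 0.1,
                     steps: int = 500) -> np.ndarray:
    """Fit w so that sigmoid(w @ (phi_a - phi_b)) models P(a preferred over b)."""
    w = np.zeros(features_a.shape[1])
    diff = features_a - features_b
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-diff @ w))       # Bradley-Terry style probability
        w += lr * diff.T @ (1.0 - p) / len(diff)  # gradient ascent on log-likelihood
    return w

def user_reward(w: np.ndarray, features_x: np.ndarray) -> float:
    return float(w @ features_x)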
arXiv Detail & Related papers (2025-03-21T17:39:33Z)
- Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning [12.742158403867002]
Reinforcement Learning from Human Feedback is a powerful paradigm for aligning foundation models to human values and preferences.
Current RLHF techniques cannot account for the naturally occurring differences in individual human preferences across a diverse population.
We develop a class of multimodal RLHF methods to address the need for pluralistic alignment.
arXiv Detail & Related papers (2024-08-19T15:18:30Z)
- Towards Understanding the Influence of Reward Margin on Preference Model Performance [8.891183078634786]
This study introduces a novel method to estimate the preference differences without the need for detailed, exhaustive labels from human annotators.
Our experimental results provide empirical evidence that incorporating margin values into the training process significantly improves the effectiveness of reward models.
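A common way to fold a margin into pairwise reward-model training (a generic sketch, not necessarily the paper's exact formulation) is to require the chosen response to beat the rejected one by at least the estimated margin:

```python
# Sketch: margin-augmented pairwise loss for reward-model training (PyTorch).
# `margin` would come from an estimate of how far apart the two responses are.
import torch
import torch.nn.functional as F

def margin_ranking_loss(r_chosen: torch.Tensor,
                        r_rejected: torch.Tensor,
                        margin: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected - margin), averaged over the batch."""
    return -F.logsigmoid(r_chosen - r_rejected - margin).mean()
```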
arXiv Detail & Related papers (2024-04-07T12:10:04Z)
- RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and code-base for evaluation of reward models.
The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety.
On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
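Evaluation on such prompt-chosen-rejected trios typically reduces to checking how often a reward model scores the chosen response above the rejected one; a hedged sketch, with `score` standing in for any reward model:

```python
# Sketch: accuracy of a reward model on (prompt, chosen, rejected) trios.
# `score` is any callable mapping (prompt, response) -> float.
from typing import Callable, Iterable, Tuple

def pairwise_accuracy(score: Callable[[str, str], float],
                      trios: Iterable[Tuple[str, str, str]]) -> float:
    correct, total = 0, 0
    for prompt, chosen, rejected in trios:
        correct += score(prompt, chosen) > score(prompt, rejected)
        total += 1
    return correct / max(total, 1)
```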
arXiv Detail & Related papers (2024-03-20T17:49:54Z)
- Beyond Sparse Rewards: Enhancing Reinforcement Learning with Language Model Critique in Text Generation [29.6763730290473]
Reinforcement learning can align language models with non-differentiable reward signals, such as human preferences.
This paper introduces a novel framework that utilizes the critique capability of Large Language Models to produce intermediate-step rewards.
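The idea of intermediate-step rewards from an LLM critic could look roughly like the following (the `critique_llm` callable is a hypothetical stand-in, not the paper's code):

```python
# Sketch: dense per-step rewards from an LLM critic instead of a single
# sparse end-of-sequence reward. `critique_llm` is a hypothetical callable
# mapping (prompt, partial_text) to a numeric quality judgment.
from typing import Callable, List

def stepwise_rewards(prompt: str,
                     segments: List[str],
                     critique_llm: Callable[[str, str], float]) -> List[float]:
    rewards, partial = [], ""
    for seg in segments:
        partial += seg
        rewards.append(critique_llm(prompt, partial))  # reward after each segment
    return rewards
```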
arXiv Detail & Related papers (2024-01-14T22:05:11Z)
- Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
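A generic contrastive objective over one chosen and several rejected responses (a sketch of the general technique, not the paper's exact loss) can be written as a softmax cross-entropy in which the chosen response should receive the highest score:

```python
# Sketch: contrastive (InfoNCE-style) objective for a reward model (PyTorch).
# scores: (batch, 1 + k) rewards where column 0 is the chosen response and
# the remaining k columns are rejected responses for the same prompt.
import torch
import torch.nn.functional as F

def contrastive_reward_loss(scores: torch.Tensor) -> torch.Tensor:
    targets = torch.zeros(scores.size(0), dtype=torch.long)  # chosen is index 0
    return F.cross_entropy(scores, targets)
```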
arXiv Detail & Related papers (2024-01-11T17:56:59Z)
- Loose lips sink ships: Mitigating Length Bias in Reinforcement Learning from Human Feedback [55.78118035358662]
Reinforcement learning from human feedback serves as a crucial bridge, aligning large language models with human and societal values.
We have identified that the reward model often finds shortcuts to bypass its intended objectives.
We propose an innovative solution, applying the Product-of-Experts technique to separate reward modeling from the influence of sequence length.
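In a Product-of-Experts setup of this kind, one expert scores content while a second expert sees only sequence length and is allowed to absorb the length bias; their rewards are combined during training and the length expert is dropped at inference. The following is a rough sketch under those assumptions, not the paper's implementation:

```python
# Sketch: Product-of-Experts reward training to factor out length bias (PyTorch).
# r_content is the expert we keep; the length expert sees only sequence length
# and is discarded after training, leaving a length-debiased reward model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LengthExpert(nn.Module):
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, lengths: torch.Tensor) -> torch.Tensor:  # (batch, 1) floats
        return self.mlp(lengths).squeeze(-1)

def poe_pairwise_loss(r_content_chosen, r_content_rejected,
                      len_chosen, len_rejected, length_expert):
    # During training the experts' rewards are summed (a product of experts in
    # probability space); at inference only the content reward is used.
    total_chosen = r_content_chosen + length_expert(len_chosen)
    total_rejected = r_content_rejected + length_expert(len_rejected)
    return -F.logsigmoid(total_chosen - total_rejected).mean()
```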
arXiv Detail & Related papers (2023-10-08T15:14:39Z)
- Chain of Hindsight Aligns Language Models with Feedback [62.68665658130472]
We propose a novel technique, Chain of Hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity.
We convert all types of feedback into sequences of sentences, which are then used to fine-tune the model.
By doing so, the model is trained to generate outputs based on feedback, while learning to identify and correct negative attributes or errors.
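The fine-tuning data format implied here pairs better and worse outputs with natural-language feedback phrases in a single sequence; a hedged sketch of how such training examples might be assembled (the phrasing is illustrative, not the paper's templates):

```python
# Sketch: turning comparison feedback into Chain-of-Hindsight-style training
# text, where the model is conditioned on feedback phrases and learns to
# associate them with the corresponding good or bad answer.
from typing import List, Tuple

def build_hindsight_examples(data: List[Tuple[str, str, str]]) -> List[str]:
    """data contains (prompt, good_answer, bad_answer) triples."""
    examples = []
    for prompt, good, bad in data:
        examples.append(
            f"{prompt}\nA helpful answer: {good}\nAn unhelpful answer: {bad}"
        )
        examples.append(  # reversed ordering so the model sees both polarities
            f"{prompt}\nAn unhelpful answer: {bad}\nA helpful answer: {good}"
        )
    return examples
```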
arXiv Detail & Related papers (2023-02-06T10:28:16Z)
- Fully Unsupervised Person Re-identification via Selective Contrastive Learning [58.5284246878277]
Person re-identification (ReID) aims to match images of the same person captured by different cameras.
We propose a novel selective contrastive learning framework for unsupervised feature learning.
Experimental results demonstrate the superiority of our method in unsupervised person ReID compared with state-of-the-art approaches.
arXiv Detail & Related papers (2020-10-15T09:09:23Z)