SharedRep-RLHF: A Shared Representation Approach to RLHF with Diverse Preferences
- URL: http://arxiv.org/abs/2509.03672v1
- Date: Wed, 03 Sep 2025 19:42:50 GMT
- Title: SharedRep-RLHF: A Shared Representation Approach to RLHF with Diverse Preferences
- Authors: Arpan Mukherjee, Marcello Bullo, Deniz Gündüz
- Abstract summary: Uniform-reward reinforcement learning from human feedback (RLHF) fails to capture the diversity of opinions across sub-populations. We introduce a novel framework, termed SharedRep-RLHF, to mitigate this drawback. We show that MaxMin-RLHF is provably suboptimal in learning shared traits, and then quantify the sample complexity of SharedRep-RLHF.
- Score: 42.88222564741455
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Uniform-reward reinforcement learning from human feedback (RLHF), which trains a single reward model to represent the preferences of all annotators, fails to capture the diversity of opinions across sub-populations, inadvertently favoring dominant groups. The state-of-the-art, MaxMin-RLHF, addresses this by learning group-specific reward models, and by optimizing for the group receiving the minimum reward, thereby promoting fairness. However, we identify that a key limitation of MaxMin-RLHF is its poor performance when the minimum-reward group is a minority. To mitigate this drawback, we introduce a novel framework, termed SharedRep-RLHF. At its core, SharedRep-RLHF learns and leverages shared traits in annotations among various groups, in contrast to learning separate reward models across groups. We first show that MaxMin-RLHF is provably suboptimal in learning shared traits, and then quantify the sample complexity of SharedRep-RLHF. Experiments across diverse natural language tasks showcase the effectiveness of SharedRep-RLHF compared to MaxMin-RLHF with a gain of up to 20% in win rate.
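The contrast between the two objectives can be illustrated with a short sketch. The code below is a minimal, hypothetical PyTorch rendering, not the paper's implementation: a reward model with a shared trunk and lightweight group-specific heads stands in for SharedRep-RLHF's shared traits, and a max-min aggregation over per-group rewards stands in for the MaxMin-RLHF objective. All names, dimensions, and the aggregation code are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal, hypothetical sketch (not the paper's implementation): a reward model
# with a shared trunk ("shared traits") and lightweight group-specific heads,
# plus the max-min aggregation used by MaxMin-RLHF for comparison.
class SharedRepRewardModel(nn.Module):
    def __init__(self, emb_dim: int = 768, n_groups: int = 2, shared_dim: int = 128):
        super().__init__()
        # One trunk is trained on all groups' annotations ...
        self.shared = nn.Sequential(nn.Linear(emb_dim, shared_dim), nn.ReLU())
        # ... so each group only needs a small scalar head, letting a minority
        # group benefit from data contributed by the majority.
        self.heads = nn.ModuleList(nn.Linear(shared_dim, 1) for _ in range(n_groups))

    def forward(self, x: torch.Tensor, group: int) -> torch.Tensor:
        return self.heads[group](self.shared(x)).squeeze(-1)

def maxmin_value(group_rewards) -> torch.Tensor:
    # MaxMin-RLHF-style aggregation: the policy is judged by the worst-off
    # group's expected reward.
    return torch.stack([r.mean() for r in group_rewards]).min()

rm = SharedRepRewardModel()
x = torch.randn(4, 768)                         # placeholder response embeddings
rewards = [rm(x, group=g) for g in range(2)]    # per-group rewards for the batch
print(maxmin_value(rewards))
```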
Related papers
- Generative RLHF-V: Learning Principles from Multi-modal Human Preference [15.068452240642884]
We introduce Generative RLHF-V, a novel alignment framework that integrates GRMs with multi-modal RLHF. We propose a two-stage pipeline: multi-modal generative reward modeling from RL, where RL guides GRMs to actively capture human intention and then predict the correct pair-wise scores. Our framework improves 4 MLLMs' performance across 7 benchmarks by 18.1%, while baseline RLHF improves it by only 5.3%.
arXiv Detail & Related papers (2025-05-24T05:50:07Z) - Reward Shaping to Mitigate Reward Hacking in RLHF [47.71454266800376]
Preference As Reward (PAR) is a novel approach that leverages the latent preferences embedded within the reward model as the signal for reinforcement learning. On the AlpacaEval 2.0 benchmark, PAR achieves a win rate at least 5 percentage points higher than competing approaches.
arXiv Detail & Related papers (2025-02-26T02:57:59Z) - Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model [96.20350225621813]
Reinforcement learning from human feedback (RLHF) has been widely adopted to align language models (LMs) with human preferences. In this paper, we seek to get the best of both by training and utilizing a segment-level reward model.
arXiv Detail & Related papers (2025-01-06T06:17:56Z) - RED: Unleashing Token-Level Rewards from Holistic Feedback via Reward Redistribution [50.171320156632866]
Reinforcement learning from human feedback offers a promising approach to aligning large language models with human preferences. Current reward models operate as sequence-to-one models, allocating a single, sparse, and delayed reward to an entire output sequence. We propose a more fine-grained, token-level guidance approach for RL training.
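As a rough illustration of the reward-redistribution idea, the sketch below spreads a single sequence-level reward over tokens using softmax-normalized per-token credit scores so that RL sees dense feedback. The credit weighting is a placeholder assumption, not RED's actual redistribution rule.

```python
import torch

# Rough sketch of reward redistribution (the softmax credit weighting is an
# assumption, not RED's actual rule): spread a single sequence-level reward
# over tokens so that RL training receives dense, token-level feedback.
def redistribute_reward(seq_reward: torch.Tensor, token_credit: torch.Tensor) -> torch.Tensor:
    """seq_reward: scalar reward; token_credit: (T,) unnormalized per-token credit."""
    weights = torch.softmax(token_credit, dim=-1)   # normalize credit over tokens
    return seq_reward * weights                     # per-token rewards summing to seq_reward

token_rewards = redistribute_reward(torch.tensor(1.5), torch.randn(8))
print(token_rewards, token_rewards.sum())           # the sum recovers the sequence reward
```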
arXiv Detail & Related papers (2024-11-13T02:45:21Z) - UNA: Unifying Alignments of RLHF/PPO, DPO and KTO by a Generalized Implicit Reward Function [14.7365465149829]
We propose UNified Alignment (UNA), which unifies RLHF/PPO, DPO, and KTO. With this novel mapping between a reward model and an optimal policy, UNA can outperform RLHF/PPO while simplifying, stabilizing, speeding up, and reducing the memory burden of the RL fine-tuning process.
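UNA's unification rests on a mapping between a reward model and the optimal policy. The sketch below shows the DPO-style implicit reward, r(x, y) = beta * (log pi(y|x) - log pi_ref(y|x)) up to a prompt-only constant, which this line of work builds on; UNA's generalized mapping differs in detail, so treat this as background rather than the paper's formula.

```python
import torch

# Background sketch, not UNA's exact formula: the DPO-style implicit reward
# r(x, y) = beta * (log pi(y|x) - log pi_ref(y|x)) + const(x) that links a
# reward model to the optimal KL-regularized policy; UNA generalizes this mapping.
def implicit_reward(logp_policy: torch.Tensor, logp_ref: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    """Summed response log-probs under the policy and a frozen reference model."""
    return beta * (logp_policy - logp_ref)

# Illustrative log-probabilities for four candidate responses.
print(implicit_reward(torch.tensor([-12.3, -9.8, -15.1, -11.0]),
                      torch.tensor([-13.0, -10.5, -14.0, -11.2])))
```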
arXiv Detail & Related papers (2024-08-27T18:04:07Z) - Provable Multi-Party Reinforcement Learning with Diverse Human Feedback [63.830731470186855]
Reinforcement learning with human feedback (RLHF) is an emerging paradigm to align models with human preferences.
We show how traditional RLHF approaches can fail since learning a single reward function cannot capture and balance the preferences of multiple individuals.
We incorporate meta-learning to learn multiple preferences and adopt different social welfare functions to aggregate the preferences across multiple parties.
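As a sketch of what aggregating preferences with different social welfare functions can look like, the snippet below combines per-group utilities with utilitarian (mean), egalitarian (min), and Nash (log-product) welfare. The specific functions and utility values are illustrative assumptions, not necessarily the paper's choices.

```python
import torch

# Illustrative social welfare functions over per-group utilities; the exact
# choices in the paper may differ, so treat these as generic examples.
def utilitarian(u: torch.Tensor) -> torch.Tensor:
    return u.mean()            # average utility: favors the majority

def egalitarian(u: torch.Tensor) -> torch.Tensor:
    return u.min()             # worst-off group: the max-min criterion

def nash_welfare(u: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    return torch.log(u.clamp_min(eps)).sum()   # log of the product of utilities

group_utilities = torch.tensor([0.9, 0.2, 0.6])     # hypothetical per-group rewards
for welfare in (utilitarian, egalitarian, nash_welfare):
    print(welfare.__name__, float(welfare(group_utilities)))
```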
arXiv Detail & Related papers (2024-03-08T03:05:11Z) - Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble [67.4269821365504]
Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values.
However, RLHF relies on a reward model that is trained with a limited amount of human preference data.
We contribute a reward ensemble method that allows the reward model to make more accurate predictions.
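One common way to realize a reward ensemble, shown below as a hedged sketch rather than the paper's method, is to train several reward models independently and aggregate their scores conservatively (mean minus a multiple of the ensemble standard deviation), so responses the models disagree on are penalized.

```python
import torch
import torch.nn as nn

# Hedged sketch of a reward ensemble (the aggregation rule is an illustrative
# choice, not necessarily the paper's): score a response with several
# independently trained reward models and penalize ensemble disagreement.
def ensemble_reward(models, x: torch.Tensor, k: float = 1.0) -> torch.Tensor:
    scores = torch.stack([m(x).squeeze(-1) for m in models])    # (n_models, batch)
    return scores.mean(dim=0) - k * scores.std(dim=0)           # conservative estimate

toy_models = [nn.Linear(16, 1) for _ in range(4)]   # stand-ins for trained reward models
print(ensemble_reward(toy_models, torch.randn(3, 16)))
```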
arXiv Detail & Related papers (2024-01-30T00:17:37Z) - RRHF: Rank Responses to Align Language Models with Human Feedback without tears [69.68672043223249]
InstructGPT implements RLHF through several stages, including Supervised Fine-Tuning (SFT), reward model training, and Proximal Policy Optimization (PPO).
We propose a novel learning paradigm called RRHF, which scores sampled responses from different sources via a logarithm of conditional probabilities.
We evaluate RRHF on the Helpful and Harmless dataset, demonstrating alignment performance comparable to PPO, as measured by reward model score and human labeling.
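A minimal sketch of an RRHF-style ranking loss is given below: each sampled response is scored by its length-normalized log conditional probability under the model, and any pair whose score order contradicts the reward ranking is penalized. The margin-free hinge and the omission of the auxiliary SFT term on the best response are simplifications, so this approximates the idea rather than reproducing the paper's exact loss.

```python
import torch

# Simplified sketch of an RRHF-style ranking loss (margins and the auxiliary
# SFT term on the best response are omitted): responses are scored by their
# length-normalized log conditional probabilities, and pairs whose score order
# contradicts the reward ranking are penalized.
def rrhf_ranking_loss(logprob_scores: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Both tensors have shape (n_responses,) for a single prompt."""
    loss = logprob_scores.new_zeros(())
    n = rewards.numel()
    for i in range(n):
        for j in range(n):
            if rewards[i] > rewards[j]:
                # The model's score for the preferred response i should not be
                # lower than its score for the dispreferred response j.
                loss = loss + torch.relu(logprob_scores[j] - logprob_scores[i])
    return loss

scores = torch.tensor([-1.2, -0.8, -2.0])   # length-normalized log-probs of 3 samples
rewards = torch.tensor([0.3, 0.9, 0.1])     # reward-model scores of the same samples
print(rrhf_ranking_loss(scores, rewards))
```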
arXiv Detail & Related papers (2023-04-11T15:53:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.