SharedRep-RLHF: A Shared Representation Approach to RLHF with Diverse Preferences
- URL: http://arxiv.org/abs/2509.03672v1
- Date: Wed, 03 Sep 2025 19:42:50 GMT
- Title: SharedRep-RLHF: A Shared Representation Approach to RLHF with Diverse Preferences
- Authors: Arpan Mukherjee, Marcello Bullo, Deniz Gündüz
- Abstract summary: Uniform-reward reinforcement learning from human feedback (RLHF) fails to capture the diversity of opinions across sub-populations. We introduce a novel framework, termed SharedRep-RLHF, to mitigate this drawback. We show that MaxMin-RLHF is provably suboptimal in learning shared traits, and then quantify the sample complexity of SharedRep-RLHF.
- Score: 42.88222564741455
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Uniform-reward reinforcement learning from human feedback (RLHF), which trains a single reward model to represent the preferences of all annotators, fails to capture the diversity of opinions across sub-populations, inadvertently favoring dominant groups. The state-of-the-art, MaxMin-RLHF, addresses this by learning group-specific reward models, and by optimizing for the group receiving the minimum reward, thereby promoting fairness. However, we identify that a key limitation of MaxMin-RLHF is its poor performance when the minimum-reward group is a minority. To mitigate this drawback, we introduce a novel framework, termed SharedRep-RLHF. At its core, SharedRep-RLHF learns and leverages shared traits in annotations among various groups, in contrast to learning separate reward models across groups. We first show that MaxMin-RLHF is provably suboptimal in learning shared traits, and then quantify the sample complexity of SharedRep-RLHF. Experiments across diverse natural language tasks showcase the effectiveness of SharedRep-RLHF compared to MaxMin-RLHF with a gain of up to 20% in win rate.
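The contrast between the two objectives can be illustrated with a short sketch. The code below is a minimal, hypothetical PyTorch rendering, not the paper's implementation: a reward model with a shared trunk and lightweight group-specific heads stands in for SharedRep-RLHF's shared traits, and a max-min aggregation over per-group rewards stands in for the MaxMin-RLHF objective. All names, dimensions, and the aggregation code are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal, hypothetical sketch (not the paper's implementation): a reward model
# with a shared trunk ("shared traits") and lightweight group-specific heads,
# plus the max-min aggregation used by MaxMin-RLHF for comparison.
class SharedRepRewardModel(nn.Module):
    def __init__(self, emb_dim: int = 768, n_groups: int = 2, shared_dim: int = 128):
        super().__init__()
        # One trunk is trained on all groups' annotations ...
        self.shared = nn.Sequential(nn.Linear(emb_dim, shared_dim), nn.ReLU())
        # ... so each group only needs a small scalar head, letting a minority
        # group benefit from data contributed by the majority.
        self.heads = nn.ModuleList(nn.Linear(shared_dim, 1) for _ in range(n_groups))

    def forward(self, x: torch.Tensor, group: int) -> torch.Tensor:
        return self.heads[group](self.shared(x)).squeeze(-1)

def maxmin_value(group_rewards) -> torch.Tensor:
    # MaxMin-RLHF-style aggregation: the policy is judged by the worst-off
    # group's expected reward.
    return torch.stack([r.mean() for r in group_rewards]).min()

rm = SharedRepRewardModel()
x = torch.randn(4, 768)                         # placeholder response embeddings
rewards = [rm(x, group=g) for g in range(2)]    # per-group rewards for the batch
print(maxmin_value(rewards))
```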
Related papers
- Generative RLHF-V: Learning Principles from Multi-modal Human Preference [15.068452240642884]
We introduce Generative RLHF-V, a novel alignment framework that integrates GRMs with multi-modal RLHF. We propose a two-stage pipeline: multi-modal generative reward modeling from RL, where RL guides GRMs to actively capture human intention and then predict the correct pair-wise scores. Our framework improves 4 MLLMs' performance across 7 benchmarks by 18.1%, while baseline RLHF improves it by only 5.3%.
arXiv Detail & Related papers (2025-05-24T05:50:07Z) - Reward Shaping to Mitigate Reward Hacking in RLHF [47.71454266800376]
Preference As Reward (PAR) is a novel approach that leverages the latent preferences embedded within the reward model as the signal for reinforcement learning. On the AlpacaEval 2.0 benchmark, PAR achieves a win rate at least 5 percentage points higher than competing approaches.
arXiv Detail & Related papers (2025-02-26T02:57:59Z) - Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model [96.20350225621813]
Reinforcement learning from human feedback (RLHF) has been widely adopted to align language models (LMs) with human preferences. In this paper, we seek to get the best of both by training and utilizing a segment-level reward model.
arXiv Detail & Related papers (2025-01-06T06:17:56Z) - RED: Unleashing Token-Level Rewards from Holistic Feedback via Reward Redistribution [50.171320156632866]
Reinforcement learning from human feedback offers a promising approach to aligning large language models with human preferences. Current reward models operate as sequence-to-one models, allocating a single, sparse, and delayed reward to an entire output sequence. We propose a more fine-grained, token-level guidance approach for RL training.
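As a rough illustration of the reward-redistribution idea, the sketch below spreads a single sequence-level reward over tokens using softmax-normalized per-token credit scores so that RL sees dense feedback. The credit weighting is a placeholder assumption, not RED's actual redistribution rule.

```python
import torch

# Rough sketch of reward redistribution (the softmax credit weighting is an
# assumption, not RED's actual rule): spread a single sequence-level reward
# over tokens so that RL training receives dense, token-level feedback.
def redistribute_reward(seq_reward: torch.Tensor, token_credit: torch.Tensor) -> torch.Tensor:
    """seq_reward: scalar reward; token_credit: (T,) unnormalized per-token credit."""
    weights = torch.softmax(token_credit, dim=-1)   # normalize credit over tokens
    return seq_reward * weights                     # per-token rewards summing to seq_reward

token_rewards = redistribute_reward(torch.tensor(1.5), torch.randn(8))
print(token_rewards, token_rewards.sum())           # the sum recovers the sequence reward
```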
arXiv Detail & Related papers (2024-11-13T02:45:21Z) - UNA: Unifying Alignments of RLHF/PPO, DPO and KTO by a Generalized Implicit Reward Function [14.7365465149829]
We propose UNified Alignment (UNA), which unifies RLHF/PPO, DPO, and KTO. With this novel mapping between a reward model and an optimal policy, UNA can outperform RLHF/PPO while simplifying, stabilizing, speeding up, and reducing the memory burden of the RL fine-tuning process.
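UNA's unification rests on a mapping between a reward model and the optimal policy. The sketch below shows the DPO-style implicit reward, r(x, y) = beta * (log pi(y|x) - log pi_ref(y|x)) up to a prompt-only constant, which this line of work builds on; UNA's generalized mapping differs in detail, so treat this as background rather than the paper's formula.

```python
import torch

# Background sketch, not UNA's exact formula: the DPO-style implicit reward
# r(x, y) = beta * (log pi(y|x) - log pi_ref(y|x)) + const(x) that links a
# reward model to the optimal KL-regularized policy; UNA generalizes this mapping.
def implicit_reward(logp_policy: torch.Tensor, logp_ref: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    """Summed response log-probs under the policy and a frozen reference model."""
    return beta * (logp_policy - logp_ref)

# Illustrative log-probabilities for four candidate responses.
print(implicit_reward(torch.tensor([-12.3, -9.8, -15.1, -11.0]),
                      torch.tensor([-13.0, -10.5, -14.0, -11.2])))
```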
arXiv Detail & Related papers (2024-08-27T18:04:07Z) - Provable Multi-Party Reinforcement Learning with Diverse Human Feedback [63.830731470186855]
Reinforcement learning with human feedback (RLHF) is an emerging paradigm to align models with human preferences.
We show how traditional RLHF approaches can fail since learning a single reward function cannot capture and balance the preferences of multiple individuals.
We incorporate meta-learning to learn multiple preferences and adopt different social welfare functions to aggregate the preferences across multiple parties.
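As a sketch of what aggregating preferences with different social welfare functions can look like, the snippet below combines per-group utilities with utilitarian (mean), egalitarian (min), and Nash (log-product) welfare. The specific functions and utility values are illustrative assumptions, not necessarily the paper's choices.

```python
import torch

# Illustrative social welfare functions over per-group utilities; the exact
# choices in the paper may differ, so treat these as generic examples.
def utilitarian(u: torch.Tensor) -> torch.Tensor:
    return u.mean()            # average utility: favors the majority

def egalitarian(u: torch.Tensor) -> torch.Tensor:
    return u.min()             # worst-off group: the max-min criterion

def nash_welfare(u: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    return torch.log(u.clamp_min(eps)).sum()   # log of the product of utilities

group_utilities = torch.tensor([0.9, 0.2, 0.6])     # hypothetical per-group rewards
for welfare in (utilitarian, egalitarian, nash_welfare):
    print(welfare.__name__, float(welfare(group_utilities)))
```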
arXiv Detail & Related papers (2024-03-08T03:05:11Z) - Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble [67.4269821365504]
Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values.
However, RLHF relies on a reward model that is trained with a limited amount of human preference data.
We contribute a reward ensemble method that allows the reward model to make more accurate predictions.
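One common way to realize a reward ensemble, shown below as a hedged sketch rather than the paper's method, is to train several reward models independently and aggregate their scores conservatively (mean minus a multiple of the ensemble standard deviation), so responses the models disagree on are penalized.

```python
import torch
import torch.nn as nn

# Hedged sketch of a reward ensemble (the aggregation rule is an illustrative
# choice, not necessarily the paper's): score a response with several
# independently trained reward models and penalize ensemble disagreement.
def ensemble_reward(models, x: torch.Tensor, k: float = 1.0) -> torch.Tensor:
    scores = torch.stack([m(x).squeeze(-1) for m in models])    # (n_models, batch)
    return scores.mean(dim=0) - k * scores.std(dim=0)           # conservative estimate

toy_models = [nn.Linear(16, 1) for _ in range(4)]   # stand-ins for trained reward models
print(ensemble_reward(toy_models, torch.randn(3, 16)))
```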
arXiv Detail & Related papers (2024-01-30T00:17:37Z) - RRHF: Rank Responses to Align Language Models with Human Feedback without tears [69.68672043223249]
InstructGPT implements RLHF through several stages, including Supervised Fine-Tuning (SFT), reward model training, and Proximal Policy Optimization (PPO).
We propose a novel learning paradigm called RRHF, which scores sampled responses from different sources via a logarithm of conditional probabilities.
We evaluate RRHF on the Helpful and Harmless dataset, demonstrating alignment performance comparable to PPO, as measured by reward model score and human labeling.
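A minimal sketch of an RRHF-style ranking loss is given below: each sampled response is scored by its length-normalized log conditional probability under the model, and any pair whose score order contradicts the reward ranking is penalized. The margin-free hinge and the omission of the auxiliary SFT term on the best response are simplifications, so this approximates the idea rather than reproducing the paper's exact loss.

```python
import torch

# Simplified sketch of an RRHF-style ranking loss (margins and the auxiliary
# SFT term on the best response are omitted): responses are scored by their
# length-normalized log conditional probabilities, and pairs whose score order
# contradicts the reward ranking are penalized.
def rrhf_ranking_loss(logprob_scores: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Both tensors have shape (n_responses,) for a single prompt."""
    loss = logprob_scores.new_zeros(())
    n = rewards.numel()
    for i in range(n):
        for j in range(n):
            if rewards[i] > rewards[j]:
                # The model's score for the preferred response i should not be
                # lower than its score for the dispreferred response j.
                loss = loss + torch.relu(logprob_scores[j] - logprob_scores[i])
    return loss

scores = torch.tensor([-1.2, -0.8, -2.0])   # length-normalized log-probs of 3 samples
rewards = torch.tensor([0.3, 0.9, 0.1])     # reward-model scores of the same samples
print(rrhf_ranking_loss(scores, rewards))
```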
arXiv Detail & Related papers (2023-04-11T15:53:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.