Disentangling Length Bias In Preference Learning Via Response-Conditioned Modeling
- URL: http://arxiv.org/abs/2502.00814v2
- Date: Mon, 19 May 2025 08:24:51 GMT
- Title: Disentangling Length Bias In Preference Learning Via Response-Conditioned Modeling
- Authors: Jianfeng Cai, Jinhua Zhu, Ruopei Sun, Yue Wang, Li Li, Wengang Zhou, Houqiang Li
- Abstract summary: Reinforcement Learning from Human Feedback (RLHF) has achieved considerable success in aligning large language models (LLMs). We introduce a $\textbf{R}$esponse-$\textbf{c}$onditioned $\textbf{B}$radley-$\textbf{T}$erry (Rc-BT) model that enhances the model's capability in mitigating length bias and following length instructions. We also propose the Rc-RM and Rc-DPO algorithms to leverage the Rc-BT model for reward modeling and direct policy optimization.
- Score: 87.17041933863041
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Reinforcement Learning from Human Feedback (RLHF) has achieved considerable success in aligning large language models (LLMs) by modeling human preferences with a learnable reward model and employing a reinforcement learning algorithm to maximize the reward model's scores. However, these reward models are susceptible to exploitation through various superficial confounding factors, with length bias emerging as a particularly significant concern. Moreover, while the pronounced impact of length bias on preference modeling suggests that LLMs possess an inherent sensitivity to length perception, our preliminary investigations reveal that fine-tuned LLMs consistently struggle to adhere to explicit length instructions. To address these two limitations, we propose a novel framework wherein the reward model explicitly differentiates between human semantic preferences and response length requirements. Specifically, we introduce a $\textbf{R}$esponse-$\textbf{c}$onditioned $\textbf{B}$radley-$\textbf{T}$erry (Rc-BT) model that, through training on our augmented dataset, enhances the model's capability in mitigating length bias and following length instructions. Furthermore, we propose the Rc-RM and Rc-DPO algorithms, which leverage the Rc-BT model for reward modeling and direct preference optimization (DPO) of LLMs, simultaneously mitigating length bias and promoting adherence to length instructions. Extensive experiments across various foundational models and datasets demonstrate the effectiveness and generalizability of our approach.
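The training signal behind Rc-BT is still a pairwise Bradley-Terry comparison; what the paper changes is the conditioning, via an augmented dataset in which length requirements are made explicit. The snippet below is a minimal, illustrative sketch of a standard pairwise Bradley-Terry reward-modeling loss, not the authors' implementation: `ToyRewardModel`, the dummy pooled features, and the hidden size are placeholders standing in for a real LLM backbone encoding (prompt + hypothetical length instruction, response) pairs.

```python
# Minimal sketch (not the authors' code): a pairwise Bradley-Terry preference
# loss. In a response-conditioned setup, the prompt would be augmented with an
# explicit length instruction so that length requirements live in the condition
# rather than being exploited as a reward shortcut. The feature extractor here
# is a stand-in for a real LLM backbone.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyRewardModel(nn.Module):
    """Stand-in reward model: maps a pooled representation to a scalar reward."""

    def __init__(self, hidden_dim: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.Tanh())
        self.reward_head = nn.Linear(hidden_dim, 1)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        return self.reward_head(self.backbone(pooled)).squeeze(-1)


def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Standard pairwise Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(r_chosen - r_rejected).mean()


# Dummy pooled features standing in for encodings of
# (prompt + length instruction, response) pairs.
batch, hidden = 8, 64
pooled_chosen = torch.randn(batch, hidden)    # preferred responses
pooled_rejected = torch.randn(batch, hidden)  # dispreferred responses

rm = ToyRewardModel(hidden)
loss = bradley_terry_loss(rm(pooled_chosen), rm(pooled_rejected))
loss.backward()
print(f"pairwise BT loss: {loss.item():.4f}")
```

In the paper's setup, it is the response-conditioned comparison over augmented data that disentangles semantic preference from length; only the generic pairwise loss is shown here.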
Related papers
- Crowd-SFT: Crowdsourcing for LLM Alignment [4.648677931378919]
Large Language Models (LLMs) increasingly rely on Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF). We propose an open, crowd-sourced fine-tuning framework that enables broader feedback collection for SFT without extensive annotator training. Our framework promotes incentive fairness via a point-based reward system correlated with Shapley values and guides model convergence through iterative model updates.
arXiv Detail & Related papers (2025-06-04T15:26:38Z) - Bias Fitting to Mitigate Length Bias of Reward Model in RLHF [81.44256822500257]
Reinforcement Learning from Human Feedback relies on reward models to align large language models with human preferences. We propose FiMi-RM, a framework that autonomously learns and corrects underlying bias patterns. Experimental results demonstrate that FiMi-RM achieves a more balanced length-reward distribution.
arXiv Detail & Related papers (2025-05-19T08:29:28Z) - Detecting Prefix Bias in LLM-based Reward Models [4.596249232904721]
We introduce novel methods to detect and evaluate prefix bias in reward models trained on preference datasets. We leverage these metrics to reveal significant biases in preference models across racial and gender dimensions. Our findings highlight the critical need for bias-aware dataset design and evaluation in developing fair and reliable reward models.
arXiv Detail & Related papers (2025-05-13T21:50:03Z) - IPO: Your Language Model is Secretly a Preference Classifier [1.8921784053120494]
Reinforcement learning from human feedback (RLHF) has emerged as the primary method for aligning large language models with human preferences.
We propose Implicit Preference Optimization (IPO), an alternative approach that leverages generative language models as preference classifiers.
Our findings demonstrate that models trained through IPO achieve performance comparable to those utilizing state-of-the-art reward models for obtaining preferences.
arXiv Detail & Related papers (2025-02-22T10:59:11Z) - On the Modeling Capabilities of Large Language Models for Sequential Decision Making [52.128546842746246]
Large pretrained models are showing increasingly better performance in reasoning and planning tasks.
We evaluate their ability to produce decision-making policies, either directly, by generating actions, or indirectly.
In environments with unfamiliar dynamics, we explore how fine-tuning LLMs with synthetic data can significantly improve their reward modeling capabilities.
arXiv Detail & Related papers (2024-10-08T03:12:57Z) - Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness [27.43137305486112]
We propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss.
The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods to achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-09-26T12:37:26Z) - Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning [78.72226641279863]
Sparse Mixture of Expert (SMoE) models have emerged as a scalable alternative to dense models in language modeling.
Our research explores task-specific model pruning to inform decisions about designing SMoE architectures.
We introduce an adaptive task-aware pruning technique UNCURL to reduce the number of experts per MoE layer in an offline manner post-training.
arXiv Detail & Related papers (2024-09-02T22:35:03Z) - TSO: Self-Training with Scaled Preference Optimization [14.3799656174528]
We propose TSO, a framework for preference optimization that conducts self-training preference learning without training an additional reward model.
TSO enhances the diversity of responses by constructing a model matrix and incorporating human preference responses.
Experimental results demonstrate that TSO outperforms existing mainstream methods on various alignment evaluation benchmarks.
arXiv Detail & Related papers (2024-08-31T05:37:01Z) - DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging [65.41765072566287]
We propose $\textbf{Do}$main knowled$\textbf{ge}$ merged $\textbf{R}$eward $\textbf{M}$odel (DogeRM), a novel framework that integrates domain-specific knowledge into a general reward model by model merging.
arXiv Detail & Related papers (2024-07-01T17:01:54Z) - Prototypical Reward Network for Data-Efficient RLHF [17.220998116937444]
A reward model for Reinforcement Learning from Human Feedback (RLHF) has proven effective in fine-tuning Large Language Models (LLMs).
Our proposed framework Proto-RM leverages prototypical networks to enhance reward models under limited human feedback.
arXiv Detail & Related papers (2024-06-06T15:23:30Z) - Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models.
The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models.
Our experiments demonstrate that LLMs finetuned with MRPO generalize better in various preference data, regardless of data scarcity or abundance.
arXiv Detail & Related papers (2024-05-26T00:29:04Z) - RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and code-base for evaluation of reward models.
The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety.
On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
arXiv Detail & Related papers (2024-03-20T17:49:54Z) - Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
arXiv Detail & Related papers (2024-01-11T17:56:59Z) - Loose lips sink ships: Mitigating Length Bias in Reinforcement Learning from Human Feedback [55.78118035358662]
Reinforcement learning from human feedback serves as a crucial bridge, aligning large language models with human and societal values.
We have identified that the reward model often finds shortcuts to bypass its intended objectives.
We propose an innovative solution, applying the Product-of-Experts technique to separate reward modeling from the influence of sequence length.
arXiv Detail & Related papers (2023-10-08T15:14:39Z) - Direct Preference Optimization: Your Language Model is Secretly a Reward Model [119.65409513119963]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form.
The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight.
Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods.
arXiv Detail & Related papers (2023-05-29T17:57:46Z)
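As background for the Direct Preference Optimization entry above, here is a minimal sketch of the DPO objective: the negative log-sigmoid of the scaled difference between the policy-to-reference log-probability ratios of the chosen and rejected responses. The dummy log-probability tensors and the `beta` value are illustrative assumptions; in practice the log-probabilities are sequence log-likelihoods computed by the policy and a frozen reference LLM.

```python
# Minimal sketch of the DPO objective: the policy is optimized directly on
# preference pairs, with an implicit reward defined by log-probability ratios
# against a frozen reference model. Log-probabilities here are dummy tensors;
# in practice they are per-sequence log-likelihoods from the two LLMs.
import torch
import torch.nn.functional as F


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    # Implicit reward of each response: beta * (log pi_theta - log pi_ref).
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    # -log sigmoid(beta * (chosen - rejected)) pushes the policy to prefer the
    # chosen response relative to the reference model.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()


# Dummy per-sequence log-probabilities for a batch of preference pairs.
batch = 4
policy_chosen = torch.randn(batch, requires_grad=True)
policy_rejected = torch.randn(batch, requires_grad=True)
ref_chosen = torch.randn(batch)
ref_rejected = torch.randn(batch)

loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
loss.backward()
print(f"DPO loss: {loss.item():.4f}")
```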