Related papers: Loose lips sink ships: Mitigating Length Bias in Reinforcement Learning from Human Feedback

Loose lips sink ships: Mitigating Length Bias in Reinforcement Learning from Human Feedback

URL: http://arxiv.org/abs/2310.05199v5
Date: Wed, 29 Nov 2023 14:45:53 GMT
Title: Loose lips sink ships: Mitigating Length Bias in Reinforcement Learning from Human Feedback
Authors: Wei Shen, Rui Zheng, Wenyu Zhan, Jun Zhao, Shihan Dou, Tao Gui, Qi Zhang, Xuanjing Huang
Abstract summary: Reinforcement learning from human feedback serves as a crucial bridge, aligning large language models with human and societal values. We have identified that the reward model often finds shortcuts to bypass its intended objectives. We propose an innovative solution, applying the Product-of-Experts technique to separate reward modeling from the influence of sequence length.
Score: 55.78118035358662
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning from human feedback serves as a crucial bridge, aligning large language models with human and societal values. This alignment requires a vast corpus of human feedback to learn a reward model, which is subsequently used to finetune language models. However, we have identified that the reward model often finds shortcuts to bypass its intended objectives, misleadingly assuming that humans prefer longer responses. The emergence of length bias often induces the model to favor longer outputs, yet it doesn't equate to an increase in helpful information within these outputs. In this paper, we propose an innovative solution, applying the Product-of-Experts (PoE) technique to separate reward modeling from the influence of sequence length. In our framework, the main expert concentrates on understanding human intents, while the biased expert targets the identification and capture of length bias. To further enhance the learning of bias, we introduce perturbations into the bias-focused expert, disrupting the flow of semantic information. Experimental results validate the effectiveness of our approach, indicating that language model performance is improved, irrespective of sequence length.

Related papers

Disentangling Length Bias In Preference Learning Via Response-Conditioned Modeling [87.17041933863041]
We introduce a Response-conditioned Bradley-Terry (Rc-BT) model that enhances the reward model's capability in length bias mitigating and length instruction following. We also propose the Rc-DPO algorithm to leverage the Rc-BT model for direct policy optimization (DPO) of large language models.
arXiv Detail & Related papers (2025-02-02T14:50:25Z)
Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset. We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
arXiv Detail & Related papers (2024-01-11T17:56:59Z)
When Large Language Models contradict humans? Large Language Models' Sycophantic Behaviour [0.8133739801185272]
We show that Large Language Models (LLMs) show sycophantic tendencies when responding to queries involving subjective opinions and statements. LLMs at various scales seem not to follow the users' hints by demonstrating confidence in delivering the correct answers.
arXiv Detail & Related papers (2023-11-15T22:18:33Z)
A Long Way to Go: Investigating Length Correlations in RLHF [59.49656695716066]
This paper demonstrates, on three diverse settings, that optimizing for response length is a significant factor behind RLHF. We find improvements in reward to largely be driven by increasing response length, instead of other features. Even a purely length-based reward reproduces most downstream RLHF improvements over supervised fine-tuned models.
arXiv Detail & Related papers (2023-10-05T17:38:28Z)
Human Feedback is not Gold Standard [28.63384327791185]
We critically analyse the use of human feedback for both training and evaluation. We find that while preference scores have fairly good coverage, they under-represent important aspects like factuality.
arXiv Detail & Related papers (2023-09-28T11:18:20Z)
Training Language Models with Language Feedback at Scale [50.70091340506957]
We introduce learning from Language Feedback (ILF), a new approach that utilizes more informative language feedback. ILF consists of three steps that are applied iteratively: first, conditioning the language model on the input, an initial LM output, and feedback to generate refinements. We show theoretically that ILF can be viewed as Bayesian Inference, similar to Reinforcement Learning from human feedback.
arXiv Detail & Related papers (2023-03-28T17:04:15Z)
Chain of Hindsight Aligns Language Models with Feedback [62.68665658130472]
We propose a novel technique, Chain of Hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity. We convert all types of feedback into sequences of sentences, which are then used to fine-tune the model. By doing so, the model is trained to generate outputs based on feedback, while learning to identify and correct negative attributes or errors.
arXiv Detail & Related papers (2023-02-06T10:28:16Z)
Causal Disentanglement for Semantics-Aware Intent Learning in Recommendation [30.85573846018658]
We propose an unbiased and semantics-aware disentanglement learning called CaDSI. CaDSI explicitly models the causal relations underlying recommendation task. It produces semantics-aware representations via disentangling users true intents aware of specific item context.
arXiv Detail & Related papers (2022-02-05T15:17:03Z)
Natural Language Inference with a Human Touch: Using Human Explanations to Guide Model Attention [39.41947934589526]
Training with human explanations encourages models to attend more broadly across the sentences. The supervised models attend to words humans believe are important, creating more robust and better performing NLI models.
arXiv Detail & Related papers (2021-04-16T14:45:35Z)
Improving Robustness by Augmenting Training Sentences with Predicate-Argument Structures [62.562760228942054]
Existing approaches to improve robustness against dataset biases mostly focus on changing the training objective. We propose to augment the input sentences in the training data with their corresponding predicate-argument structures. We show that without targeting a specific bias, our sentence augmentation improves the robustness of transformer models against multiple biases.
arXiv Detail & Related papers (2020-10-23T16:22:05Z)

This list is automatically generated from the titles and abstracts of the papers in this site.