Fine-Grained Human Feedback Gives Better Rewards for Language Model
Training
- URL: http://arxiv.org/abs/2306.01693v2
- Date: Mon, 30 Oct 2023 06:34:35 GMT
- Title: Fine-Grained Human Feedback Gives Better Rewards for Language Model
Training
- Authors: Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj
Ammanabrolu, Noah A. Smith, Mari Ostendorf, Hannaneh Hajishirzi
- Abstract summary: Language models (LMs) often exhibit undesirable text generation behaviors, including generating false, toxic, or irrelevant outputs.
We introduce Fine-Grained RLHF, a framework that enables training and learning from reward functions that are fine-grained in two respects.
- Score: 108.25635150124539
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language models (LMs) often exhibit undesirable text generation behaviors,
including generating false, toxic, or irrelevant outputs. Reinforcement
learning from human feedback (RLHF) - where human preference judgments on LM
outputs are transformed into a learning signal - has recently shown promise in
addressing these issues. However, such holistic feedback conveys limited
information on long text outputs; it does not indicate which aspects of the
outputs influenced user preference; e.g., which parts contain what type(s) of
errors. In this paper, we use fine-grained human feedback (e.g., which sentence
is false, which sub-sentence is irrelevant) as an explicit training signal. We
introduce Fine-Grained RLHF, a framework that enables training and learning
from reward functions that are fine-grained in two respects: (1) density,
providing a reward after every segment (e.g., a sentence) is generated; and (2)
incorporating multiple reward models associated with different feedback types
(e.g., factual incorrectness, irrelevance, and information incompleteness). We
conduct experiments on detoxification and long-form question answering to
illustrate how learning with such reward functions leads to improved
performance, supported by both automatic and human evaluation. Additionally, we
show that LM behaviors can be customized using different combinations of
fine-grained reward models. We release all data, collected human feedback, and
codes at https://FineGrainedRLHF.github.io.
Related papers
- Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback [8.601283886845664]
Reinforcement learning from human feedback (RLHF) aligns Large language models (LLMs) with human intentions and values.
Despite its effectiveness and popularity, RLHF is prone to biased local optimization.
We propose a novel textitsequence-to-sequence (seq2seq) reward modeling method.
arXiv Detail & Related papers (2024-08-30T16:14:35Z) - Aligning language models with human preferences [5.0994393083677]
Language models (LMs) trained on vast quantities of text data can acquire sophisticated skills.
They also manifest behaviors that violate human preferences.
I explore several approaches to aligning LMs with human preferences.
arXiv Detail & Related papers (2024-04-18T12:55:18Z) - Beyond Sparse Rewards: Enhancing Reinforcement Learning with Language
Model Critique in Text Generation [29.6763730290473]
Reinforcement learning can align language models with non-differentiable reward signals, such as human preferences.
This paper introduces a novel framework that utilizes the critique capability of Large Language Models to produce intermediate-step rewards.
arXiv Detail & Related papers (2024-01-14T22:05:11Z) - Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
arXiv Detail & Related papers (2024-01-11T17:56:59Z) - The Alignment Ceiling: Objective Mismatch in Reinforcement Learning from
Human Feedback [5.037876196534672]
Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique to make large language models (LLMs) more capable in complex settings.
In this paper, we illustrate the causes of this issue, reviewing relevant literature from model-based reinforcement learning, and argue for solutions.
arXiv Detail & Related papers (2023-10-31T21:52:41Z) - UltraFeedback: Boosting Language Models with Scaled AI Feedback [99.4633351133207]
We present textscUltraFeedback, a large-scale, high-quality, and diversified AI feedback dataset.
Our work validates the effectiveness of scaled AI feedback data in constructing strong open-source chat language models.
arXiv Detail & Related papers (2023-10-02T17:40:01Z) - LeTI: Learning to Generate from Textual Interactions [60.425769582343506]
We explore LMs' potential to learn from textual interactions (LETI) that not only check their correctness with binary labels but also pinpoint and explain errors in their outputs through textual feedback.
Our focus is the code generation task, where the model produces code based on natural language instructions.
LETI iteratively fine-tunes the model, using the objective LM, on a concatenation of natural language instructions, LM-generated programs, and textual feedback.
arXiv Detail & Related papers (2023-05-17T15:53:31Z) - Training Language Models with Language Feedback at Scale [50.70091340506957]
We introduce learning from Language Feedback (ILF), a new approach that utilizes more informative language feedback.
ILF consists of three steps that are applied iteratively: first, conditioning the language model on the input, an initial LM output, and feedback to generate refinements.
We show theoretically that ILF can be viewed as Bayesian Inference, similar to Reinforcement Learning from human feedback.
arXiv Detail & Related papers (2023-03-28T17:04:15Z) - Chain of Hindsight Aligns Language Models with Feedback [62.68665658130472]
We propose a novel technique, Chain of Hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity.
We convert all types of feedback into sequences of sentences, which are then used to fine-tune the model.
By doing so, the model is trained to generate outputs based on feedback, while learning to identify and correct negative attributes or errors.
arXiv Detail & Related papers (2023-02-06T10:28:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.