Chain of Hindsight Aligns Language Models with Feedback
- URL: http://arxiv.org/abs/2302.02676v8
- Date: Wed, 18 Oct 2023 07:11:12 GMT
- Title: Chain of Hindsight Aligns Language Models with Feedback
- Authors: Hao Liu, Carmelo Sferrazza, Pieter Abbeel
- Abstract summary: We propose a novel technique, Chain of Hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity.
We convert all types of feedback into sequences of sentences, which are then used to fine-tune the model.
By doing so, the model is trained to generate outputs based on feedback, while learning to identify and correct negative attributes or errors.
- Score: 62.68665658130472
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning from human preferences is important for language models to match
human needs and to align with human and social values. Prior works have
achieved remarkable successes by learning from human feedback to understand and
follow instructions. Nonetheless, these methods are either founded on
hand-picked model generations that are favored by human annotators, rendering
them inefficient in terms of data utilization and challenging to apply in
general, or they depend on reinforcement learning, which often suffers from
imperfect reward functions and relies on extremely challenging optimization.
In this work, we propose a novel technique, Chain of Hindsight, that is easy to
optimize and can learn from any form of feedback, regardless of its polarity.
Our idea is inspired by how humans learn from extensive feedback presented in
the form of language. We convert all types of feedback into sequences of
sentences, which are then used to fine-tune the model, allowing us to take
advantage of the language comprehension capabilities of language models. We
condition the model on a sequence of model generations paired with feedback. By
doing so, the model is trained to generate outputs based on feedback, while
learning to identify and correct negative attributes or errors. Applying our
method to large language models, we observed that Chain of Hindsight
significantly surpasses previous methods in aligning language models with human
preferences. We report significant improvements on summarization and dialogue
benchmarks, with our approach markedly preferred in human evaluations.
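The recipe described in the abstract (pair model outputs with natural-language feedback, fine-tune on the resulting sequences, and compute the loss only on the model outputs) can be illustrated with a minimal sketch. The template wording, the FeedbackExample fields, and the span-based loss masking below are illustrative assumptions, not the paper's exact implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FeedbackExample:
    prompt: str     # task instruction plus source text
    preferred: str  # output favored by human feedback
    rejected: str   # output disfavored by human feedback


def build_hindsight_sequence(ex: FeedbackExample) -> Tuple[str, List[Tuple[int, int]]]:
    """Turn one preference pair into a single training sequence.

    Returns the text plus character spans covering the model outputs, so a
    fine-tuning loss can be restricted to output tokens while the prompt and
    the hindsight-feedback phrases are only conditioned on.
    """
    text = ex.prompt + "\nA less helpful answer: "
    spans = [(len(text), len(text) + len(ex.rejected))]
    text += ex.rejected + "\nA more helpful answer: "
    spans.append((len(text), len(text) + len(ex.preferred)))
    text += ex.preferred
    return text, spans


if __name__ == "__main__":
    ex = FeedbackExample(
        prompt="Summarize: The city council approved the new transit budget after a two-hour debate.",
        preferred="The council approved the transit budget after a lengthy debate.",
        rejected="A meeting happened.",
    )
    text, output_spans = build_hindsight_sequence(ex)
    print(text)
    print("loss restricted to character spans:", output_spans)
```

In a full training pipeline these character spans would be mapped to token positions, so that feedback and prompt tokens are conditioned on but never predicted.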
Related papers
- SpeechAlign: Aligning Speech Generation to Human Preferences [51.684183257809075]
We introduce SpeechAlign, an iterative self-improvement strategy that aligns speech language models to human preferences.
We show that SpeechAlign can bridge the distribution gap and facilitate continuous self-improvement of the speech language model.
arXiv Detail & Related papers (2024-04-08T15:21:17Z)
- Is Crowdsourcing Breaking Your Bank? Cost-Effective Fine-Tuning of Pre-trained Language Models with Proximal Policy Optimization [18.75866961339424]
ChatGPT has highlighted the potential of reinforcement learning from human feedback.
To reduce labor costs, we propose a self-supervised text ranking approach.
arXiv Detail & Related papers (2024-02-28T12:24:07Z)
- Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation [68.9440575276396]
This survey aims to provide an overview of the recent research that has leveraged human feedback to improve natural language generation.
First, we introduce an encompassing formalization of feedback, and identify and organize existing research into a taxonomy following this formalization.
Second, we discuss how feedback can be described by its format and objective, and cover the two approaches proposed to use feedback (either for training or decoding): directly using the feedback or training feedback models.
Third, we provide an overview of the nascent field of AI feedback, which exploits large language models to make judgments based on a set of principles and minimize the need for human intervention.
arXiv Detail & Related papers (2023-05-01T17:36:06Z)
- Training Language Models with Language Feedback at Scale [50.70091340506957]
We introduce Imitation learning from Language Feedback (ILF), a new approach that utilizes more informative language feedback.
ILF consists of three steps that are applied iteratively: first, conditioning the language model on the input, an initial LM output, and feedback to generate refinements; second, selecting the refinement that incorporates the most feedback; and third, fine-tuning the language model on the chosen refinement given the input (see the sketch after this list).
We show theoretically that ILF can be viewed as Bayesian inference, similar to reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-03-28T17:04:15Z)
- Robust Preference Learning for Storytelling via Contrastive Reinforcement Learning [53.92465205531759]
Controlled automated story generation seeks to generate natural language stories satisfying constraints from natural language critiques or preferences.
We train a contrastive bi-encoder model to align stories with human critiques, building a general purpose preference model.
We further fine-tune the contrastive reward model using a prompt-learning technique to increase story generation robustness.
arXiv Detail & Related papers (2022-10-14T13:21:33Z)
- Training Language Models with Natural Language Feedback [51.36137482891037]
We learn from language feedback on model outputs using a three-step learning algorithm.
In synthetic experiments, we first evaluate whether language models accurately incorporate feedback to produce refinements.
Using only 100 samples of human-written feedback, our learning algorithm finetunes a GPT-3 model to roughly human-level summarization.
arXiv Detail & Related papers (2022-04-29T15:06:58Z)
- Natural Language Inference with a Human Touch: Using Human Explanations to Guide Model Attention [39.41947934589526]
Training with human explanations encourages models to attend more broadly across the sentences.
The supervised models attend to words humans believe are important, creating more robust and better performing NLI models.
arXiv Detail & Related papers (2021-04-16T14:45:35Z)
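The language-feedback papers listed above ("Training Language Models with Language Feedback at Scale" and "Training Language Models with Natural Language Feedback") both describe a refine-then-finetune loop of the kind sketched below. This is a minimal sketch under stated assumptions: the function names sample_refinements, pick_best_refinement, and finetune_on are hypothetical placeholders, not APIs from those papers' codebases.

```python
from typing import Callable, List, Tuple

# (input text, initial model output, natural-language feedback)
Example = Tuple[str, str, str]


def ilf_round(
    examples: List[Example],
    sample_refinements: Callable[[str, str, str], List[str]],
    pick_best_refinement: Callable[[str, str, List[str]], str],
    finetune_on: Callable[[List[Tuple[str, str]]], None],
) -> None:
    """One iteration of an ILF-style refine-then-finetune loop."""
    training_pairs: List[Tuple[str, str]] = []
    for inp, initial_output, feedback in examples:
        # Step 1: condition the LM on the input, its initial output, and the
        # feedback to propose several candidate refinements.
        candidates = sample_refinements(inp, initial_output, feedback)
        # Step 2: keep the candidate that best incorporates the feedback.
        best = pick_best_refinement(inp, feedback, candidates)
        training_pairs.append((inp, best))
    # Step 3: supervised fine-tuning on the selected refinements.
    finetune_on(training_pairs)
```

Iterating this round lets the refinements selected from one generation of the model become the fine-tuning data for the next.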