Training language models to follow instructions with human feedback
- URL: http://arxiv.org/abs/2203.02155v1
- Date: Fri, 4 Mar 2022 07:04:42 GMT
- Title: Training language models to follow instructions with human feedback
- Authors: Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright,
Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John
Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda
Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe
- Abstract summary: We show an avenue for aligning language models with user intent by fine-tuning with human feedback.
InstructGPT models show improvements in truthfulness and reductions in toxic output generation.
- Score: 29.590666996229206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Making language models bigger does not inherently make them better at
following a user's intent. For example, large language models can generate
outputs that are untruthful, toxic, or simply not helpful to the user. In other
words, these models are not aligned with their users. In this paper, we show an
avenue for aligning language models with user intent on a wide range of tasks
by fine-tuning with human feedback. Starting with a set of labeler-written
prompts and prompts submitted through the OpenAI API, we collect a dataset of
labeler demonstrations of the desired model behavior, which we use to fine-tune
GPT-3 using supervised learning. We then collect a dataset of rankings of model
outputs, which we use to further fine-tune this supervised model using
reinforcement learning from human feedback. We call the resulting models
InstructGPT. In human evaluations on our prompt distribution, outputs from the
1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3,
despite having 100x fewer parameters. Moreover, InstructGPT models show
improvements in truthfulness and reductions in toxic output generation while
having minimal performance regressions on public NLP datasets. Even though
InstructGPT still makes simple mistakes, our results show that fine-tuning with
human feedback is a promising direction for aligning language models with human
intent.
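To make the pipeline described above more concrete, the sketch below shows, in PyTorch, the two learned objectives behind its second and third stages: a pairwise ranking loss for a reward model fit on human comparisons, and a KL-regularized reward used during RL fine-tuning (PPO in the paper). All tensors are random placeholders and the KL coefficient is an assumed value; this is a minimal sketch of the idea, not the paper's training code, and it sits on top of stage one, the supervised fine-tuning on labeler demonstrations.
```python
# Minimal sketch (not the paper's code) of the two learned objectives behind an
# InstructGPT-style pipeline: fit a reward model on human comparisons, then
# fine-tune the policy with RL against that reward while penalizing drift from
# the supervised fine-tuned (SFT) model.
import torch
import torch.nn.functional as F

# Stage 2: reward-model loss on a batch of human comparisons. The score of the
# preferred output should exceed the score of the rejected output.
r_preferred = torch.randn(8, requires_grad=True)  # r(x, y_preferred), placeholder
r_rejected = torch.randn(8, requires_grad=True)   # r(x, y_rejected), placeholder
rm_loss = -F.logsigmoid(r_preferred - r_rejected).mean()
rm_loss.backward()  # in practice, backpropagated through the reward model

# Stage 3: reward used during RL fine-tuning: reward-model score minus a KL
# penalty that keeps the policy close to the SFT model.
beta = 0.02                # KL coefficient, assumed for illustration
logp_rl = torch.randn(8)   # log pi_RL(y | x), placeholder
logp_sft = torch.randn(8)  # log pi_SFT(y | x), placeholder
rm_score = torch.randn(8)  # r(x, y), placeholder
shaped_reward = rm_score - beta * (logp_rl - logp_sft)
```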
Related papers
- Improving Classification Performance With Human Feedback: Label a few,
we label the rest [2.7386128680964408]
This paper focuses on understanding how a continuous feedback loop can refine models, thereby enhancing their accuracy, recall, and precision.
We benchmark this approach on the Financial Phrasebank, Banking, Craigslist, TREC, and Amazon Reviews datasets to show that, with just a few labeled examples, we can surpass the accuracy of zero-shot large language models.
arXiv Detail & Related papers (2024-01-17T19:13:05Z) - Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models [115.501751261878]
Fine-tuning language models (LMs) on human-generated data remains a prevalent practice.
We investigate whether we can go beyond human data on tasks where we have access to scalar feedback.
We find that ReST$^{EM}$ scales favorably with model size and significantly surpasses fine-tuning only on human data (a toy sketch of this self-training loop appears after this list).
arXiv Detail & Related papers (2023-12-11T18:17:43Z) - Training Language Models with Language Feedback at Scale [50.70091340506957]
We introduce Imitation learning from Language Feedback (ILF), a new approach that uses more informative language feedback.
ILF consists of three steps, applied iteratively; in the first, the language model is conditioned on the input, an initial LM output, and the feedback in order to generate refinements.
We show theoretically that ILF can be viewed as Bayesian inference, similar to reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-03-28T17:04:15Z) - Chain of Hindsight Aligns Language Models with Feedback [62.68665658130472]
We propose a novel technique, Chain of Hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity.
We convert all types of feedback into sequences of sentences, which are then used to fine-tune the model.
By doing so, the model is trained to generate outputs based on feedback, while learning to identify and correct negative attributes or errors (a toy example of this feedback-to-text conversion appears after this list).
arXiv Detail & Related papers (2023-02-06T10:28:16Z) - Self-Instruct: Aligning Language Models with Self-Generated Instructions [76.42871502364697]
Self-Instruct is a framework for improving the instruction-following capabilities of pretrained language models.
Our pipeline generates instructions, input, and output samples from a language model, then filters invalid or near-duplicate ones before using them to fine-tune the original model (a toy version of this filtering step appears after this list).
For further evaluation, we curate a set of expert-written instructions for novel tasks, and show through human evaluation that tuning GPT-3 with Self-Instruct outperforms using existing public instruction datasets by a large margin.
arXiv Detail & Related papers (2022-12-20T18:59:19Z) - Training Language Models with Natural Language Feedback [51.36137482891037]
We learn from language feedback on model outputs using a three-step learning algorithm.
In synthetic experiments, we first evaluate whether language models accurately incorporate feedback to produce refinements.
Using only 100 samples of human-written feedback, our learning algorithm fine-tunes a GPT-3 model to roughly human-level summarization.
arXiv Detail & Related papers (2022-04-29T15:06:58Z) - TuringAdvice: A Generative and Dynamic Evaluation of Language Use [90.3029315711237]
We propose TuringAdvice, a new challenge task and dataset for language understanding models.
Given a written situation that a real person is currently facing, a model must generate helpful advice in natural language.
Empirical results show that today's models struggle at TuringAdvice.
arXiv Detail & Related papers (2020-04-07T18:00:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented here and is not responsible for any consequences of its use.