Self-Refine: Iterative Refinement with Self-Feedback
- URL: http://arxiv.org/abs/2303.17651v2
- Date: Thu, 25 May 2023 19:13:47 GMT
- Title: Self-Refine: Iterative Refinement with Self-Feedback
- Authors: Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao,
Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang,
Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck,
Amir Yazdanbakhsh, Peter Clark
- Abstract summary: Self-Refine is an approach for improving initial outputs from large language models (LLMs) through iterative feedback and refinement.
We evaluate Self-Refine across 7 diverse tasks, ranging from dialog response generation to mathematical reasoning, using state-of-the-art (GPT-3.5, ChatGPT, and GPT-4) LLMs.
Our work demonstrates that even state-of-the-art LLMs like GPT-4 can be further improved at test time using our simple, standalone approach.
- Score: 62.78755306241981
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Like humans, large language models (LLMs) do not always generate the best
output on their first try. Motivated by how humans refine their written text,
we introduce Self-Refine, an approach for improving initial outputs from LLMs
through iterative feedback and refinement. The main idea is to generate an
initial output using an LLM; then, the same LLM provides feedback on its own
output and uses that feedback to refine the output, iteratively. Self-Refine does not require
any supervised training data, additional training, or reinforcement learning,
and instead uses a single LLM as the generator, refiner, and feedback provider.
We evaluate Self-Refine across 7 diverse tasks, ranging from dialog response
generation to mathematical reasoning, using state-of-the-art (GPT-3.5, ChatGPT,
and GPT-4) LLMs. Across all evaluated tasks, outputs generated with Self-Refine
are preferred by humans and automatic metrics over those generated with the
same LLM using conventional one-step generation, improving by ~20% absolute on
average in task performance. Our work demonstrates that even state-of-the-art
LLMs like GPT-4 can be further improved at test time using our simple,
standalone approach.
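To make the loop concrete, here is a minimal sketch of the generate-feedback-refine cycle the abstract describes, assuming a generic llm callable and illustrative prompt strings and stopping rule of our own; it is not the authors' exact prompting setup or API.

```python
from typing import Callable

def self_refine(
    llm: Callable[[str], str],  # hypothetical: maps a prompt to a completion
    task_prompt: str,
    max_iters: int = 4,
    stop_phrase: str = "NO FURTHER FEEDBACK",  # assumed stopping signal
) -> str:
    """One LLM acts as generator, feedback provider, and refiner."""
    output = llm(task_prompt)  # initial one-step generation
    for _ in range(max_iters):
        # FEEDBACK step: the same model critiques its own output.
        feedback = llm(
            f"Task:\n{task_prompt}\n\nOutput:\n{output}\n\n"
            f"Give concrete feedback, or say '{stop_phrase}' if no changes are needed."
        )
        if stop_phrase in feedback:
            break  # the model judges its own output good enough
        # REFINE step: the same model rewrites its output using its own feedback.
        output = llm(
            f"Task:\n{task_prompt}\n\nPrevious output:\n{output}\n\n"
            f"Feedback:\n{feedback}\n\nRewrite the output to address the feedback."
        )
    return output
```

With max_iters set to 0 this reduces to the conventional one-step generation the paper uses as its baseline; no supervised data, extra training, or reinforcement learning is involved, only repeated prompting of a single model.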
Related papers
- Cross-Refine: Improving Natural Language Explanation Generation by Learning in Tandem [14.537146664859902]
Like humans, large language models (LLMs) might not always produce optimal explanations on the first attempt.
We introduce Cross-Refine, which employs role modeling by deploying two LLMs as generator and critic, respectively.
The generator outputs a first NLE and then refines this initial explanation using feedback and suggestions provided by the critic.
arXiv Detail & Related papers (2024-09-11T09:21:20Z)
- Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences [11.23629471911503]
EvalGen provides automated assistance to users in generating evaluation criteria and implementing assertions.
A qualitative study finds overall support for EvalGen but underscores the subjectivity and iterative process of alignment.
We identify a phenomenon we dub "criteria drift": users need criteria to grade outputs, but grading outputs helps users define criteria.
arXiv Detail & Related papers (2024-04-18T15:45:27Z)
- Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement [75.7148545929689]
Large language models (LLMs) improve their performance through self-feedback on certain tasks but degrade on others.
We formally define an LLM's self-bias: the tendency to favor its own generation.
We analyze six LLMs on translation, constrained text generation, and mathematical reasoning tasks.
arXiv Detail & Related papers (2024-02-18T03:10:39Z)
- PRE: A Peer Review Based Large Language Model Evaluator [14.585292530642603]
Existing paradigms rely on either human annotators or model-based evaluators to evaluate the performance of LLMs.
We propose a novel framework that can automatically evaluate LLMs through a peer-review process.
arXiv Detail & Related papers (2024-01-28T12:33:14Z)
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN).
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
arXiv Detail & Related papers (2024-01-02T18:53:13Z)
- LLMRefine: Pinpointing and Refining Large Language Models via Fine-Grained Actionable Feedback [65.84061725174269]
Recent large language models (LLMs) leverage human feedback to improve their generation quality.
We propose LLMRefine, an inference-time optimization method to refine an LLM's output.
We conduct experiments on three text generation tasks, including machine translation, long-form question answering (QA), and topical summarization.
LLMRefine consistently outperforms all baseline approaches, achieving improvements of up to 1.7 MetricX points on translation tasks, 8.1 ROUGE-L on ASQA, and 2.2 ROUGE-L on topical summarization.
arXiv Detail & Related papers (2023-11-15T19:52:11Z)
- Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization.
Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z)
- Reinforced Self-Training (ReST) for Language Modeling [56.75447441157628]
Reinforcement learning from human feedback (RLHF) can improve the quality of large language model (LLM) outputs by aligning them with human preferences.
We propose a simple algorithm for aligning LLMs with human preferences, inspired by growing batch reinforcement learning (RL), which we call Reinforced Self-Training (ReST).
Our results show that ReST can substantially improve translation quality in a compute- and sample-efficient manner, as measured by automated metrics and human evaluation on machine translation benchmarks.
arXiv Detail & Related papers (2023-08-17T14:12:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.