The ART of LLM Refinement: Ask, Refine, and Trust
- URL: http://arxiv.org/abs/2311.07961v1
- Date: Tue, 14 Nov 2023 07:26:32 GMT
- Title: The ART of LLM Refinement: Ask, Refine, and Trust
- Authors: Kumar Shridhar, Koustuv Sinha, Andrew Cohen, Tianlu Wang, Ping Yu, Ram
Pasunuru, Mrinmaya Sachan, Jason Weston, Asli Celikyilmaz
- Abstract summary: We propose a reasoning with refinement objective called ART: Ask, Refine, and Trust.
It asks necessary questions to decide when an LLM should refine its output.
It achieves a performance gain of +5 points over self-refinement baselines.
- Score: 85.75059530612882
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, Large Language Models (LLMs) have demonstrated remarkable
generative abilities, but can they judge the quality of their own generations?
A popular concept, referred to as self-refinement, postulates that LLMs can
detect and correct the errors in their generations when asked to do so.
However, recent empirical evidence points in the opposite direction, suggesting
that LLMs often struggle to accurately identify errors when reasoning is
involved. To address this, we propose a reasoning with refinement objective
called ART: Ask, Refine, and Trust, which asks necessary questions to decide
when an LLM should refine its output, and either affirm or withhold trust in
its refinement by ranking the refinement and the initial prediction. On two
multistep reasoning tasks of mathematical word problems (GSM8K) and question
answering (StrategyQA), ART achieves a performance gain of +5 points over
self-refinement baselines, while using a much smaller model as the decision
maker. We also demonstrate the benefit of using smaller models to make
refinement decisions as a cost-effective alternative to fine-tuning a larger
model.
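The abstract reads as a three-stage decision procedure: ask whether refinement is needed, refine, then decide whether to trust the refinement by ranking it against the initial prediction. The sketch below shows one way such a loop could be wired up; the `generate`, `asker`, and `ranker` callables and the prompt wording are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable

def art_pipeline(problem: str,
                 generate: Callable[[str], str],
                 asker: Callable[[str, str], list[str]],
                 ranker: Callable[[str, str, str], float]) -> str:
    """Illustrative Ask-Refine-Trust loop (not the paper's implementation).

    generate: the main LLM, maps a prompt to an answer.
    asker:    a (possibly smaller) decision model that produces clarifying
              sub-questions given the problem and an initial answer; an
              empty list is read as "no refinement needed".
    ranker:   compares the refined and initial answers for the problem and
              returns a positive value when the refinement should be trusted.
    """
    # 1. Initial prediction.
    initial = generate(problem)

    # 2. Ask: decide whether refinement is needed by posing sub-questions.
    questions = asker(problem, initial)
    if not questions:
        return initial  # the decision model sees no reason to refine

    # 3. Refine: condition the LLM on the sub-questions and its first attempt.
    refine_prompt = (
        f"Problem: {problem}\n"
        f"Previous answer: {initial}\n"
        "Consider these questions before answering again:\n"
        + "\n".join(f"- {q}" for q in questions)
    )
    refined = generate(refine_prompt)

    # 4. Trust: keep whichever of the two answers ranks higher instead of
    #    accepting the refinement blindly.
    return refined if ranker(problem, refined, initial) > 0 else initial
```

The point of departure from plain self-refinement is the final step: the refinement is only kept when it outranks the original answer.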
Related papers
- Large Language Models have Intrinsic Self-Correction Ability [16.831123666582755]
Large language models suffer from hallucinations that cause performance degradation.
One promising way to improve an LLM's performance is to ask it to revise its answer after generation.
Intrinsic self-correction is considered a promising direction because it does not rely on external knowledge.
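Intrinsic self-correction of this kind amounts to a prompt-only revise loop. A minimal sketch, assuming a hypothetical `generate` callable and ad-hoc prompt wording:

```python
from typing import Callable

def intrinsic_self_correct(question: str,
                           generate: Callable[[str], str],
                           rounds: int = 1) -> str:
    """Prompt-only self-correction: the model revises its own answer
    without any external knowledge or feedback (illustrative sketch)."""
    answer = generate(question)
    for _ in range(rounds):
        revise_prompt = (
            f"Question: {question}\n"
            f"Your previous answer: {answer}\n"
            "Review your answer for mistakes and give a corrected answer."
        )
        answer = generate(revise_prompt)
    return answer
```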
arXiv Detail & Related papers (2024-06-21T22:29:40Z)
- SELF-[IN]CORRECT: LLMs Struggle with Discriminating Self-Generated Responses [49.148206387394936]
We show that models are not reliably better at discriminating among their previously generated alternatives than at generating initial responses.
This finding challenges the notion that LLMs can enhance their performance solely through their own judgment.
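The comparison behind this claim can be framed as two accuracies measured on the same model: one for generating an answer directly, one for picking the correct answer out of the model's own candidates. A rough sketch of such a protocol, with hypothetical `generate`, `sample`, and `choose` callables (not the paper's exact setup):

```python
from typing import Callable, Sequence

def generate_vs_discriminate(questions: Sequence[str],
                             answers: Sequence[str],
                             generate: Callable[[str], str],
                             sample: Callable[[str, int], list[str]],
                             choose: Callable[[str, list[str]], str]) -> tuple[float, float]:
    """Compare generation accuracy with discrimination accuracy
    (illustrative protocol only).

    sample:  draws k candidate answers from the model for a question.
    choose:  asks the same model to pick the best candidate from a list.
    """
    gen_hits = disc_hits = 0
    for q, gold in zip(questions, answers):
        # Phase 1: ordinary generation (exact-match scoring for simplicity).
        gen_hits += generate(q).strip() == gold
        # Phase 2: discrimination over the model's own candidates.
        candidates = sample(q, 4)
        disc_hits += choose(q, candidates).strip() == gold
    n = len(questions)
    return gen_hits / n, disc_hits / n
```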
arXiv Detail & Related papers (2024-04-04T20:27:37Z)
- Small Models, Big Insights: Leveraging Slim Proxy Models To Decide When and What to Retrieve for LLMs [60.40396361115776]
This paper introduces a novel collaborative approach, namely SlimPLM, that detects missing knowledge in large language models (LLMs) with a slim proxy model.
We employ a proxy model with far fewer parameters and take its answers as heuristic answers.
Heuristic answers are then utilized to predict the knowledge required to answer the user question, as well as the known and unknown knowledge within the LLM.
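One way to picture this idea: the proxy's heuristic answer acts as a cheap probe, and retrieval is triggered, and targeted, only where that probe looks insufficient. The sketch below is illustrative; `proxy_generate`, `looks_sufficient`, `make_queries`, `retrieve`, and `llm_generate` are hypothetical components, not the paper's.

```python
from typing import Callable

def slim_proxy_retrieval(question: str,
                         proxy_generate: Callable[[str], str],
                         looks_sufficient: Callable[[str, str], bool],
                         make_queries: Callable[[str, str], list[str]],
                         retrieve: Callable[[str], str],
                         llm_generate: Callable[[str], str]) -> str:
    """Proxy-guided retrieval in the spirit of SlimPLM (illustrative only)."""
    # A small proxy model drafts a cheap "heuristic" answer first.
    heuristic = proxy_generate(question)

    if looks_sufficient(question, heuristic):
        # The knowledge appears to be present: answer without retrieval.
        return llm_generate(question)

    # Otherwise, use the heuristic answer to decide *what* to retrieve:
    # queries target the claims the proxy seems unsure or wrong about.
    queries = make_queries(question, heuristic)
    context = "\n".join(retrieve(q) for q in queries)
    return llm_generate(f"Context:\n{context}\n\nQuestion: {question}")
```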
arXiv Detail & Related papers (2024-02-19T11:11:08Z)
- Measuring Moral Inconsistencies in Large Language Models [16.47371312298185]
A Large Language Model (LLM) is considered consistent if semantically equivalent prompts produce semantically equivalent responses.
We show that even state-of-the-art LLMs are highly inconsistent in their generations, questioning their reliability.
We propose a novel information-theoretic measure called Semantic Graph Entropy (SGE) to measure the consistency of an LLM in moral scenarios.
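SGE itself is not defined in this summary, so the sketch below uses a much simpler stand-in for the same notion of consistency: sample responses to semantically equivalent prompts and compute the entropy of their meanings, with a hypothetical `cluster_id` helper grouping responses by meaning.

```python
import math
from collections import Counter
from typing import Callable, Sequence

def response_entropy(paraphrases: Sequence[str],
                     generate: Callable[[str], str],
                     cluster_id: Callable[[str], int]) -> float:
    """Entropy of response meanings across semantically equivalent prompts.

    This is NOT the paper's Semantic Graph Entropy; it is a simplified
    consistency proxy. `cluster_id` is a hypothetical helper mapping a
    response to a semantic-equivalence cluster (e.g. via embeddings).
    0.0 means every paraphrase yields the same meaning; higher is less
    consistent.
    """
    counts = Counter(cluster_id(generate(p)) for p in paraphrases)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```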
arXiv Detail & Related papers (2024-01-26T18:05:47Z)
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
- A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning.
Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z)
- Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations better reveal a language model's overall grasp of language, in particular its proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
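A reward-model-based diagnostic of this kind can be pictured as measuring how much the score drops when everyday word-level noise is added to the question. The sketch below is an assumption-laden illustration, not the paper's evaluation procedure; `perturb`, `generate`, and `reward` are hypothetical callables.

```python
from typing import Callable, Sequence

def perturbation_gap(questions: Sequence[str],
                     perturb: Callable[[str], str],
                     generate: Callable[[str], str],
                     reward: Callable[[str, str], float]) -> float:
    """Average drop in reward-model score under small word-level
    perturbations (illustrative diagnostic only).

    perturb: introduces a typo, synonym swap, or similar everyday noise.
    reward:  a pre-trained reward model scoring (question, response) pairs.
    """
    gaps = []
    for q in questions:
        clean_score = reward(q, generate(q))
        noisy_q = perturb(q)
        noisy_score = reward(noisy_q, generate(noisy_q))
        gaps.append(clean_score - noisy_score)
    return sum(gaps) / len(gaps)
```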
arXiv Detail & Related papers (2023-09-20T09:23:46Z)