Can Large Language Models Really Improve by Self-critiquing Their Own Plans?
- URL: http://arxiv.org/abs/2310.08118v1
- Date: Thu, 12 Oct 2023 08:22:37 GMT
- Title: Can Large Language Models Really Improve by Self-critiquing Their Own Plans?
- Authors: Karthik Valmeekam, Matthew Marquez, Subbarao Kambhampati
- Abstract summary: We investigate the verification/self-critiquing abilities of large language models in the context of planning.
Using GPT-4, a state-of-the-art LLM, for both generation and verification, our findings reveal that self-critiquing appears to diminish plan generation performance.
- Score: 19.476470154121188
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There have been widespread claims about Large Language Models (LLMs) being
able to successfully verify or self-critique their candidate solutions in
reasoning problems in an iterative mode. Intrigued by those claims, in this
paper we set out to investigate the verification/self-critiquing abilities of
large language models in the context of planning. We evaluate a planning system
that employs LLMs for both plan generation and verification. We assess the
verifier LLM's performance against ground-truth verification, the impact of
self-critiquing on plan generation, and the influence of varying feedback
levels on system performance. Using GPT-4, a state-of-the-art LLM, for both
generation and verification, our findings reveal that self-critiquing appears
to diminish plan generation performance, especially when compared to systems
with external, sound verifiers, and that the LLM verifiers in that system produce
a notable number of false positives, compromising the system's reliability.
Additionally, the nature of feedback, whether binary or detailed, showed
minimal impact on plan generation. Collectively, our results cast doubt on the
effectiveness of LLMs in a self-critiquing, iterative framework for planning
tasks.
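As a reading aid, the loop the abstract describes can be sketched in a few lines: the LLM drafts a plan, a verifier (either the same LLM critiquing itself or an external, sound plan validator) judges it, and the feedback is folded into the next generation attempt. The sketch below is a minimal illustration under stated assumptions: the `chat`, `llm_self_critique`, and `iterative_planning` names, the `feedback_level` switch, and the prompt wording are hypothetical, not the authors' implementation.

```python
from typing import Callable, Optional, Tuple


def chat(prompt: str) -> str:
    """Placeholder for a GPT-4 (or other LLM) completion call."""
    raise NotImplementedError("wire this up to your LLM client")


def llm_self_critique(problem: str, plan: str,
                      feedback_level: str = "binary") -> Tuple[bool, str]:
    """Ask the same LLM to verify its own candidate plan.

    feedback_level="binary" asks only for valid/invalid;
    feedback_level="detailed" also asks which action or goal fails.
    """
    detail = ("" if feedback_level == "binary"
              else " If invalid, name the first failing action or unmet goal.")
    reply = chat(
        f"Problem:\n{problem}\n\nCandidate plan:\n{plan}\n\n"
        f"Is this plan valid? Answer 'valid' or 'invalid'.{detail}"
    )
    return reply.strip().lower().startswith("valid"), reply


def iterative_planning(
    problem: str,
    verify: Callable[[str, str], Tuple[bool, str]],
    max_rounds: int = 10,
) -> Optional[str]:
    """Generate a plan, then repeatedly critique and re-prompt until accepted.

    `verify` can be the LLM self-critique above or a sound external plan
    validator; the abstract's comparison hinges on which of the two is used.
    """
    plan = chat(f"Problem:\n{problem}\n\nProduce a plan, one action per line.")
    for _ in range(max_rounds):
        accepted, feedback = verify(problem, plan)
        if accepted:
            # With an LLM verifier this acceptance may be a false positive.
            return plan
        plan = chat(
            f"Problem:\n{problem}\n\nYour previous plan:\n{plan}\n\n"
            f"It was judged invalid:\n{feedback}\n\nProduce a corrected plan."
        )
    return None  # no plan accepted within the iteration budget
```

Passing `llm_self_critique` versus a sound external validator as the `verify` argument mirrors the comparison drawn in the abstract, and the `feedback_level` switch corresponds to the binary-versus-detailed feedback condition.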
Related papers
- Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making [85.24399869971236]
We aim to evaluate Large Language Models (LLMs) for embodied decision making.
Existing evaluations tend to rely solely on a final success rate.
We propose a generalized interface (Embodied Agent Interface) that supports the formalization of various types of tasks.
arXiv Detail & Related papers (2024-10-09T17:59:00Z)
- Can We Rely on LLM Agents to Draft Long-Horizon Plans? Let's Take TravelPlanner as an Example [3.102303947219617]
Large language models (LLMs) have brought autonomous agents closer to artificial general intelligence (AGI).
We present our study using a realistic benchmark, TravelPlanner, where an agent must meet multiple constraints to generate accurate plans.
arXiv Detail & Related papers (2024-08-12T17:39:01Z)
- Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks.
LLMs are prone to producing errors, hallucinations, and inconsistent statements when performing multi-step reasoning.
We introduce Q*, a framework for guiding the LLM decoding process with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z)
- Exploring and Benchmarking the Planning Capabilities of Large Language Models [57.23454975238014]
This work lays the foundations for improving the planning capabilities of large language models (LLMs).
We construct a comprehensive benchmark suite encompassing both classical planning benchmarks and natural language scenarios.
We investigate the use of many-shot in-context learning to enhance LLM planning, exploring the relationship between increased context length and improved planning performance.
arXiv Detail & Related papers (2024-06-18T22:57:06Z)
- Evaluating Uncertainty-based Failure Detection for Closed-Loop LLM Planners [10.746821861109176]
Large Language Models (LLMs) have witnessed remarkable performance as zero-shot task planners for robotic tasks.
However, the open-loop nature of previous works makes LLM-based planning error-prone and fragile.
In this work, we introduce KnowLoop, a framework for closed-loop LLM-based planning backed by an uncertainty-based MLLM failure detector.
arXiv Detail & Related papers (2024-06-01T12:52:06Z)
- Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration [70.09561665520043]
We propose a novel framework for multi-agent collaboration that introduces Reinforced Advantage feedback (ReAd) for efficient self-refinement of plans.
We provide theoretical analysis by extending advantage-weighted regression in reinforcement learning to multi-agent systems.
Experiments on Overcooked-AI and a difficult variant of RoCoBench show that ReAd surpasses baselines in success rate and significantly decreases the number of agent interaction steps.
arXiv Detail & Related papers (2024-05-23T08:33:19Z)
- On the Self-Verification Limitations of Large Language Models on Reasoning and Planning Tasks [17.329365493094542]
We present a principled empirical study of the performance of GPT-4 in three domains: Game of 24, Graph Coloring, and STRIPS planning.
We observe significant performance collapse with self-critique and significant performance gains with sound external verification.
arXiv Detail & Related papers (2024-02-12T23:11:01Z)
- Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies [104.32199881187607]
Large language models (LLMs) have demonstrated remarkable performance across a wide array of NLP tasks.
However, their outputs can still exhibit flaws such as hallucinations and inconsistent reasoning. A promising approach to rectify these flaws is self-correction, where the LLM itself is prompted or guided to fix problems in its own output.
This paper presents a comprehensive review of this emerging class of techniques.
arXiv Detail & Related papers (2023-08-06T18:38:52Z)
- On the Planning Abilities of Large Language Models: A Critical Investigation [34.262740442260515]
We evaluate the effectiveness of LLMs in generating plans autonomously in commonsense planning tasks.
In the LLM-Modulo setting, we demonstrate that LLM-generated plans can improve the search process for underlying sound planners.
arXiv Detail & Related papers (2023-05-25T06:32:23Z)
- On the Risk of Misinformation Pollution with Large Language Models [127.1107824751703]
We investigate the potential misuse of modern Large Language Models (LLMs) for generating credible-sounding misinformation.
Our study reveals that LLMs can act as effective misinformation generators, leading to a significant degradation in the performance of Open-Domain Question Answering (ODQA) systems.
arXiv Detail & Related papers (2023-05-23T04:10:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.