On the Planning Abilities of Large Language Models (A Critical
Investigation with a Proposed Benchmark)
- URL: http://arxiv.org/abs/2302.06706v1
- Date: Mon, 13 Feb 2023 21:37:41 GMT
- Title: On the Planning Abilities of Large Language Models (A Critical
Investigation with a Proposed Benchmark)
- Authors: Karthik Valmeekam, Sarath Sreedharan, Matthew Marquez, Alberto Olmo,
Subbarao Kambhampati
- Abstract summary: We develop a benchmark suite based on the kinds of domains employed in the International Planning Competition.
We evaluate LLMs in three modes: autonomous, heuristic, and human-in-the-loop.
Our results show that LLMs' ability to autonomously generate executable plans is quite meager, averaging only about a 3% success rate.
- Score: 30.223130782579336
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Intrigued by the claims of emergent reasoning capabilities in LLMs trained on
general web corpora, in this paper, we set out to investigate their planning
capabilities. We aim to evaluate (1) how good LLMs are by themselves in
generating and validating simple plans in commonsense planning tasks (of the
type that humans are generally quite good at) and (2) how good LLMs are in
being a source of heuristic guidance for other agents--either AI planners or
human planners--in their planning tasks. To investigate these questions in a
systematic rather than anecdotal manner, we start by developing a benchmark
suite based on the kinds of domains employed in the International Planning
Competition. On this benchmark, we evaluate LLMs in three modes: autonomous,
heuristic and human-in-the-loop. Our results show that LLMs' ability to
autonomously generate executable plans is quite meager, averaging only about a
3% success rate. The heuristic and human-in-the-loop modes show slightly more
promise. In addition to these results, we also make our benchmark and
evaluation tools available to support investigations by the research community.
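To make the autonomous evaluation mode concrete, here is a minimal, self-contained sketch of the pass/fail judgment behind the reported success rate: execute a candidate plan and check whether it reaches the goal. This toy Blocksworld simulator, its hardcoded task, and the hardcoded "LLM plan" are illustrative stand-ins, not the paper's harness, which uses IPC-style PDDL domains and an external plan validator.

```python
# Toy Blocksworld simulator used to judge a candidate plan: a plan
# "succeeds" only if every move is legal and the goal configuration is
# reached. All names here are illustrative, not the paper's code.

def top_blocks(stacks):
    """Blocks that are clear (on top of some stack)."""
    return {s[-1] for s in stacks if s}

def apply_move(stacks, block, dest):
    """Move `block` onto `dest` ('table' or another clear block).
    Returns the new state, or None if the move is illegal."""
    stacks = [list(s) for s in stacks]
    tops = top_blocks(stacks)
    if block not in tops or dest == block or (dest != "table" and dest not in tops):
        return None  # something is stacked on top, or the move is nonsense
    src = next(s for s in stacks if s and s[-1] == block)
    src.pop()
    if dest == "table":
        stacks.append([block])
    else:
        next(s for s in stacks if s and s[-1] == dest).append(block)
    return [s for s in stacks if s]  # drop emptied stacks

def plan_succeeds(init, goal, plan):
    """Execute a plan (a list of (block, dest) moves) and check the goal."""
    state = init
    for block, dest in plan:
        state = apply_move(state, block, dest)
        if state is None:
            return False  # inexecutable action: the plan fails outright
    return sorted(map(tuple, state)) == sorted(map(tuple, goal))

# Toy task: B sits on A; the goal is A on B.
init, goal = [["A", "B"]], [["B", "A"]]
llm_plan = [("B", "table"), ("A", "B")]  # a plan as an LLM might emit it
print(plan_succeeds(init, goal, llm_plan))  # True -> counts as one success
```

Averaging this boolean over all benchmark instances is what yields a figure like the roughly 3% success rate quoted above.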
Related papers
- MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia.
In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models.
This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z)
- Can We Rely on LLM Agents to Draft Long-Horizon Plans? Let's Take TravelPlanner as an Example [3.102303947219617]
Large language models (LLMs) have brought autonomous agents closer to artificial general intelligence (AGI).
We present our study using a realistic benchmark, TravelPlanner, where an agent must meet multiple constraints to generate accurate plans.
arXiv Detail & Related papers (2024-08-12T17:39:01Z)
- WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks [85.95607119635102]
Large language models (LLMs) can mimic human-like intelligence.
WorkArena++ is designed to evaluate the planning, problem-solving, logical/arithmetic reasoning, retrieval, and contextual understanding abilities of web agents.
arXiv Detail & Related papers (2024-07-07T07:15:49Z)
- Exploring and Benchmarking the Planning Capabilities of Large Language Models [57.23454975238014]
This work lays the foundations for improving the planning capabilities of large language models (LLMs).
We construct a comprehensive benchmark suite encompassing both classical planning benchmarks and natural language scenarios.
We investigate the use of many-shot in-context learning to enhance LLM planning, exploring the relationship between increased context length and improved planning performance (a prompt-builder sketch follows this entry).
arXiv Detail & Related papers (2024-06-18T22:57:06Z)
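The prompt-builder sketch promised above: many-shot in-context learning here amounts to packing solved planning instances into the prompt ahead of the query problem, then sweeping the number of exemplars. A minimal sketch assuming a plain-text template; the paper's exact prompt format is not reproduced.

```python
# Hypothetical many-shot prompt builder: k solved (problem, plan) pairs
# are concatenated before the query. Growing k grows the context length,
# which is the variable the study above relates to planning performance.

def build_many_shot_prompt(exemplars, query_problem, k):
    parts = ["Solve the final planning problem. Worked examples:"]
    for problem, plan in exemplars[:k]:
        parts.append(f"Problem:\n{problem}\nPlan:\n{plan}")
    parts.append(f"Problem:\n{query_problem}\nPlan:")
    return "\n\n".join(parts)

# Usage: evaluate the same model at k = 1, 10, 100, ... and compare
# plan validity rates to chart context length against performance.
exemplars = [("stack A on B ...", "(unstack B A) (stack A B)")] * 100
prompt = build_many_shot_prompt(exemplars, "stack C on A ...", k=50)
```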
- Agent Planning with World Knowledge Model [88.4897773735576]
We introduce a parametric World Knowledge Model (WKM) to facilitate agent planning.
We develop WKM to provide prior task knowledge that guides global planning and dynamic state knowledge that assists local planning (a schematic sketch follows this entry).
Our method can achieve superior performance compared to various strong baselines.
arXiv Detail & Related papers (2024-05-23T06:03:19Z)
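The schematic sketch promised in the WKM entry above. WKM itself is a trained parametric model; `task_knowledge` and `state_knowledge` below are hypothetical stand-ins for its two outputs, showing only how global and local guidance would enter an agent's step.

```python
# Schematic agent step in the spirit of WKM: prior task knowledge shapes
# the global plan once, while dynamic state knowledge is injected at each
# local step. Both knowledge functions are illustrative placeholders.

def task_knowledge(task: str) -> str:
    """Prior, task-level knowledge (e.g., subgoal ordering hints)."""
    return f"Hints for '{task}': finish prerequisites before the main goal."

def state_knowledge(state: str) -> str:
    """Dynamic, state-level knowledge relevant to the current step."""
    return f"Constraints currently active in state '{state}'."

def agent_step(task, state, llm):
    """llm is any text-in/text-out callable; returns the next action."""
    prompt = (f"Task: {task}\n"
              f"Global guidance: {task_knowledge(task)}\n"
              f"State: {state}\n"
              f"Local guidance: {state_knowledge(state)}\n"
              f"Next action:")
    return llm(prompt)
```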
- Understanding the planning of LLM agents: A survey [98.82513390811148]
This survey provides the first systematic view of planning by LLM-based agents, covering recent work aimed at improving planning ability.
Comprehensive analyses are conducted for each direction, and remaining challenges in this area of research are discussed.
arXiv Detail & Related papers (2024-02-05T04:25:24Z)
- SayCanPay: Heuristic Planning with Large Language Models using Learnable Domain Knowledge [14.024233628092167]
Large Language Models (LLMs) have demonstrated impressive planning abilities due to their vast "world knowledge".
Yet, obtaining plans that are both feasible (grounded in affordances) and cost-effective (in plan length) remains a challenge, despite recent progress.
This contrasts with planning methods that employ domain knowledge (formalized in action models such as PDDL) and search to generate feasible, optimal plans (a scoring sketch follows this entry).
arXiv Detail & Related papers (2023-08-24T09:47:28Z)
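The scoring sketch promised in the SayCanPay entry above. It assumes a three-factor decomposition matching the title (an LLM "say" likelihood, a feasibility "can" score, a payoff "pay" estimate); combining them as a product is an illustrative assumption, and the learned models are abstracted into callables.

```python
# Hypothetical SayCanPay-style action selection. say/can/pay are callables
# mapping an action to a score in [0, 1]; in the paper the feasibility and
# payoff estimates are learned, while here they are abstract parameters.

def score_action(action, say, can, pay):
    # Product combination (an assumption): an action must be likely,
    # feasible given affordances, AND useful toward the goal to score high.
    return say(action) * can(action) * pay(action)

def choose_next(candidates, say, can, pay):
    """Greedy selection of the best-scoring candidate action."""
    return max(candidates, key=lambda a: score_action(a, say, can, pay))
```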
- On the Planning Abilities of Large Language Models: A Critical Investigation [34.262740442260515]
We evaluate the effectiveness of LLMs in generating plans autonomously in commonsense planning tasks.
In the LLM-Modulo setting, we demonstrate that LLM-generated plans can improve the search process for underlying sound planners (a seeding sketch follows this entry).
arXiv Detail & Related papers (2023-05-25T06:32:23Z)
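The seeding sketch promised in the entry above: let the LLM propose a plan cheaply, then hand it to a sound planner as the starting point for its search. The interfaces here are hypothetical; soundness comes entirely from the planner, not the LLM.

```python
# Hypothetical LLM-Modulo pipeline: the LLM's (possibly invalid) plan seeds
# a sound planner's search, which repairs or extends it into a valid plan.

def llm_modulo_plan(problem, llm_propose, sound_planner):
    """llm_propose: problem -> candidate plan text (no guarantees).
    sound_planner: (problem, seed) -> verified plan found by search
    starting from the seed. The plan's validity rests on the planner."""
    seed = llm_propose(problem)          # cheap, possibly flawed guess
    return sound_planner(problem, seed)  # guaranteed-valid final plan
```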
- PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change [34.93870615625937]
PlanBench is a benchmark suite based on the kinds of domains used in the automated planning community.
PlanBench provides sufficient diversity in both the task domains and the specific planning capabilities.
arXiv Detail & Related papers (2022-06-21T16:15:27Z)
- ElitePLM: An Empirical Study on General Language Ability Evaluation of Pretrained Language Models [78.08792285698853]
We present a large-scale empirical study on general language ability evaluation of pretrained language models (ElitePLM).
Our empirical results demonstrate that: (1) PLMs with varying training objectives and strategies are good at different ability tests; (2) fine-tuning PLMs in downstream tasks is usually sensitive to the data size and distribution; and (3) PLMs have excellent transferability between similar tasks.
arXiv Detail & Related papers (2022-05-03T14:18:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.