Related papers: On the Planning Abilities of Large Language Models (A Critical Investigation with a Proposed Benchmark)

URL: http://arxiv.org/abs/2302.06706v1
Date: Mon, 13 Feb 2023 21:37:41 GMT
Title: On the Planning Abilities of Large Language Models (A Critical Investigation with a Proposed Benchmark)
Authors: Karthik Valmeekam, Sarath Sreedharan, Matthew Marquez, Alberto Olmo, Subbarao Kambhampati
Abstract summary: We develop a benchmark suite based on the kinds of domains employed in the International Planning Competition. We evaluate LLMs in three modes: autonomous, heuristic and human-in-the-loop. Our results show that LLMs' ability to autonomously generate executable plans is quite meager, averaging only about 3% success rate.
Score: 30.223130782579336
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Intrigued by the claims of emergent reasoning capabilities in LLMs trained on general web corpora, in this paper, we set out to investigate their planning capabilities. We aim to evaluate (1) how good LLMs are by themselves in generating and validating simple plans in commonsense planning tasks (of the type that humans are generally quite good at) and (2) how good LLMs are in being a source of heuristic guidance for other agents--either AI planners or human planners--in their planning tasks. To investigate these questions in a systematic rather than anecdotal manner, we start by developing a benchmark suite based on the kinds of domains employed in the International Planning Competition. On this benchmark, we evaluate LLMs in three modes: autonomous, heuristic and human-in-the-loop. Our results show that LLMs' ability to autonomously generate executable plans is quite meager, averaging only about 3% success rate. The heuristic and human-in-the-loop modes show slightly more promise. In addition to these results, we also make our benchmark and evaluation tools available to support investigations by the research community.

Related papers

PlanGenLLMs: A Modern Survey of LLM Planning Capabilities [12.322175348741435]
LLMs have immense potential for generating plans, transforming an initial world state into a desired goal state. Many of these systems are tailored to specific problems, making it challenging to compare them or determine the best approach for new tasks. Our survey aims to offer a comprehensive overview of current LLM planners to fill this gap. It builds on foundational work by Kartam and Wilkins (1990) and examines six key performance criteria: completeness, executability, optimality, representation, generalization, and efficiency.
arXiv Detail & Related papers (2025-02-16T17:54:57Z)
Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search [57.28671084993782]
Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains. Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities. We propose a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning.
arXiv Detail & Related papers (2025-02-04T17:26:58Z)
LLMs Can Plan Only If We Tell Them [16.593590353705697]
Large language models (LLMs) have demonstrated significant capabilities in natural language processing and reasoning. This paper investigates whether LLMs can independently generate long-horizon plans that rival human baselines.
arXiv Detail & Related papers (2025-01-23T10:46:14Z)
Query-Efficient Planning with Language Models [8.136901056728945]
Planning in complex environments requires an agent to efficiently query a world model to find a sequence of actions from start to goal. Recent work has shown that Large Language Models (LLMs) can potentially help with planning by searching over promising states and adapting to feedback from the world. We show that while both approaches improve upon comparable baselines, using an LLM as a generative planner results in significantly fewer interactions.
arXiv Detail & Related papers (2024-12-09T02:51:21Z)
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia. In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models. This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z)
Can We Rely on LLM Agents to Draft Long-Horizon Plans? Let's Take TravelPlanner as an Example [3.102303947219617]
Large language models (LLMs) have brought autonomous agents closer to artificial general intelligence (AGI). We present our study using a realistic benchmark, TravelPlanner, where an agent must meet multiple constraints to generate accurate plans.
arXiv Detail & Related papers (2024-08-12T17:39:01Z)
WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks [85.95607119635102]
Large language models (LLMs) can mimic human-like intelligence. WorkArena++ is designed to evaluate the planning, problem-solving, logical/arithmetic reasoning, retrieval, and contextual understanding abilities of web agents.
arXiv Detail & Related papers (2024-07-07T07:15:49Z)
Exploring and Benchmarking the Planning Capabilities of Large Language Models [57.23454975238014]
This work lays the foundations for improving planning capabilities of large language models (LLMs). We construct a comprehensive benchmark suite encompassing both classical planning benchmarks and natural language scenarios. We investigate the use of many-shot in-context learning to enhance LLM planning, exploring the relationship between increased context length and improved planning performance.
arXiv Detail & Related papers (2024-06-18T22:57:06Z)
Agent Planning with World Knowledge Model [88.4897773735576]
We introduce a parametric World Knowledge Model (WKM) to facilitate agent planning. We develop WKM, providing prior task knowledge to guide the global planning and dynamic state knowledge to assist the local planning. Our method can achieve superior performance compared to various strong baselines.
arXiv Detail & Related papers (2024-05-23T06:03:19Z)
Understanding the planning of LLM agents: A survey [98.82513390811148]
This survey provides the first systematic view of LLM-based agent planning, covering recent works aiming to improve planning ability. Comprehensive analyses are conducted for each direction, and further challenges in the field of research are discussed.
arXiv Detail & Related papers (2024-02-05T04:25:24Z)
SayCanPay: Heuristic Planning with Large Language Models using Learnable Domain Knowledge [14.024233628092167]
Large Language Models (LLMs) have demonstrated impressive planning abilities due to their vast "world knowledge". Yet, obtaining plans that are both feasible (grounded in affordances) and cost-effective (in plan length) remains a challenge, despite recent progress. This contrasts with planning methods that employ domain knowledge (formalized in action models such as PDDL) and search to generate feasible, optimal plans.
arXiv Detail & Related papers (2023-08-24T09:47:28Z)
On the Planning Abilities of Large Language Models : A Critical Investigation [34.262740442260515]
We evaluate the effectiveness of LLMs in generating plans autonomously in commonsense planning tasks. In the LLM-Modulo setting, we demonstrate that LLM-generated plans can improve the search process for underlying sound planners.
arXiv Detail & Related papers (2023-05-25T06:32:23Z)
PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change [34.93870615625937]
PlanBench is a benchmark suite based on the kinds of domains used in the automated planning community. PlanBench provides sufficient diversity in both the task domains and the specific planning capabilities.
arXiv Detail & Related papers (2022-06-21T16:15:27Z)
ElitePLM: An Empirical Study on General Language Ability Evaluation of Pretrained Language Models [78.08792285698853]
We present a large-scale empirical study on general language ability evaluation of pretrained language models (ElitePLM). Our empirical results demonstrate that: (1) PLMs with varying training objectives and strategies are good at different ability tests; (2) fine-tuning PLMs in downstream tasks is usually sensitive to the data size and distribution; and (3) PLMs have excellent transferability between similar tasks.
arXiv Detail & Related papers (2022-05-03T14:18:10Z)

This list is automatically generated from the titles and abstracts of the papers on this site.