PlanBench: An Extensible Benchmark for Evaluating Large Language Models
on Planning and Reasoning about Change
- URL: http://arxiv.org/abs/2206.10498v4
- Date: Sun, 26 Nov 2023 01:15:41 GMT
- Title: PlanBench: An Extensible Benchmark for Evaluating Large Language Models
on Planning and Reasoning about Change
- Authors: Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedharan,
Subbarao Kambhampati
- Abstract summary: PlanBench is a benchmark suite based on the kinds of domains used in the automated planning community.
PlanBench provides sufficient diversity in both the task domains and the specific planning capabilities.
- Score: 34.93870615625937
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating plans of action, and reasoning about change have long been
considered a core competence of intelligent agents. It is thus no surprise that
evaluating the planning and reasoning capabilities of large language models
(LLMs) has become a hot topic of research. Most claims about LLM planning
capabilities are however based on common sense tasks-where it becomes hard to
tell whether LLMs are planning or merely retrieving from their vast world
knowledge. There is a strong need for systematic and extensible planning
benchmarks with sufficient diversity to evaluate whether LLMs have innate
planning capabilities. Motivated by this, we propose PlanBench, an extensible
benchmark suite based on the kinds of domains used in the automated planning
community, especially in the International Planning Competition, to test the
capabilities of LLMs in planning or reasoning about actions and change.
PlanBench provides sufficient diversity in both the task domains and the
specific planning capabilities. Our studies also show that on many critical
capabilities-including plan generation-LLM performance falls quite short, even
with the SOTA models. PlanBench can thus function as a useful marker of
progress of LLMs in planning and reasoning.
Related papers
- Learning to Plan for Retrieval-Augmented Large Language Models from Knowledge Graphs [59.76268575344119]
We introduce a novel framework for enhancing large language models' (LLMs) planning capabilities by using planning data derived from knowledge graphs (KGs)
LLMs fine-tuned with KG data have improved planning capabilities, better equipping them to handle complex QA tasks that involve retrieval.
arXiv Detail & Related papers (2024-06-20T13:07:38Z) - Exploring and Benchmarking the Planning Capabilities of Large Language Models [57.23454975238014]
We construct a benchmark suite encompassing both classical planning domains and natural language scenarios.
Second, we investigate the use of in-context learning (ICL) to enhance LLM planning, exploring the direct relationship between increased context length and improved planning performance.
Third, we demonstrate the positive impact of fine-tuning LLMs on optimal planning paths, as well as the effectiveness of incorporating model-driven search procedures.
arXiv Detail & Related papers (2024-06-18T22:57:06Z) - What's the Plan? Evaluating and Developing Planning-Aware Techniques for Language Models [7.216683826556268]
Large language models (LLMs) are increasingly used for applications that require planning capabilities.
We introduce SimPlan, a novel hybrid-method, and evaluate its performance in a new challenging setup.
arXiv Detail & Related papers (2024-02-18T07:42:49Z) - Understanding the planning of LLM agents: A survey [98.82513390811148]
This survey provides the first systematic view of LLM-based agents planning, covering recent works aiming to improve planning ability.
Comprehensive analyses are conducted for each direction, and further challenges in the field of research are discussed.
arXiv Detail & Related papers (2024-02-05T04:25:24Z) - On the Prospects of Incorporating Large Language Models (LLMs) in
Automated Planning and Scheduling (APS) [23.024862968785147]
This paper investigates eight categories based on the unique applications of LLMs in addressing various aspects of planning problems.
A critical insight resulting from our review is that the true potential of LLMs unfolds when they are integrated with traditional symbolic planners.
arXiv Detail & Related papers (2024-01-04T19:22:09Z) - LLM-Assist: Enhancing Closed-Loop Planning with Language-Based Reasoning [65.86754998249224]
We develop a novel hybrid planner that leverages a conventional rule-based planner in conjunction with an LLM-based planner.
Our approach navigates complex scenarios which existing planners struggle with, produces well-reasoned outputs while also remaining grounded through working alongside the rule-based approach.
arXiv Detail & Related papers (2023-12-30T02:53:45Z) - EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning [84.6451394629312]
We introduce EgoPlan-Bench, a benchmark to evaluate the planning abilities of MLLMs in real-world scenarios.
We show that EgoPlan-Bench poses significant challenges, highlighting a substantial scope for improvement in MLLMs to achieve human-level task planning.
We also present EgoPlan-IT, a specialized instruction-tuning dataset that effectively enhances model performance on EgoPlan-Bench.
arXiv Detail & Related papers (2023-12-11T03:35:58Z) - AdaPlanner: Adaptive Planning from Feedback with Language Models [56.367020818139665]
Large language models (LLMs) have recently demonstrated the potential in acting as autonomous agents for sequential decision-making tasks.
We propose a closed-loop approach, AdaPlanner, which allows the LLM agent to refine its self-generated plan adaptively in response to environmental feedback.
To mitigate hallucination, we develop a code-style LLM prompt structure that facilitates plan generation across a variety of tasks, environments, and agent capabilities.
arXiv Detail & Related papers (2023-05-26T05:52:27Z) - Understanding the Capabilities of Large Language Models for Automated
Planning [24.37599752610625]
The study seeks to shed light on the capabilities of LLMs in solving complex planning problems.
It provides insights into the most effective approaches for using LLMs in this context.
arXiv Detail & Related papers (2023-05-25T15:21:09Z) - Plansformer: Generating Symbolic Plans using Transformers [24.375997526106246]
Large Language Models (LLMs) have been the subject of active research, significantly advancing the field of Natural Language Processing (NLP)
We introduce Plansformer; an LLM fine-tuned on planning problems and capable of generating plans with favorable behavior in terms of correctness and length with reduced knowledge-engineering efforts.
For one configuration of Plansformer, we achieve 97% valid plans, out of which 95% are optimal for Towers of Hanoi - a puzzle-solving domain.
arXiv Detail & Related papers (2022-12-16T19:06:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.