Generalized Planning in PDDL Domains with Pretrained Large Language Models
- URL: http://arxiv.org/abs/2305.11014v2
- Date: Mon, 18 Dec 2023 19:44:09 GMT
- Title: Generalized Planning in PDDL Domains with Pretrained Large Language Models
- Authors: Tom Silver, Soham Dan, Kavitha Srinivas, Joshua B. Tenenbaum, Leslie Pack Kaelbling, Michael Katz
- Abstract summary: We consider PDDL domains and use GPT-4 to synthesize Python programs.
We evaluate this approach in seven PDDL domains and compare it to four ablations and four baselines.
- Score: 82.24479434984426
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work has considered whether large language models (LLMs) can function
as planners: given a task, generate a plan. We investigate whether LLMs can
serve as generalized planners: given a domain and training tasks, generate a
program that efficiently produces plans for other tasks in the domain. In
particular, we consider PDDL domains and use GPT-4 to synthesize Python
programs. We also consider (1) Chain-of-Thought (CoT) summarization, where the
LLM is prompted to summarize the domain and propose a strategy in words before
synthesizing the program; and (2) automated debugging, where the program is
validated with respect to the training tasks, and in case of errors, the LLM is
re-prompted with four types of feedback. We evaluate this approach in seven
PDDL domains and compare it to four ablations and four baselines. Overall, we
find that GPT-4 is a surprisingly powerful generalized planner. We also
conclude that automated debugging is very important, that CoT summarization has
non-uniform impact, that GPT-4 is far superior to GPT-3.5, and that just two
training tasks are often sufficient for strong generalization.
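As an illustration of the pipeline described above, the sketch below shows one way the synthesize-validate-reprompt loop could be organized. It is not the authors' released code: `query_llm` and `validate_plan` are hypothetical callables supplied by the caller (standing in for a GPT-4 API wrapper and a PDDL plan validator), and the error strings are only a rough stand-in for the four feedback types mentioned in the abstract.

```python
# A minimal sketch of the generalized-planning loop from the abstract: prompt the
# LLM for a Python program, check it on the training tasks, and re-prompt with
# error feedback. `query_llm` and `validate_plan` are assumed callables supplied
# by the caller (e.g., a GPT-4 API wrapper and a PDDL plan validator).
from typing import Callable, List, Sequence, Tuple


def synthesize_generalized_planner(
    domain_pddl: str,
    train_tasks: Sequence[Tuple[str, str]],  # (task name, task PDDL) pairs
    query_llm: Callable[[str], str],
    validate_plan: Callable[[str, str, List[str]], Tuple[bool, str]],
    max_debug_rounds: int = 4,
) -> str:
    """Return Python source defining plan(task_pddl) -> list of ground actions."""
    # Chain-of-Thought summarization: describe the domain and a strategy in words first.
    strategy = query_llm(
        "Summarize this PDDL domain and propose a solution strategy in words:\n" + domain_pddl
    )
    source = query_llm(
        "Write a Python function plan(task_pddl) that returns a list of ground actions "
        "solving any task in this domain.\nStrategy:\n" + strategy + "\nDomain:\n" + domain_pddl
    )

    for _ in range(max_debug_rounds):
        feedback: List[str] = []
        namespace: dict = {}
        try:
            exec(source, namespace)  # load the synthesized program
            plan_fn = namespace["plan"]
        except Exception as exc:
            feedback.append(f"The program failed to load: {exc!r}")
        else:
            for name, task_pddl in train_tasks:
                try:
                    plan = plan_fn(task_pddl)
                except Exception as exc:
                    feedback.append(f"{name}: the program raised {exc!r}")
                    continue
                ok, error = validate_plan(domain_pddl, task_pddl, plan)
                if not ok:
                    feedback.append(f"{name}: plan rejected ({error})")
        if not feedback:
            return source  # the program solves every training task
        # Automated debugging: re-prompt the LLM with the collected feedback.
        source = query_llm(
            "The program below failed on the training tasks. Fix it.\n"
            + "\n".join(feedback) + "\n\nProgram:\n" + source
        )
    return source  # best effort after the debugging budget is exhausted
```

The exact prompts and the four feedback types are specified in the paper; this sketch only mirrors the overall control flow.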
Related papers
- NL2Plan: Robust LLM-Driven Planning from Minimal Text Descriptions [8.004470925893957]
We present NL2Plan, the first domain-agnostic offline LLM-driven planning system.
We evaluate NL2Plan on four planning domains and find that it solves 10 out of 15 tasks.
In addition to using NL2Plan in end-to-end mode, users can inspect and correct all of its intermediate results.
arXiv Detail & Related papers (2024-05-07T11:27:13Z)
- PPTC-R benchmark: Towards Evaluating the Robustness of Large Language Models for PowerPoint Task Completion [96.47420221442397]
We construct adversarial user instructions by attacking them at the sentence, semantic, and multi-language levels.
We test 3 closed-source and 4 open-source LLMs using a benchmark that incorporates robustness settings.
We find that GPT-4 exhibits the highest performance and strong robustness in our benchmark.
arXiv Detail & Related papers (2024-03-06T15:33:32Z)
- PROC2PDDL: Open-Domain Planning Representations from Texts [56.627183903841164]
Proc2PDDL is the first dataset containing open-domain procedural texts paired with expert-annotated PDDL representations.
We show that Proc2PDDL is highly challenging, with GPT-3.5's success rate close to 0% and GPT-4's around 35%.
arXiv Detail & Related papers (2024-02-29T19:40:25Z)
- Benchmarking GPT-4 on Algorithmic Problems: A Systematic Evaluation of Prompting Strategies [47.129504708849446]
Large Language Models (LLMs) have revolutionized the field of Natural Language Processing.
However, LLMs lack systematic generalization, i.e., the ability to extrapolate learned statistical regularities outside the training distribution.
In this work, we offer a systematic benchmarking of GPT-4, one of the most advanced LLMs available.
arXiv Detail & Related papers (2024-02-27T10:44:52Z)
- TAT-LLM: A Specialized Language Model for Discrete Reasoning over Tabular and Textual Data [73.29220562541204]
We consider harnessing the power of large language models (LLMs) to solve this task.
We develop the TAT-LLM language model by fine-tuning LLaMA 2 with training data generated automatically from existing expert-annotated datasets.
arXiv Detail & Related papers (2024-01-24T04:28:50Z)
- Reformulating Domain Adaptation of Large Language Models as Adapt-Retrieve-Revise: A Case Study on Chinese Legal Domain [32.11522364248498]
GPT-4 can generate hallucinated content in specialized domains such as Chinese law, hindering its application in these areas.
This paper introduces a simple and effective domain adaptation framework for GPT-4 by reformulating generation as an adapt-retrieve-revise process.
In the zero-shot setting of four Chinese legal tasks, our method improves accuracy by 33.3% compared to direct generation by GPT-4.
arXiv Detail & Related papers (2023-10-05T05:55:06Z)
- Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data? [49.688233418425995]
Struc-Bench is a comprehensive benchmark featuring prominent Large Language Models (LLMs).
We propose two innovative metrics, P-Score (Prompting Score) and H-Score (Heuristical Score).
Our experiments show that applying our structure-aware fine-tuning to LLaMA-7B leads to substantial performance gains.
arXiv Detail & Related papers (2023-09-16T11:31:58Z)
- Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-based Task Planning [39.29964085305846]
Methods that use pre-trained large language models directly as planners are currently impractical due to the limited correctness of the plans they produce.
In this work, we introduce a novel alternative paradigm that constructs an explicit world (domain) model in planning domain definition language (PDDL) and then uses it to plan with sound domain-independent planners.
arXiv Detail & Related papers (2023-05-24T08:59:15Z)