Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages
- URL: http://arxiv.org/abs/2407.03321v1
- Date: Wed, 3 Jul 2024 17:59:53 GMT
- Title: Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages
- Authors: Max Zuo, Francisco Piedrahita Velez, Xiaochen Li, Michael L. Littman, Stephen H. Bach,
- Abstract summary: We introduce benchmarkName, a benchmark designed to evaluate language models' ability to generate PDDL code from natural language descriptions of planning tasks.
We present a dataset of $132,037$ text-to-PDDL pairs across 13 different tasks, with varying levels of difficulty.
- Score: 20.62336315814875
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many recent works have explored using language models for planning problems. One line of research focuses on translating natural language descriptions of planning tasks into structured planning languages, such as the planning domain definition language (PDDL). While this approach is promising, accurately measuring the quality of generated PDDL code continues to pose significant challenges. First, generated PDDL code is typically evaluated using planning validators that check whether the problem can be solved with a planner. This method is insufficient because a language model might generate valid PDDL code that does not align with the natural language description of the task. Second, existing evaluation sets often have natural language descriptions of the planning task that closely resemble the ground truth PDDL, reducing the challenge of the task. To bridge this gap, we introduce \benchmarkName, a benchmark designed to evaluate language models' ability to generate PDDL code from natural language descriptions of planning tasks. We begin by creating a PDDL equivalence algorithm that rigorously evaluates the correctness of PDDL code generated by language models by flexibly comparing it against a ground truth PDDL. Then, we present a dataset of $132,037$ text-to-PDDL pairs across 13 different tasks, with varying levels of difficulty. Finally, we evaluate several API-access and open-weight language models that reveal this task's complexity. For example, $87.6\%$ of the PDDL problem descriptions generated by GPT-4o are syntactically parseable, $82.2\%$ are valid, solve-able problems, but only $35.1\%$ are semantically correct, highlighting the need for a more rigorous benchmark for this problem.
Related papers
- Interactive and Expressive Code-Augmented Planning with Large Language Models [62.799579304821826]
Large Language Models (LLMs) demonstrate strong abilities in common-sense reasoning and interactive decision-making.
Recent techniques have sought to structure LLM outputs using control flow and other code-adjacent techniques to improve planning performance.
We propose REPL-Plan, an LLM planning approach that is fully code-expressive and dynamic.
arXiv Detail & Related papers (2024-11-21T04:23:17Z) - Leveraging Environment Interaction for Automated PDDL Translation and Planning with Large Language Models [7.3238629831871735]
Large Language Models (LLMs) have shown remarkable performance in various natural language tasks.
Planning problems into the Planning Domain Definition Language (PDDL) has been proposed as a potential solution.
We propose a novel approach that leverages LLMs and environment feedback to automatically generate PDDL domain and problem description files.
arXiv Detail & Related papers (2024-07-17T19:50:51Z) - NL2Plan: Robust LLM-Driven Planning from Minimal Text Descriptions [8.004470925893957]
We present NL2Plan, the first domain-agnostic offline LLM-driven planning system.
We evaluate NL2Plan on four planning domains and find that it solves 10 out of 15 tasks.
In addition to using NL2Plan in end-to-end mode, users can inspect and correct all of its intermediate results.
arXiv Detail & Related papers (2024-05-07T11:27:13Z) - PROC2PDDL: Open-Domain Planning Representations from Texts [56.627183903841164]
Proc2PDDL is the first dataset containing open-domain procedural texts paired with expert-annotated PDDL representations.
We show that Proc2PDDL is highly challenging, with GPT-3.5's success rate close to 0% and GPT-4's around 35%.
arXiv Detail & Related papers (2024-02-29T19:40:25Z) - Real-World Planning with PDDL+ and Beyond [55.73913765642435]
We present Nyx, a novel PDDL+ planner built to emphasize lightness, simplicity, and, most importantly, adaptability.
Nyx can be tailored to virtually any potential real-world application requiring some form of AI Planning, paving the way for wider adoption of planning methods for solving real-world problems.
arXiv Detail & Related papers (2024-02-19T07:35:49Z) - TIC: Translate-Infer-Compile for accurate "text to plan" using LLMs and Logical Representations [0.0]
We study the problem of generating plans for given natural language planning task requests.
Our approach comprises of (a) translate: using an LLM only for generating a interpretable intermediate representation of natural language task description.
We observe that using an LLM to only output the intermediate representation significantly reduces LLM errors.
arXiv Detail & Related papers (2024-02-09T18:39:13Z) - AdaCCD: Adaptive Semantic Contrasts Discovery Based Cross Lingual
Adaptation for Code Clone Detection [69.79627042058048]
AdaCCD is a novel cross-lingual adaptation method that can detect cloned codes in a new language without annotations in that language.
We evaluate the cross-lingual adaptation results of AdaCCD by constructing a multilingual code clone detection benchmark consisting of 5 programming languages.
arXiv Detail & Related papers (2023-11-13T12:20:48Z) - L2CEval: Evaluating Language-to-Code Generation Capabilities of Large
Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs)
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z) - HDDL 2.1: Towards Defining a Formalism and a Semantics for Temporal HTN
Planning [64.07762708909846]
Real world applications need modelling rich and diverse automated planning problems.
hierarchical task network (HTN) formalism does not allow to represent planning problems with numerical and temporal constraints.
We propose to fill the gap between HDDL and these operational needs and to extend HDDL by taking inspiration from PDDL 2.1.
arXiv Detail & Related papers (2023-06-12T18:21:23Z) - Leveraging Pre-trained Large Language Models to Construct and Utilize
World Models for Model-based Task Planning [39.29964085305846]
Methods that use pre-trained large language models directly as planners are currently impractical due to limited correctness of plans.
In this work, we introduce a novel alternative paradigm that constructs an explicit world (domain) model in planning domain definition language (PDDL) and then uses it to plan with sound domain-independent planners.
arXiv Detail & Related papers (2023-05-24T08:59:15Z) - Planning with Complex Data Types in PDDL [2.7412662946127755]
We investigate a modeling language for complex software systems, which supports complex data types such as sets, arrays, records, and unions.
We map this representation further to PDDL to be used with domain-independent PDDL planners.
arXiv Detail & Related papers (2022-12-29T21:19:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.