LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied
Agents
- URL: http://arxiv.org/abs/2402.08178v1
- Date: Tue, 13 Feb 2024 02:28:57 GMT
- Title: LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied
Agents
- Authors: Jae-Woo Choi and Youngwoo Yoon and Hyobin Ong and Jaehong Kim and
Minsu Jang
- Abstract summary: Large language models (LLMs) have recently received considerable attention as alternative solutions for task planning.
We propose a benchmark system for automatically quantifying performance of task planning for home-service embodied agents.
- Score: 2.8927500190704567
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have recently received considerable attention as
alternative solutions for task planning. However, comparing the performance of
language-oriented task planners becomes difficult, and there exists a dearth of
detailed exploration regarding the effects of various factors such as
pre-trained model selection and prompt construction. To address this, we
propose a benchmark system for automatically quantifying performance of task
planning for home-service embodied agents. Task planners are tested on two
pairs of datasets and simulators: 1) ALFRED and AI2-THOR, 2) an extension of
Watch-And-Help and VirtualHome. Using the proposed benchmark system, we perform
extensive experiments with LLMs and prompts, and explore several enhancements
of the baseline planner. We expect that the proposed benchmark tool would
accelerate the development of language-oriented task planners.
Related papers
- Exploring and Benchmarking the Planning Capabilities of Large Language Models [57.23454975238014]
We construct a benchmark suite encompassing both classical planning domains and natural language scenarios.
Second, we investigate the use of in-context learning (ICL) to enhance LLM planning, exploring the direct relationship between increased context length and improved planning performance.
Third, we demonstrate the positive impact of fine-tuning LLMs on optimal planning paths, as well as the effectiveness of incorporating model-driven search procedures.
arXiv Detail & Related papers (2024-06-18T22:57:06Z) - Ask-before-Plan: Proactive Language Agents for Real-World Planning [68.08024918064503]
Proactive Agent Planning requires language agents to predict clarification needs based on user-agent conversation and agent-environment interaction.
We propose a novel multi-agent framework, Clarification-Execution-Planning (textttCEP), which consists of three agents specialized in clarification, execution, and planning.
arXiv Detail & Related papers (2024-06-18T14:07:28Z) - TravelPlanner: A Benchmark for Real-World Planning with Language Agents [63.199454024966506]
We propose TravelPlanner, a new planning benchmark that focuses on travel planning, a common real-world planning scenario.
It provides a rich sandbox environment, various tools for accessing nearly four million data records, and 1,225 meticulously curated planning intents and reference plans.
Comprehensive evaluations show that the current language agents are not yet capable of handling such complex planning tasks-even GPT-4 only achieves a success rate of 0.6%.
arXiv Detail & Related papers (2024-02-02T18:39:51Z) - Learning adaptive planning representations with natural language
guidance [90.24449752926866]
This paper describes Ada, a framework for automatically constructing task-specific planning representations.
Ada interactively learns a library of planner-compatible high-level action abstractions and low-level controllers adapted to a particular domain of planning tasks.
arXiv Detail & Related papers (2023-12-13T23:35:31Z) - TaskBench: Benchmarking Large Language Models for Task Automation [85.3879908356586]
We introduce TaskBench to evaluate the capability of large language models in task automation.
To generate high-quality evaluation datasets, we introduce the concept of Tool Graph.
We also propose TaskEval to evaluate the capability of LLMs from different aspects, including task decomposition, tool invocation, and parameter prediction.
arXiv Detail & Related papers (2023-11-30T18:02:44Z) - TPTU: Large Language Model-based AI Agents for Task Planning and Tool
Usage [28.554981886052953]
Large Language Models (LLMs) have emerged as powerful tools for various real-world applications.
Despite their prowess, intrinsic generative abilities of LLMs may prove insufficient for handling complex tasks.
This paper proposes a structured framework tailored for LLM-based AI Agents.
arXiv Detail & Related papers (2023-08-07T09:22:03Z) - Embodied Task Planning with Large Language Models [86.63533340293361]
We propose a TAsk Planing Agent (TaPA) in embodied tasks for grounded planning with physical scene constraint.
During inference, we discover the objects in the scene by extending open-vocabulary object detectors to multi-view RGB images collected in different achievable locations.
Experimental results show that the generated plan from our TaPA framework can achieve higher success rate than LLaVA and GPT-3.5 by a sizable margin.
arXiv Detail & Related papers (2023-07-04T17:58:25Z) - Learning to Reason over Scene Graphs: A Case Study of Finetuning GPT-2
into a Robot Language Model for Grounded Task Planning [45.51792981370957]
We investigate the applicability of a smaller class of large language models (LLMs) in robotic task planning by learning to decompose tasks into subgoal specifications for a planner to execute sequentially.
Our method grounds the input of the LLM on the domain that is represented as a scene graph, enabling it to translate human requests into executable robot plans.
Our findings suggest that the knowledge stored in an LLM can be effectively grounded to perform long-horizon task planning, demonstrating the promising potential for the future application of neuro-symbolic planning methods in robotics.
arXiv Detail & Related papers (2023-05-12T18:14:32Z) - PlanBench: An Extensible Benchmark for Evaluating Large Language Models
on Planning and Reasoning about Change [34.93870615625937]
PlanBench is a benchmark suite based on the kinds of domains used in the automated planning community.
PlanBench provides sufficient diversity in both the task domains and the specific planning capabilities.
arXiv Detail & Related papers (2022-06-21T16:15:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.