Robotouille: An Asynchronous Planning Benchmark for LLM Agents
- URL: http://arxiv.org/abs/2502.05227v1
- Date: Thu, 06 Feb 2025 05:50:37 GMT
- Title: Robotouille: An Asynchronous Planning Benchmark for LLM Agents
- Authors: Gonzalo Gonzalez-Pumariega, Leong Su Yean, Neha Sunkara, Sanjiban Choudhury,
- Abstract summary: Asynchronous planning is essential for agents that must account for time delays, reason over diverse long-horizon tasks, and collaborate with other agents.
We introduce Robotouille, a benchmark environment designed to test agents' ability to handle long-horizon asynchronous scenarios.
Our results show that ReAct (gpt4-o) achieves 47% on synchronous tasks but only 11% on asynchronous tasks, highlighting significant room for improvement.
- Score: 7.574421886354134
- License:
- Abstract: Effective asynchronous planning, or the ability to efficiently reason and plan over states and actions that must happen in parallel or sequentially, is essential for agents that must account for time delays, reason over diverse long-horizon tasks, and collaborate with other agents. While large language model (LLM) agents show promise in high-level task planning, current benchmarks focus primarily on short-horizon tasks and do not evaluate such asynchronous planning capabilities. We introduce Robotouille, a challenging benchmark environment designed to test LLM agents' ability to handle long-horizon asynchronous scenarios. Our synchronous and asynchronous datasets capture increasingly complex planning challenges that go beyond existing benchmarks, requiring agents to manage overlapping tasks and interruptions. Our results show that ReAct (gpt4-o) achieves 47% on synchronous tasks but only 11% on asynchronous tasks, highlighting significant room for improvement. We further analyze failure modes, demonstrating the need for LLM agents to better incorporate long-horizon feedback and self-audit their reasoning during task execution. Code is available at https://github.com/portal-cornell/robotouille.
Related papers
- Plan-over-Graph: Towards Parallelable LLM Agent Schedule [53.834646147919436]
Large Language Models (LLMs) have demonstrated exceptional abilities in reasoning for task planning.
This paper introduces a novel paradigm, plan-over-graph, in which the model first decomposes a real-life textual task into executable subtasks and constructs an abstract task graph.
The model then understands this task graph as input and generates a plan for parallel execution.
arXiv Detail & Related papers (2025-02-20T13:47:51Z) - ET-Plan-Bench: Embodied Task-level Planning Benchmark Towards Spatial-Temporal Cognition with Foundation Models [38.89166693142495]
ET-Plan-Bench is a benchmark for embodied task planning using Large Language Models (LLMs)
It features a controllable and diverse set of embodied tasks varying in different levels of difficulties and complexities.
Our benchmark distinguishes itself as a large-scale, quantifiable, highly automated, and fine-grained diagnostic framework.
arXiv Detail & Related papers (2024-10-02T19:56:38Z) - Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models [0.0]
We introduce a dynamic benchmarking system for conversational agents that evaluates their performance through a single, simulated, and lengthy user interaction.
We context switch regularly to interleave the tasks, which constructs a realistic testing scenario in which we assess the Long-Term Memory, Continual Learning, and Information Integration capabilities of the agents.
arXiv Detail & Related papers (2024-09-30T12:01:29Z) - COHERENT: Collaboration of Heterogeneous Multi-Robot System with Large Language Models [49.24666980374751]
COHERENT is a novel LLM-based task planning framework for collaboration of heterogeneous multi-robot systems.
A Proposal-Execution-Feedback-Adjustment mechanism is designed to decompose and assign actions for individual robots.
The experimental results show that our work surpasses the previous methods by a large margin in terms of success rate and execution efficiency.
arXiv Detail & Related papers (2024-09-23T15:53:41Z) - Graph-enhanced Large Language Models in Asynchronous Plan Reasoning [18.402877904882107]
We find that large language models (LLMs) behave poorly when not supplied with illustrations about the task-solving process in our benchmark AsyncHow.
We propose a novel technique called Plan Like a Graph (PLaG) that combines graphs with natural language prompts and achieves state-of-the-art results.
arXiv Detail & Related papers (2024-02-05T08:26:33Z) - TaskBench: Benchmarking Large Language Models for Task Automation [82.2932794189585]
We introduce TaskBench, a framework to evaluate the capability of large language models (LLMs) in task automation.
Specifically, task decomposition, tool selection, and parameter prediction are assessed.
Our approach combines automated construction with rigorous human verification, ensuring high consistency with human evaluation.
arXiv Detail & Related papers (2023-11-30T18:02:44Z) - LoHoRavens: A Long-Horizon Language-Conditioned Benchmark for Robotic
Tabletop Manipulation [38.66406497318709]
This work focuses on the tabletop manipulation task and releases a simulation benchmark, textitLoHoRavens, which covers various long-horizon reasoning aspects spanning color, size, space, arithmetics and reference.
We investigate two methods of bridging the modality gap: caption generation and learnable interface for incorporating explicit and implicit observation feedback to the LLM.
arXiv Detail & Related papers (2023-10-18T14:53:14Z) - Tree-Planner: Efficient Close-loop Task Planning with Large Language Models [63.06270302774049]
Tree-Planner reframes task planning with Large Language Models into three distinct phases.
Tree-Planner achieves state-of-the-art performance while maintaining high efficiency.
arXiv Detail & Related papers (2023-10-12T17:59:50Z) - Optimal task and motion planning and execution for human-robot
multi-agent systems in dynamic environments [54.39292848359306]
We propose a combined task and motion planning approach to optimize sequencing, assignment, and execution of tasks.
The framework relies on decoupling tasks and actions, where an action is one possible geometric realization of a symbolic task.
We demonstrate the approach effectiveness in a collaborative manufacturing scenario, in which a robotic arm and a human worker shall assemble a mosaic.
arXiv Detail & Related papers (2023-03-27T01:50:45Z) - Dynamic Multi-Robot Task Allocation under Uncertainty and Temporal
Constraints [52.58352707495122]
We present a multi-robot allocation algorithm that decouples the key computational challenges of sequential decision-making under uncertainty and multi-agent coordination.
We validate our results over a wide range of simulations on two distinct domains: multi-arm conveyor belt pick-and-place and multi-drone delivery dispatch in a city.
arXiv Detail & Related papers (2020-05-27T01:10:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.