Robotouille: An Asynchronous Planning Benchmark for LLM Agents
- URL: http://arxiv.org/abs/2502.05227v1
- Date: Thu, 06 Feb 2025 05:50:37 GMT
- Title: Robotouille: An Asynchronous Planning Benchmark for LLM Agents
- Authors: Gonzalo Gonzalez-Pumariega, Leong Su Yean, Neha Sunkara, Sanjiban Choudhury
- Abstract summary: Asynchronous planning is essential for agents that must account for time delays, reason over diverse long-horizon tasks, and collaborate with other agents. We introduce Robotouille, a benchmark environment designed to test agents' ability to handle long-horizon asynchronous scenarios. Our results show that ReAct (GPT-4o) achieves 47% on synchronous tasks but only 11% on asynchronous tasks, highlighting significant room for improvement.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Effective asynchronous planning, or the ability to efficiently reason and plan over states and actions that must happen in parallel or sequentially, is essential for agents that must account for time delays, reason over diverse long-horizon tasks, and collaborate with other agents. While large language model (LLM) agents show promise in high-level task planning, current benchmarks focus primarily on short-horizon tasks and do not evaluate such asynchronous planning capabilities. We introduce Robotouille, a challenging benchmark environment designed to test LLM agents' ability to handle long-horizon asynchronous scenarios. Our synchronous and asynchronous datasets capture increasingly complex planning challenges that go beyond existing benchmarks, requiring agents to manage overlapping tasks and interruptions. Our results show that ReAct (GPT-4o) achieves 47% on synchronous tasks but only 11% on asynchronous tasks, highlighting significant room for improvement. We further analyze failure modes, demonstrating the need for LLM agents to better incorporate long-horizon feedback and self-audit their reasoning during task execution. Code is available at https://github.com/portal-cornell/robotouille.
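The evaluation setup the abstract describes can be pictured as a ReAct-style agent loop over a text-based kitchen environment. The sketch below is hypothetical: the `env.reset()`/`env.step()` interface and the `llm` callable are illustrative assumptions, not the actual Robotouille API.

```python
def format_prompt(obs, history):
    """Flatten the thought/action history plus the latest observation."""
    lines = [f"Thought: {t}\nAction: {a}" for t, a in history]
    return "\n".join(lines + [f"Observation: {obs}", "Thought:"])

def react_episode(env, llm, max_steps=50):
    """One episode of interleaved reasoning and acting (ReAct)."""
    obs = env.reset()  # textual description of the initial kitchen state
    history = []
    for _ in range(max_steps):
        thought, action = llm(format_prompt(obs, history))
        history.append((thought, action))
        obs, done = env.step(action)
        # Asynchronous tasks are the hard case: an action like frying a
        # patty only completes several steps later, so a strong agent
        # interleaves other work instead of idling.
        if done:
            return True
    return False
```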
Related papers
- Exploring GPT-4 for Robotic Agent Strategy with Real-Time State Feedback and a Reactive Behaviour Framework
We explore the use of GPT-4 on a humanoid robot in simulation and the real world as a proof of concept of a novel large language model (LLM)-driven behaviour method.
The LLM is prompted with a goal and outputs the sub-tasks required to achieve it.
We propose a method that successfully addresses practical concerns around safety, transitions between tasks, time horizons of tasks and state feedback.
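A minimal sketch of the goal-to-sub-task prompting pattern this summary describes, using the OpenAI client; the prompt wording and model choice are illustrative assumptions, not the paper's actual prompts.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def decompose(goal: str, state: str) -> list[str]:
    """Ask the LLM for an ordered list of sub-tasks to achieve `goal`."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name; the paper uses GPT-4
        messages=[
            {"role": "system",
             "content": "Output one sub-task per line, nothing else."},
            {"role": "user",
             "content": f"Goal: {goal}\nCurrent robot state: {state}"},
        ],
    )
    return resp.choices[0].message.content.strip().splitlines()

# State feedback: re-run decompose() whenever execution reports a new state.
```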
arXiv Detail & Related papers (2025-03-30T21:53:28Z)
- REMAC: Self-Reflective and Self-Evolving Multi-Agent Collaboration for Long-Horizon Robot Manipulation
We propose an adaptive multi-agent planning framework, termed REMAC, that enables efficient, scene-agnostic multi-robot long-horizon task planning and execution.
REMAC incorporates two key modules: a self-reflection module that performs precondition and postcondition checks in the loop to evaluate progress and refine plans, and a self-evolvement module that dynamically adapts plans based on scene-specific reasoning.
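A hedged sketch of the self-reflection idea: wrap each plan step in precondition and postcondition checks and replan on failure. The data structures and check predicates are hypothetical, not REMAC's implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    precondition: Callable[[dict], bool]   # must hold before executing
    postcondition: Callable[[dict], bool]  # must hold afterwards
    execute: Callable[[dict], dict]        # returns the new world state

def run_with_reflection(plan, state, replan, retries=3):
    for step in plan:
        if not step.precondition(state):
            break  # pre-check failed: the plan no longer matches the scene
        state = step.execute(state)
        if not step.postcondition(state):
            break  # post-check failed: the action missed its intended effect
    else:
        return state  # every step passed both checks
    if retries == 0:
        raise RuntimeError("plan could not be repaired")
    # Self-evolvement: ask the planner for a refined plan and continue.
    return run_with_reflection(replan(state), state, replan, retries - 1)
```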
arXiv Detail & Related papers (2025-03-28T03:51:40Z)
- Data-Agnostic Robotic Long-Horizon Manipulation with Vision-Language-Guided Closed-Loop Feedback
We introduce DAHLIA, a data-agnostic framework for language-conditioned long-horizon robotic manipulation.
DAHLIA leverages large language models (LLMs) for real-time task planning and execution.
Our framework demonstrates state-of-the-art performance across diverse long-horizon tasks, achieving strong generalization in both simulated and real-world scenarios.
arXiv Detail & Related papers (2025-03-27T20:32:58Z)
- Haste Makes Waste: Evaluating Planning Abilities of LLMs for Efficient and Feasible Multitasking with Time Constraints Between Actions
We present Recipe2Plan, a novel benchmark framework based on real-world cooking scenarios.
Unlike conventional benchmarks, Recipe2Plan challenges agents to optimize cooking time through parallel task execution.
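With unlimited parallelism, the best achievable cooking time is the critical-path length of the recipe's dependency graph, which illustrates the gap between naive sequential plans and optimal ones that such a benchmark measures. The recipe below is invented for illustration.

```python
import functools

durations = {"boil water": 8, "chop veg": 5, "cook pasta": 10,
             "make sauce": 7, "plate": 2}
deps = {"cook pasta": ["boil water"], "make sauce": ["chop veg"],
        "plate": ["cook pasta", "make sauce"]}

@functools.cache
def finish(step: str) -> int:
    """Earliest finish time when all independent steps overlap."""
    return durations[step] + max((finish(d) for d in deps.get(step, [])),
                                 default=0)

print(finish("plate"))  # 20 minutes in parallel vs. 32 sequentially
```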
arXiv Detail & Related papers (2025-03-04T03:27:02Z)
- Plan-over-Graph: Towards Parallelable LLM Agent Schedule
Large Language Models (LLMs) have demonstrated exceptional abilities in reasoning for task planning.
This paper introduces a novel paradigm, plan-over-graph, in which the model first decomposes a real-life textual task into executable subtasks and constructs an abstract task graph.
The model then understands this task graph as input and generates a plan for parallel execution.
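The decompose-then-schedule idea can be illustrated with the standard library: independent subtasks in the same topological generation of the task graph can be dispatched in parallel. The travel subtasks below are invented for illustration.

```python
from graphlib import TopologicalSorter

# subtask -> set of subtasks it depends on
graph = {
    "book flights": set(),
    "book hotel": set(),
    "rent car": {"book flights"},            # needs the arrival time first
    "plan day trips": {"book hotel", "rent car"},
}

ts = TopologicalSorter(graph)
ts.prepare()
while ts.is_active():
    batch = list(ts.get_ready())  # all subtasks executable in parallel now
    print("execute in parallel:", batch)
    ts.done(*batch)
```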
arXiv Detail & Related papers (2025-02-20T13:47:51Z)
- COHERENT: Collaboration of Heterogeneous Multi-Robot System with Large Language Models
COHERENT is a novel LLM-based task planning framework for collaboration of heterogeneous multi-robot systems.
A Proposal-Execution-Feedback-Adjustment mechanism is designed to decompose and assign actions for individual robots.
The experimental results show that our work surpasses the previous methods by a large margin in terms of success rate and execution efficiency.
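A hedged sketch of how a Proposal-Execution-Feedback-Adjustment loop might look; all names and objects below are illustrative assumptions, not COHERENT's code.

```python
def pefa(planner, robots, task, state, max_rounds=5):
    """Propose an assignment, execute, gather feedback, adjust, repeat."""
    for _ in range(max_rounds):
        proposal = planner.propose(task, state)  # {robot_name: action}
        feedback = {name: robots[name].execute(action, state)
                    for name, action in proposal.items()}
        if all(result.success for result in feedback.values()):
            return proposal
        # Adjustment: e.g. a quadruped cannot open a drawer, so the
        # planner reassigns that action to a manipulator arm.
        state = planner.adjust(state, feedback)
    raise RuntimeError("no feasible assignment found")
```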
arXiv Detail & Related papers (2024-09-23T15:53:41Z)
- Planning with Multi-Constraints via Collaborative Language Agents
This paper introduces Planning with Multi-Constraints (PMC), a zero-shot methodology for collaborative multi-agent systems.
PMC simplifies complex task planning with constraints by decomposing it into a hierarchy of subordinate tasks.
PMC achieved an average 42.68% success rate on TravelPlanner, significantly higher than GPT-4 (2.92%), and outperformed GPT-4 with ReAct on API-Bank by 13.64%.
arXiv Detail & Related papers (2024-05-26T10:33:17Z)
- Graph-enhanced Large Language Models in Asynchronous Plan Reasoning
On our benchmark AsyncHow, we find that large language models (LLMs) perform poorly when not supplied with illustrations of the task-solving process.
We propose a novel technique called Plan Like a Graph (PLaG) that combines graphs with natural language prompts and achieves state-of-the-art results.
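The core prompting move, serializing the task graph into text so the model can reason over explicit structure, can be sketched as below; the template wording is an assumption, not the paper's exact prompt.

```python
def graph_prompt(task, durations, edges):
    """Render a task's dependency graph as a natural-language prompt."""
    lines = [f"Task: {task}", "Steps and durations:"]
    lines += [f"- {s}: {d} min" for s, d in durations.items()]
    lines.append("Constraints (A -> B means A must finish before B starts):")
    lines += [f"- {a} -> {b}" for a, b in edges]
    lines.append("What is the shortest total time if independent steps "
                 "can run in parallel?")
    return "\n".join(lines)

print(graph_prompt("make tea",
                   {"boil water": 5, "get cups": 1, "steep tea": 3},
                   [("boil water", "steep tea")]))
```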
arXiv Detail & Related papers (2024-02-05T08:26:33Z)
- TaskBench: Benchmarking Large Language Models for Task Automation
We introduce TaskBench, a framework to evaluate the capability of large language models (LLMs) in task automation.
Specifically, it assesses task decomposition, tool selection, and parameter prediction.
Our approach combines automated construction with rigorous human verification, ensuring high consistency with human evaluation.
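One plausible way to score such predictions is to compare the predicted tool-invocation graph against the ground truth with node and edge F1; the metric details here are assumptions for illustration, not necessarily TaskBench's exact formulation.

```python
def f1(pred: set, gold: set) -> float:
    """Harmonic mean of precision and recall over set overlap."""
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold_tools = {"search", "summarize", "translate"}
gold_edges = {("search", "summarize"), ("summarize", "translate")}
pred_tools = {"search", "summarize"}
pred_edges = {("search", "summarize")}

print(f1(pred_tools, gold_tools))  # 0.8
print(f1(pred_edges, gold_edges))  # ~0.67
```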
arXiv Detail & Related papers (2023-11-30T18:02:44Z)
- MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation
We introduce MLAgentBench, a suite of 13 tasks ranging from improving model performance on CIFAR-10 to recent research problems like BabyLM.
For each task, an agent can perform actions like reading/writing files, executing code, and inspecting outputs.
We benchmark agents based on Claude v1.0, Claude v2.1, Claude v3 Opus, GPT-4, GPT-4-turbo, Gemini-Pro, and Mixtral and find that a Claude v3 Opus agent is the best in terms of success rate.
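A hedged sketch of the kind of action interface such an agent works against (file I/O, script execution, output inspection); the function names are illustrative, not MLAgentBench's actual API.

```python
import subprocess
from pathlib import Path

def read_file(path: str) -> str:
    return Path(path).read_text()

def write_file(path: str, content: str) -> None:
    Path(path).write_text(content)

def execute_script(path: str, timeout: int = 600) -> str:
    """Run a script and return its output for the agent to inspect."""
    result = subprocess.run(["python", path], capture_output=True,
                            text=True, timeout=timeout)
    return result.stdout + result.stderr

# The agent loop: edit train.py, execute it, read the printed metrics,
# and decide on the next modification.
```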
arXiv Detail & Related papers (2023-10-05T04:06:12Z)
- Dynamic Multi-Robot Task Allocation under Uncertainty and Temporal Constraints
We present a multi-robot allocation algorithm that decouples the key computational challenges of sequential decision-making under uncertainty and multi-agent coordination.
We validate our results over a wide range of simulations on two distinct domains: multi-arm conveyor belt pick-and-place and multi-drone delivery dispatch in a city.
arXiv Detail & Related papers (2020-05-27T01:10:41Z)