Related papers: Cook and Clean Together: Teaching Embodied Agents for Parallel Task Execution

URL: http://arxiv.org/abs/2511.19430v1
Date: Mon, 24 Nov 2025 18:59:17 GMT
Title: Cook and Clean Together: Teaching Embodied Agents for Parallel Task Execution
Authors: Dingkang Liang, Cheng Zhang, Xiaopeng Xu, Jianzhong Ju, Zhenbo Luo, Xiang Bai
Abstract summary: Operations Research knowledge-based 3D Grounded Task Scheduling (ORS3D) is a new task that requires the synergy of language understanding, 3D grounding, and efficiency optimization. To facilitate research on ORS3D, we construct ORS3D-60K, a large-scale dataset comprising 60K composite tasks across 4K real-world scenes. Experiments on ORS3D-60K validate the effectiveness of GRANT across language understanding, 3D grounding, and scheduling efficiency.
Score: 51.89342880214462
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Task scheduling is critical for embodied AI, enabling agents to follow natural language instructions and execute actions efficiently in 3D physical worlds. However, existing datasets often simplify task planning by ignoring operations research (OR) knowledge and 3D spatial grounding. In this work, we propose Operations Research knowledge-based 3D Grounded Task Scheduling (ORS3D), a new task that requires the synergy of language understanding, 3D grounding, and efficiency optimization. Unlike prior settings, ORS3D demands that agents minimize total completion time by leveraging parallelizable subtasks, e.g., cleaning the sink while the microwave operates. To facilitate research on ORS3D, we construct ORS3D-60K, a large-scale dataset comprising 60K composite tasks across 4K real-world scenes. Furthermore, we propose GRANT, an embodied multi-modal large language model equipped with a simple yet effective scheduling token mechanism to generate efficient task schedules and grounded actions. Extensive experiments on ORS3D-60K validate the effectiveness of GRANT across language understanding, 3D grounding, and scheduling efficiency. The code is available at https://github.com/H-EmbodVis/GRANT
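The abstract's example of cleaning the sink while the microwave operates can be made concrete with a small calculation. The following is a minimal sketch, not the paper's GRANT model or its scheduling token mechanism: the `Subtask` class, the greedy ordering rule, and all durations (including the washing machine subtask) are hypothetical and chosen only to illustrate how overlapping unattended waiting time shortens total completion time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subtask:
    """Hypothetical subtask model; not taken from the ORS3D-60K dataset."""
    name: str
    active: float      # minutes the agent itself must spend on the subtask
    wait: float = 0.0  # minutes an appliance then runs without the agent


def sequential_time(tasks):
    """Completion time if the agent idles through every appliance wait."""
    return sum(t.active + t.wait for t in tasks)


def parallel_time(tasks):
    """Completion time when the agent works during appliance waits.

    Greedy rule (an assumption of this sketch): start the subtasks with
    the longest unattended wait first, so their waits overlap the rest.
    """
    ordered = sorted(tasks, key=lambda t: t.wait, reverse=True)
    clock = 0.0   # time at which the agent becomes free again
    finish = 0.0  # time at which the last subtask has fully completed
    for t in ordered:
        clock += t.active
        finish = max(finish, clock + t.wait)
    return finish


# Made-up durations in minutes, for illustration only.
tasks = [
    Subtask("heat food in the microwave", active=1, wait=5),
    Subtask("clean the sink", active=4),
    Subtask("run the washing machine", active=2, wait=30),
]

print("sequential:", sequential_time(tasks))
print("parallel:  ", parallel_time(tasks))
```

With these made-up numbers the sequential plan takes 42 minutes, while the overlapped plan takes 32, because the sink cleaning and the microwave run both fit inside the washing machine's unattended cycle.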

Related papers

VULCAN: Tool-Augmented Multi Agents for Iterative 3D Object Arrangement [66.13644883379087]
We tackle three key challenges in the 3D object arrangement task using MLLMs. First, to address the weak visual grounding of MLLMs, we introduce an MCP-based API. Second, we augment the MLLM's 3D scene understanding with a suite of specialized visual tools. Third, to manage the iterative, error-prone updates, we propose a collaborative multi-agent framework.
arXiv Detail & Related papers (2025-12-26T19:22:39Z)
3D-Grounded Vision-Language Framework for Robotic Task Planning: Automated Prompt Synthesis and Supervised Reasoning [2.6670748466660523]
Vision-language models (VLMs) have achieved remarkable success in scene understanding and perception tasks. VLMs lack robust 3D scene localization capabilities, limiting their effectiveness in fine-grained robotic operations. We propose a novel framework that integrates a 2D prompt synthesis module by mapping 2D images to point clouds, and incorporates a small language model (SLM) for supervising VLM outputs.
arXiv Detail & Related papers (2025-02-13T02:40:19Z)
S2O: Static to Openable Enhancement for Articulated 3D Objects [20.310491257189422]
We introduce the static to openable (S2O) task which creates interactive articulated 3D objects from static counterparts. Our work enables efficient creation of interactive 3D objects for robotic manipulation and embodied AI tasks.
arXiv Detail & Related papers (2024-09-27T16:34:13Z)
Task-oriented Sequential Grounding and Navigation in 3D Scenes [33.740081195089964]
Grounding natural language in 3D environments is a critical step toward achieving robust 3D vision-language alignment. In this work, we introduce a novel task: Task-oriented Sequential Grounding and Navigation in 3D Scenes. We present SG3D, a large-scale dataset comprising 22,346 tasks with 112,236 steps across 4,895 real-world 3D scenes.
arXiv Detail & Related papers (2024-08-07T18:30:18Z)
RepVF: A Unified Vector Fields Representation for Multi-task 3D Perception [64.80760846124858]
This paper proposes a novel unified representation, RepVF, which harmonizes the representation of various perception tasks. RepVF characterizes the structure of different targets in the scene through a vector field, enabling a single-head, multi-task learning model. Building upon RepVF, we introduce RFTR, a network designed to exploit the inherent connections between different tasks.
arXiv Detail & Related papers (2024-07-15T16:25:07Z)
A Unified Framework for 3D Scene Understanding [50.6762892022386]
UniSeg3D is a unified 3D scene understanding framework. It achieves panoptic, semantic, instance, interactive, referring, and open-vocabulary segmentation tasks within a single model.
arXiv Detail & Related papers (2024-07-03T16:50:07Z)
Grounded 3D-LLM with Referent Tokens [58.890058568493096]
We propose Grounded 3D-LLM to consolidate various 3D vision tasks within a unified generative framework. The model uses scene referent tokens as special noun phrases to reference 3D scenes. Per-task instruction-following templates are employed to ensure naturalness and diversity in translating 3D vision tasks into language formats.
arXiv Detail & Related papers (2024-05-16T18:03:41Z)
Rethinking 3D Dense Caption and Visual Grounding in A Unified Framework through Prompt-based Localization [51.33923845954759]
3D Visual Grounding (3DVG) and 3D Dense Captioning (3DDC) are two crucial tasks in various 3D applications. We propose a unified framework, 3DGCTR, to jointly solve these two distinct but closely related tasks. In terms of implementation, we integrate a Lightweight Caption Head into the existing 3DVG network with a Caption Text Prompt as a connection.
arXiv Detail & Related papers (2024-04-17T04:46:27Z)
3D-GPT: Procedural 3D Modeling with Large Language Models [47.72968643115063]
We introduce 3D-GPT, a framework utilizing large language models (LLMs) for instruction-driven 3D modeling. 3D-GPT positions LLMs as proficient problem solvers, dissecting the procedural 3D modeling tasks into accessible segments and appointing the apt agent for each task. Our empirical investigations confirm that 3D-GPT not only interprets and executes instructions, delivering reliable results, but also collaborates effectively with human designers.
arXiv Detail & Related papers (2023-10-19T17:41:48Z)
LiDAR-BEVMTN: Real-Time LiDAR Bird's-Eye View Multi-Task Perception Network for Autonomous Driving [12.713417063678335]
We present a real-time multi-task convolutional neural network for LiDAR-based object detection, semantics, and motion segmentation. We propose a novel Semantic Weighting and Guidance (SWAG) module to selectively transfer semantic features for improved object detection. We achieve state-of-the-art results for two tasks, semantic and motion segmentation, and close to state-of-the-art performance for 3D object detection.
arXiv Detail & Related papers (2023-07-17T21:22:17Z)
A Simple and Efficient Multi-task Network for 3D Object Detection and Road Understanding [20.878931360708343]
We show that it is possible to perform all perception tasks via a simple and efficient multi-task network. Our proposed network, LidarMTL, takes a raw LiDAR point cloud as input and predicts six perception outputs for 3D object detection and road understanding.
arXiv Detail & Related papers (2021-03-06T08:00:26Z)

This list is automatically generated from the titles and abstracts of the papers on this site.