Exploring 3D Activity Reasoning and Planning: From Implicit Human Intentions to Route-Aware Planning
- URL: http://arxiv.org/abs/2503.12974v1
- Date: Mon, 17 Mar 2025 09:33:58 GMT
- Title: Exploring 3D Activity Reasoning and Planning: From Implicit Human Intentions to Route-Aware Planning
- Authors: Xueying Jiang, Wenhao Li, Xiaoqin Zhang, Ling Shao, Shijian Lu,
- Abstract summary: We propose 3D activity reasoning and planning, a novel 3D task that reasons the intended activities from implicit instructions and decomposes them into steps with inter-step routes and planning.<n>First, we construct ReasonPlan3D, a large-scale benchmark that covers diverse 3D scenes with rich implicit instructions.<n>Second, we design a novel framework that introduces progressive plan generation with contextual consistency across multiple steps.
- Score: 103.24305074625106
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D activity reasoning and planning has attracted increasing attention in human-robot interaction and embodied AI thanks to the recent advance in multimodal learning. However, most existing works share two constraints: 1) heavy reliance on explicit instructions with little reasoning on implicit user intention; 2) negligence of inter-step route planning on robot moves. To bridge the gaps, we propose 3D activity reasoning and planning, a novel 3D task that reasons the intended activities from implicit instructions and decomposes them into steps with inter-step routes and planning under the guidance of fine-grained 3D object shapes and locations from scene segmentation. We tackle the new 3D task from two perspectives. First, we construct ReasonPlan3D, a large-scale benchmark that covers diverse 3D scenes with rich implicit instructions and detailed annotations for multi-step task planning, inter-step route planning, and fine-grained segmentation. Second, we design a novel framework that introduces progressive plan generation with contextual consistency across multiple steps, as well as a scene graph that is updated dynamically for capturing critical objects and their spatial relations. Extensive experiments demonstrate the effectiveness of our benchmark and framework in reasoning activities from implicit human instructions, producing accurate stepwise task plans, and seamlessly integrating route planning for multi-step moves. The dataset and code will be released.
Related papers
- Multimodal 3D Reasoning Segmentation with Complex Scenes [92.92045550692765]
We bridge the research gaps by proposing a 3D reasoning segmentation task for multiple objects in scenes.
The task allows producing 3D segmentation masks and detailed textual explanations as enriched by 3D spatial relations among objects.
In addition, we design MORE3D, a simple yet effective method that enables multi-object 3D reasoning segmentation with user questions and textual outputs.
arXiv Detail & Related papers (2024-11-21T08:22:45Z) - Task-oriented Sequential Grounding and Navigation in 3D Scenes [33.740081195089964]
Grounding natural language in 3D environments is a critical step toward achieving robust 3D vision-language alignment.<n>In this work, we introduce a novel task: Task-oriented Sequential Grounding and Navigation in 3D Scenes.<n>We present SG3D, a large-scale dataset comprising 22,346 tasks with 112,236 steps across 4,895 real-world 3D scenes.
arXiv Detail & Related papers (2024-08-07T18:30:18Z) - Task and Motion Planning for Execution in the Real [24.01204729304763]
This work generates task and motion plans that include actions cannot be fully grounded at planning time.
Execution combines offline planned motions and online behaviors till reaching the task goal.
Forty real-robot trials and motivating demonstrations are performed to evaluate the proposed framework.
Results show faster execution time, less number of actions, and more success in problems where diverse gaps arise.
arXiv Detail & Related papers (2024-06-05T22:30:40Z) - OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning [68.45848423501927]
We propose a holistic framework for strong alignment between agent models and 3D driving tasks.
Our framework starts with a novel 3D MLLM architecture that uses sparse queries to lift and compress visual representations into 3D.
We propose OmniDrive-nuScenes, a new visual question-answering dataset challenging the true 3D situational awareness of a model.
arXiv Detail & Related papers (2024-05-02T17:59:24Z) - SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z) - ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and
Planning [125.90002884194838]
ConceptGraphs is an open-vocabulary graph-structured representation for 3D scenes.
It is built by leveraging 2D foundation models and fusing their output to 3D by multi-view association.
We demonstrate the utility of this representation through a number of downstream planning tasks.
arXiv Detail & Related papers (2023-09-28T17:53:38Z) - SayPlan: Grounding Large Language Models using 3D Scene Graphs for
Scalable Robot Task Planning [15.346150968195015]
We introduce SayPlan, a scalable approach to large-scale task planning for robotics using 3D scene graph (3DSG) representations.
We evaluate our approach on two large-scale environments spanning up to 3 floors and 36 rooms with 140 assets and objects.
arXiv Detail & Related papers (2023-07-12T12:37:55Z) - Embodied Task Planning with Large Language Models [86.63533340293361]
We propose a TAsk Planing Agent (TaPA) in embodied tasks for grounded planning with physical scene constraint.
During inference, we discover the objects in the scene by extending open-vocabulary object detectors to multi-view RGB images collected in different achievable locations.
Experimental results show that the generated plan from our TaPA framework can achieve higher success rate than LLaVA and GPT-3.5 by a sizable margin.
arXiv Detail & Related papers (2023-07-04T17:58:25Z) - TASKOGRAPHY: Evaluating robot task planning over large 3D scene graphs [33.25317860393983]
TASKOGRAPHY is the first large-scale robotic task planning benchmark over 3DSGs.
We propose SCRUB, a task-conditioned 3DSG sparsification method.
We also propose SEEK, a procedure enabling learning-based planners to exploit 3DSG structure.
arXiv Detail & Related papers (2022-07-11T16:51:44Z) - Sequential Manipulation Planning on Scene Graph [90.28117916077073]
We devise a 3D scene graph representation, contact graph+ (cg+), for efficient sequential task planning.
Goal configurations, naturally specified on contact graphs, can be produced by a genetic algorithm with an optimization method.
A task plan is then succinct by computing the Graph Editing Distance (GED) between the initial contact graphs and the goal configurations, which generates graph edit operations corresponding to possible robot actions.
arXiv Detail & Related papers (2022-07-10T02:01:33Z) - Learning to Search in Task and Motion Planning with Streams [20.003445874753233]
Task and motion planning problems in robotics combine symbolic planning over discrete task variables with motion optimization over continuous state and action variables.
We propose a geometrically informed symbolic planner that expands the set of objects and facts in a best-first manner.
We apply our algorithm on a 7DOF robotic arm in block-stacking manipulation tasks.
arXiv Detail & Related papers (2021-11-25T15:58:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.