ASHiTA: Automatic Scene-grounded HIerarchical Task Analysis
- URL: http://arxiv.org/abs/2504.06553v3
- Date: Fri, 11 Apr 2025 12:57:13 GMT
- Title: ASHiTA: Automatic Scene-grounded HIerarchical Task Analysis
- Authors: Yun Chang, Leonor Fermoselle, Duy Ta, Bernadette Bucher, Luca Carlone, Jiuguang Wang
- Abstract summary: ASHiTA is a framework that generates a task hierarchy grounded to a 3D scene graph by breaking down high-level tasks into grounded subtasks. Our experiments show that ASHiTA performs significantly better than LLM baselines in breaking down high-level tasks into environment-dependent subtasks.
- Score: 15.68979922374718
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While recent work in scene reconstruction and understanding has made strides in grounding natural language to physical 3D environments, it is still challenging to ground abstract, high-level instructions to a 3D scene. High-level instructions might not explicitly invoke semantic elements in the scene, and even the process of breaking a high-level task into a set of more concrete subtasks, a process called hierarchical task analysis, is environment-dependent. In this work, we propose ASHiTA, the first framework that generates a task hierarchy grounded to a 3D scene graph by breaking down high-level tasks into grounded subtasks. ASHiTA alternates LLM-assisted hierarchical task analysis, to generate the task breakdown, with task-driven 3D scene graph construction to generate a suitable representation of the environment. Our experiments show that ASHiTA performs significantly better than LLM baselines in breaking down high-level tasks into environment-dependent subtasks and is additionally able to achieve grounding performance comparable to state-of-the-art methods.
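The alternation at the core of ASHiTA can be pictured as a short loop. The following is a minimal sketch with hypothetical helper functions, not the authors' implementation:

```python
# Minimal sketch of ASHiTA's alternating structure.
# `llm_task_breakdown` and `build_scene_graph` are hypothetical helpers.

def llm_task_breakdown(task, scene_graph):
    """Ask an LLM to break a high-level task into subtasks, conditioned
    on the objects and regions currently in the scene graph."""
    raise NotImplementedError  # placeholder for an LLM call

def build_scene_graph(observations, subtasks):
    """Rebuild a task-driven 3D scene graph, keeping the elements
    relevant to the current subtasks."""
    raise NotImplementedError  # placeholder for 3D perception

def ashita_style_loop(task, observations, num_iters=3):
    scene_graph = build_scene_graph(observations, subtasks=[])
    hierarchy = None
    for _ in range(num_iters):
        # Step 1: LLM-assisted hierarchical task analysis.
        hierarchy = llm_task_breakdown(task, scene_graph)
        # Step 2: task-driven scene graph construction, informed by
        # the subtasks produced in step 1.
        scene_graph = build_scene_graph(observations, hierarchy)
    return hierarchy, scene_graph
```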
Related papers
- GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering [23.459190671283487]
In Embodied Question Answering (EQA), agents must explore and develop a semantic understanding of an unseen environment in order to answer a situated question with confidence. We propose GraphEQA, a novel approach that utilizes real-time 3D metric-semantic scene graphs (3DSGs) and task-relevant images as multi-modal memory for grounding Vision-Language Models (VLMs). We employ a hierarchical planning approach that exploits the hierarchical nature of 3DSGs for structured planning and semantic-guided exploration.
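One way to picture the scene-graph-as-memory idea is as prompt construction for a VLM. The serialization and the `vlm.generate` interface below are assumptions for illustration, not the paper's API:

```python
# Hypothetical sketch: serializing a 3D scene graph plus task-relevant
# images into a VLM prompt, in the spirit of GraphEQA (names invented).

def serialize_scene_graph(nodes):
    """Flatten scene-graph nodes (objects with rooms) to plain text."""
    return "\n".join(f"{n['id']}: {n['label']} in {n['room']}" for n in nodes)

def answer_question(vlm, question, scene_graph_nodes, task_images):
    prompt = (
        "Scene graph:\n" + serialize_scene_graph(scene_graph_nodes) +
        f"\n\nQuestion: {question}\nAnswer, or say where to explore next:"
    )
    # `vlm` is assumed to accept text plus a list of images.
    return vlm.generate(prompt, images=task_images)
```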
arXiv Detail & Related papers (2024-12-19T03:04:34Z)
- Task-oriented Sequential Grounding and Navigation in 3D Scenes [33.740081195089964]
Grounding natural language in 3D environments is a critical step toward achieving robust 3D vision-language alignment. In this work, we introduce a novel task: Task-oriented Sequential Grounding and Navigation in 3D Scenes. We present SG3D, a large-scale dataset comprising 22,346 tasks with 112,236 steps across 4,895 real-world 3D scenes.
arXiv Detail & Related papers (2024-08-07T18:30:18Z)
- Forest2Seq: Revitalizing Order Prior for Sequential Indoor Scene Synthesis [109.50718968215658]
We propose Forest2Seq, a framework that formulates indoor scene synthesis as an order-aware sequential learning problem.
By employing a clustering-based algorithm and a breadth-first traversal, Forest2Seq derives meaningful orderings and utilizes a transformer to generate realistic 3D scenes autoregressively.
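The ordering step can be sketched as linearizing a forest of object clusters breadth-first; the data layout below is an assumption for illustration, not the paper's exact procedure:

```python
from collections import deque

def breadth_first_order(roots, children):
    """Linearize a forest breadth-first. `roots` lists tree roots;
    `children` maps each node to its child nodes."""
    order = []
    for root in roots:
        queue = deque([root])
        while queue:
            node = queue.popleft()
            order.append(node)
            queue.extend(children.get(node, []))
    return order  # token order for autoregressive generation

# Example: two clusters of furniture.
roots = ["bed", "desk"]
children = {"bed": ["nightstand", "lamp"], "desk": ["chair"]}
print(breadth_first_order(roots, children))
# ['bed', 'nightstand', 'lamp', 'desk', 'chair']
```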
arXiv Detail & Related papers (2024-07-07T14:32:53Z)
- Embodied Instruction Following in Unknown Environments [66.60163202450954]
We propose an embodied instruction following (EIF) method for complex tasks in unknown environments.
We build a hierarchical embodied instruction following framework comprising a high-level task planner and a low-level exploration controller.
For the task planner, we generate feasible step-by-step plans for accomplishing the human's goal according to the task completion process and known visual clues.
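The two-level design can be sketched as a planner loop over an exploration controller; all helper names below are placeholders, not the paper's API:

```python
# Hypothetical two-level loop: high-level task planner over
# a low-level exploration controller (names are placeholders).

def plan_next_step(goal, visual_clues, done_steps):
    """Planner: propose the next feasible step given the goal,
    accumulated visual clues, and progress so far."""
    raise NotImplementedError

def explore_and_execute(step):
    """Controller: explore the unknown environment and execute one step.
    Returns (success, new_visual_clues)."""
    raise NotImplementedError

def follow_instruction(goal, max_steps=20):
    clues, done = [], []
    for _ in range(max_steps):
        step = plan_next_step(goal, clues, done)
        if step is None:  # planner signals completion
            return True
        ok, new_clues = explore_and_execute(step)
        clues.extend(new_clues)
        if ok:
            done.append(step)
    return False
```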
arXiv Detail & Related papers (2024-06-17T17:55:40Z)
- Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level.
Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
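A toy illustration of the object-identifier idea, with an assumed tag format rather than the paper's exact scheme:

```python
# Toy illustration of object identifiers (tag format is an assumption).
objects = [
    {"id": "<OBJ001>", "label": "sofa"},
    {"id": "<OBJ002>", "label": "coffee table"},
]

prompt = "Objects in the scene:\n"
prompt += "\n".join(f"{o['id']} {o['label']}" for o in objects)
prompt += "\nQuestion: What is next to <OBJ001>?"
# An LLM answering with "<OBJ002>" grounds its reply to a specific
# object-centric representation rather than free-form text.
print(prompt)
```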
arXiv Detail & Related papers (2023-12-13T14:27:45Z)
- Embodied Task Planning with Large Language Models [86.63533340293361]
We propose a TAsk Planning Agent (TaPA) for grounded planning in embodied tasks under physical scene constraints.
During inference, we discover the objects in the scene by extending open-vocabulary object detectors to multi-view RGB images collected in different achievable locations.
Experimental results show that the plans generated by our TaPA framework achieve a higher success rate than LLaVA and GPT-3.5 by a sizable margin.
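The multi-view grounding step can be sketched as aggregating open-vocabulary detections across views; the detector interface below is assumed, not TaPA's released code:

```python
# Hedged sketch: aggregating open-vocabulary detections across
# multi-view RGB images into a scene object list (detector API assumed).

def detect(image, vocabulary):
    """Open-vocabulary detector: returns the labels found in one view."""
    raise NotImplementedError  # e.g., an OWL-ViT-style model

def collect_scene_objects(images, vocabulary):
    seen = set()
    for img in images:  # views from different achievable locations
        seen.update(detect(img, vocabulary))
    return sorted(seen)  # object list that constrains the LLM's plan
```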
arXiv Detail & Related papers (2023-07-04T17:58:25Z)
- Fast Inference and Transfer of Compositional Task Structures for Few-shot Task Generalization [101.72755769194677]
We formulate few-shot task generalization as a reinforcement learning problem in which each task is characterized by a subtask graph.
Our multi-task subtask graph inferencer (MTSGI) first infers the common high-level task structure in terms of the subtask graph from the training tasks.
Our experiment results on 2D grid-world and complex web navigation domains show that the proposed method can learn and leverage the common underlying structure of the tasks for faster adaptation to the unseen tasks.
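As a toy illustration of inferring subtask structure from demonstrations, one can count pairwise orderings; this simplification is not MTSGI's actual inference procedure:

```python
# Simplified illustration of inferring subtask precedence from
# demonstrations (not MTSGI's actual inference procedure).
from collections import Counter
from itertools import combinations

def infer_precedence(trajectories):
    """Count how often subtask a finishes before subtask b; orderings
    that hold in every trajectory become candidate graph edges."""
    votes = Counter()
    for traj in trajectories:  # traj: ordered list of completed subtasks
        for a, b in combinations(traj, 2):
            votes[(a, b)] += 1
    return {edge for edge, n in votes.items() if n == len(trajectories)}

demos = [["get wood", "make plank", "build bridge"],
         ["get wood", "build bridge", "make plank"]]
print(infer_precedence(demos))
# edges: ('get wood', 'make plank') and ('get wood', 'build bridge')
```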
arXiv Detail & Related papers (2022-05-25T10:44:25Z)
- Learning Sensorimotor Primitives of Sequential Manipulation Tasks from Visual Demonstrations [13.864448233719598]
This paper describes a new neural network-based framework for simultaneously learning low-level and high-level policies.
A key feature of the proposed approach is that the policies are learned directly from raw videos of task demonstrations.
Empirical results on object manipulation tasks with a robotic arm show that the proposed network can efficiently learn from real visual demonstrations to perform the tasks.
arXiv Detail & Related papers (2022-03-08T01:36:48Z)
- Learning Task Decomposition with Ordered Memory Policy Network [73.3813423684999]
We propose Ordered Memory Policy Network (OMPN) to discover subtask hierarchy by learning from demonstration.
OMPN can be applied to partially observable environments and still achieve higher task decomposition performance.
Our visualization confirms that the subtask hierarchy can emerge in our model.
arXiv Detail & Related papers (2021-03-19T18:13:35Z)