Hierarchical Representations and Explicit Memory: Learning Effective
Navigation Policies on 3D Scene Graphs using Graph Neural Networks
- URL: http://arxiv.org/abs/2108.01176v1
- Date: Mon, 2 Aug 2021 21:21:27 GMT
- Title: Hierarchical Representations and Explicit Memory: Learning Effective
Navigation Policies on 3D Scene Graphs using Graph Neural Networks
- Authors: Zachary Ravichandran, Lisa Peng, Nathan Hughes, J. Daniel Griffith,
Luca Carlone
- Abstract summary: We present a reinforcement learning framework that leverages high-level hierarchical representations to learn navigation policies.
For each node in the scene graph, our method uses features that capture occupancy and semantic content, while explicitly retaining memory of the robot trajectory.
- Score: 16.19099481411921
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Representations are crucial for a robot to learn effective navigation
policies. Recent work has shown that mid-level perceptual abstractions, such as
depth estimates or 2D semantic segmentation, lead to more effective policies
when provided as observations in place of raw sensor data (e.g., RGB images).
However, such policies must still learn latent three-dimensional scene
properties from mid-level abstractions. In contrast, high-level, hierarchical
representations such as 3D scene graphs explicitly provide a scene's geometry,
topology, and semantics, making them compelling representations for navigation.
In this work, we present a reinforcement learning framework that leverages
high-level hierarchical representations to learn navigation policies. Towards
this goal, we propose a graph neural network architecture and show how to embed
a 3D scene graph into an agent-centric feature space, which enables the robot
to learn policies for low-level action in an end-to-end manner. For each node
in the scene graph, our method uses features that capture occupancy and
semantic content, while explicitly retaining memory of the robot trajectory. We
demonstrate the effectiveness of our method against commonly used visuomotor
policies in a challenging object search task. These experiments and supporting
ablation studies show that our method leads to more effective object search
behaviors, exhibits improved long-term memory, and successfully leverages
hierarchical information to guide its navigation objectives.
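A minimal sketch of the idea described in the abstract (not the authors' released code): each scene-graph node carries occupancy, semantic, and trajectory-memory ("visited") features, a graph neural network aggregates them into an agent-centric embedding, and a policy head maps that embedding to low-level actions. The node feature layout, layer sizes, pooling choice, and four-action space are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_mean_pool


class SceneGraphPolicy(nn.Module):
    """Hypothetical GNN policy over a 3D scene graph (sketch, not the paper's architecture)."""

    def __init__(self, node_feat_dim, hidden_dim=64, num_actions=4):
        super().__init__()
        # Two rounds of message passing over the scene-graph edges.
        self.conv1 = GCNConv(node_feat_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        # Policy head on the pooled, agent-centric graph embedding.
        self.policy = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
        )

    def forward(self, x, edge_index, batch):
        # x: [num_nodes, node_feat_dim], e.g. occupancy, semantic one-hot,
        #    position relative to the agent, and a visited flag (trajectory memory).
        # edge_index: [2, num_edges] scene-graph connectivity.
        # batch: node-to-graph assignment for batched scene graphs.
        h = torch.relu(self.conv1(x, edge_index))
        h = torch.relu(self.conv2(h, edge_index))
        g = global_mean_pool(h, batch)   # one embedding per scene graph
        return self.policy(g)            # logits over low-level actions
```

In an RL setting, these logits would parameterize the action distribution of a standard policy-gradient learner; the visited flag in the node features is what gives the policy an explicit, persistent memory of where the robot has already been.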
Related papers
- Aligning Knowledge Graph with Visual Perception for Object-goal Navigation [16.32780793344835]
We propose the Aligning Knowledge Graph with Visual Perception (AKGVP) method for object-goal navigation.
Our approach introduces continuous modeling of the hierarchical scene architecture and leverages visual-language pre-training to align natural language description with visual perception.
The integration of a continuous knowledge graph architecture and multimodal feature alignment empowers the navigator with a remarkable zero-shot navigation capability.
arXiv Detail & Related papers (2024-02-29T06:31:18Z) - Generalized Label-Efficient 3D Scene Parsing via Hierarchical Feature
Aligned Pre-Training and Region-Aware Fine-tuning [55.517000360348725]
This work presents a framework for 3D scene understanding when labeled scenes are quite limited.
To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy.
Experiments with both indoor and outdoor scenes demonstrated the effectiveness of our approach in both data-efficient learning and open-world few-shot learning.
arXiv Detail & Related papers (2023-12-01T15:47:04Z) - Object Goal Navigation with Recursive Implicit Maps [92.6347010295396]
We propose an implicit spatial map for object goal navigation.
Our method significantly outperforms the state of the art on the challenging MP3D dataset.
We deploy our model on a real robot and achieve encouraging object goal navigation results in real scenes.
arXiv Detail & Related papers (2023-08-10T14:21:33Z) - How To Not Train Your Dragon: Training-free Embodied Object Goal
Navigation with Semantic Frontiers [94.46825166907831]
We present a training-free solution to tackle the object goal navigation problem in Embodied AI.
Our method builds a structured scene representation based on the classic visual simultaneous localization and mapping (V-SLAM) framework.
Our method propagates semantics on the scene graphs based on language priors and scene statistics to introduce semantic knowledge to the geometric frontiers.
arXiv Detail & Related papers (2023-05-26T13:38:33Z) - 3D-Aware Object Goal Navigation via Simultaneous Exploration and
Identification [19.125633699422117]
We propose a framework for 3D-aware ObjectNav based on two straightforward sub-policies.
Our framework achieves the best performance among all modular-based methods on the Matterport3D and Gibson datasets.
arXiv Detail & Related papers (2022-12-01T07:55:56Z) - SEAL: Self-supervised Embodied Active Learning using Exploration and 3D
Consistency [122.18108118190334]
We present a framework called Self-supervised Embodied Active Learning (SEAL).
It utilizes perception models trained on internet images to learn an active exploration policy.
We build and utilize 3D semantic maps to learn both action and perception in a completely self-supervised manner.
arXiv Detail & Related papers (2021-12-02T06:26:38Z) - Spot What Matters: Learning Context Using Graph Convolutional Networks
for Weakly-Supervised Action Detection [0.0]
We introduce an architecture based on self-attention and Graph Convolutional Networks to improve human action detection in video.
Our model aids explainability by visualizing the learned context as an attention map, even for actions and objects unseen during training.
Experimental results show that our contextualized approach outperforms a baseline action detection approach by more than 2 points in Video-mAP.
arXiv Detail & Related papers (2021-07-28T21:37:18Z) - SOON: Scenario Oriented Object Navigation with Graph-based Exploration [102.74649829684617]
The ability to navigate like a human towards a language-guided target from anywhere in a 3D embodied environment is one of the 'holy grail' goals of intelligent robots.
Most visual navigation benchmarks focus on navigating toward a target from a fixed starting point, guided by an elaborate set of step-by-step instructions.
This setup deviates from real-world problems in which a human only describes what the object and its surroundings look like and asks the robot to start navigating from anywhere.
arXiv Detail & Related papers (2021-03-31T15:01:04Z) - MaAST: Map Attention with Semantic Transformers for Efficient Visual
Navigation [4.127128889779478]
This work aims to perform better than, or comparably to, existing learning-based solutions for visual navigation by autonomous agents.
We propose a method to encode vital scene semantics into a semantically informed, top-down egocentric map representation.
We conduct experiments on PointGoal visual navigation in 3D-reconstructed indoor scenes and demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2021-03-21T12:01:23Z) - Where2Act: From Pixels to Actions for Articulated 3D Objects [54.19638599501286]
We extract highly localized actionable information related to elementary actions such as pushing or pulling for articulated objects with movable parts.
We propose a learning-from-interaction framework with an online data sampling strategy that allows us to train the network in simulation.
Our learned models even transfer to real-world data.
arXiv Detail & Related papers (2021-01-07T18:56:38Z)