3D-Mem: 3D Scene Memory for Embodied Exploration and Reasoning
- URL: http://arxiv.org/abs/2411.17735v5
- Date: Fri, 04 Apr 2025 06:02:20 GMT
- Title: 3D-Mem: 3D Scene Memory for Embodied Exploration and Reasoning
- Authors: Yuncong Yang, Han Yang, Jiachen Zhou, Peihao Chen, Hongxin Zhang, Yilun Du, Chuang Gan
- Abstract summary: We propose 3D-Mem, a novel 3D scene memory framework for embodied agents. 3D-Mem employs informative multi-view images, termed Memory Snapshots, to represent the scene. It further integrates frontier-based exploration by introducing Frontier Snapshots, glimpses of unexplored areas, enabling agents to make informed decisions.
- Score: 65.40458559619303
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Constructing compact and informative 3D scene representations is essential for effective embodied exploration and reasoning, especially in complex environments over extended periods. Existing representations, such as object-centric 3D scene graphs, oversimplify spatial relationships by modeling scenes as isolated objects with restrictive textual relationships, making it difficult to address queries requiring nuanced spatial understanding. Moreover, these representations lack natural mechanisms for active exploration and memory management, hindering their application to lifelong autonomy. In this work, we propose 3D-Mem, a novel 3D scene memory framework for embodied agents. 3D-Mem employs informative multi-view images, termed Memory Snapshots, to represent the scene and capture rich visual information of explored regions. It further integrates frontier-based exploration by introducing Frontier Snapshots, glimpses of unexplored areas, enabling agents to make informed decisions by considering both known and potential new information. To support lifelong memory in active exploration settings, we present an incremental construction pipeline for 3D-Mem, as well as a memory retrieval technique for memory management. Experimental results on three benchmarks demonstrate that 3D-Mem significantly enhances agents' exploration and reasoning capabilities in 3D environments, highlighting its potential for advancing applications in embodied AI.
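As a concrete illustration of the memory design described in the abstract, the following is a minimal Python sketch of a snapshot-based scene memory with incremental construction and retrieval. All names (MemorySnapshot, FrontierSnapshot, SceneMemory, add_observation, retrieve) and the novelty and ranking rules are assumptions for illustration, not the authors' implementation.

```python
# Minimal, hypothetical sketch of a snapshot-based scene memory; names and
# heuristics are illustrative assumptions, not the 3D-Mem codebase.
from dataclasses import dataclass, field
import numpy as np


@dataclass
class MemorySnapshot:
    """A multi-view image of an explored region plus minimal metadata."""
    image: np.ndarray            # H x W x 3 RGB frame
    camera_pose: np.ndarray      # 4 x 4 world-from-camera transform
    object_ids: set = field(default_factory=set)  # objects visible in this view


@dataclass
class FrontierSnapshot:
    """A glimpse toward an unexplored area (a frontier of the known map)."""
    image: np.ndarray
    frontier_center: np.ndarray  # 3D point on the explored/unexplored boundary


class SceneMemory:
    """Keeps explored-region snapshots and current frontiers for decision making."""

    def __init__(self, max_snapshots: int = 200):
        self.memory: list[MemorySnapshot] = []
        self.frontiers: list[FrontierSnapshot] = []
        self.max_snapshots = max_snapshots

    def add_observation(self, snap: MemorySnapshot) -> None:
        # Incremental construction: keep a snapshot only if it shows objects
        # not already covered by existing memory (a crude novelty test).
        covered = set().union(*(m.object_ids for m in self.memory)) if self.memory else set()
        if snap.object_ids - covered:
            self.memory.append(snap)
        if len(self.memory) > self.max_snapshots:
            self.memory.pop(0)  # simplest possible memory management

    def retrieve(self, relevant_ids: set, k: int = 5) -> list[MemorySnapshot]:
        # Memory retrieval: rank snapshots by overlap with the objects a query mentions.
        ranked = sorted(self.memory, key=lambda m: len(m.object_ids & relevant_ids), reverse=True)
        return ranked[:k]
```

At decision time, the retrieved Memory Snapshots and the current Frontier Snapshots could both be passed to a vision-language model, which either answers the query from memory or selects a frontier to explore next.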
Related papers
- DGOcc: Depth-aware Global Query-based Network for Monocular 3D Occupancy Prediction [17.38916914453357]
Predicting the 3D occupancy of large-scale outdoor scenes from 2D images is ill-posed and resource-intensive.
We present DGOcc, a depth-aware global query-based network for monocular 3D occupancy prediction.
The proposed method achieves the best performance on monocular semantic occupancy prediction while reducing GPU and time overhead.
arXiv Detail & Related papers (2025-04-10T07:44:55Z)
- Learning 3D Scene Analogies with Neural Contextual Scene Maps [17.545689536966265]
We propose teaching machines to identify relational commonalities in 3D spaces.
Instead of focusing on point-wise or object-wise representations, we introduce 3D scene analogies.
arXiv Detail & Related papers (2025-03-20T06:49:33Z)
- FunGraph: Functionality Aware 3D Scene Graphs for Language-Prompted Scene Interaction [1.8124328823188356]
We detect and store objects at a finer resolution, focusing on affordance-relevant parts.
We leverage currently available 3D resources to generate 2D data and train a detector, which is then used to augment the standard 3D scene graph generation pipeline.
arXiv Detail & Related papers (2025-03-10T23:13:35Z)
- 3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark [17.94511890272007]
3D spatial reasoning is the ability to analyze and interpret the positions, orientations, and spatial relationships of objects within the 3D space.
Large multi-modal models (LMMs) have achieved remarkable progress in a wide range of image and video understanding tasks.
We present the first comprehensive 3D spatial reasoning benchmark, 3DSRBench, with 2,772 manually annotated visual question-answer pairs.
arXiv Detail & Related papers (2024-12-10T18:55:23Z)
- LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences [70.0873383646651]
LSceneLLM is an adaptive framework that automatically identifies task-relevant areas.
A dense token selector examines the attention map of LLM to identify visual preferences for the instruction input.
An adaptive self-attention module is leveraged to fuse the coarse-grained and selected fine-grained visual information.
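The token-selection step can be illustrated with a short, hypothetical sketch that keeps the visual tokens receiving the highest average attention from the instruction tokens; the top-k rule and function names are assumptions, not LSceneLLM's actual selector.

```python
# Hypothetical attention-guided token selection (top-k rule is an assumption).
import torch

def select_visual_tokens(attn: torch.Tensor, visual_tokens: torch.Tensor, k: int = 64) -> torch.Tensor:
    """
    attn:          (num_text_tokens, num_visual_tokens) attention weights from the LLM
    visual_tokens: (num_visual_tokens, dim) fine-grained visual features
    Returns the k visual tokens the instruction attends to most.
    """
    scores = attn.mean(dim=0)                             # average attention per visual token
    top = torch.topk(scores, k=min(k, scores.numel())).indices
    return visual_tokens[top]                             # selected fine-grained tokens
```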
arXiv Detail & Related papers (2024-12-02T09:07:57Z)
- SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z)
- Volumetric Environment Representation for Vision-Language Navigation [66.04379819772764]
Vision-language navigation (VLN) requires an agent to navigate through a 3D environment based on visual observations and natural language instructions.
We introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells.
VER predicts 3D occupancy, 3D room layout, and 3D bounding boxes jointly.
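A rough sketch of voxelizing observed 3D points into structured cells is given below; the grid resolution and mean-pooling rule are assumptions for illustration rather than VER's actual design.

```python
# Pool 3D points (with features) into a fixed voxel grid by mean aggregation.
# Resolution and pooling choice are assumptions for the sketch.
import numpy as np

def voxelize(points: np.ndarray, feats: np.ndarray, bounds: tuple, resolution: int = 32) -> np.ndarray:
    """points: (N, 3) world coordinates; feats: (N, C); bounds: (min_xyz, max_xyz)."""
    lo, hi = np.asarray(bounds[0]), np.asarray(bounds[1])
    idx = ((points - lo) / (hi - lo) * resolution).astype(int)
    idx = np.clip(idx, 0, resolution - 1)
    grid = np.zeros((resolution, resolution, resolution, feats.shape[1]))
    count = np.zeros((resolution, resolution, resolution, 1))
    for (x, y, z), f in zip(idx, feats):
        grid[x, y, z] += f
        count[x, y, z] += 1
    return grid / np.maximum(count, 1)  # mean feature per occupied cell
```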
arXiv Detail & Related papers (2024-03-21T06:14:46Z)
- HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting [53.6394928681237]
Holistic understanding of urban scenes based on RGB images is a challenging yet important problem.
Our main idea involves the joint optimization of geometry, appearance, semantics, and motion using a combination of static and dynamic 3D Gaussians.
Our approach offers the ability to render new viewpoints in real-time, yielding 2D and 3D semantic information with high accuracy.
arXiv Detail & Related papers (2024-03-19T13:39:05Z)
- MemoNav: Working Memory Model for Visual Navigation [47.011190883888446]
Image-goal navigation is a challenging task that requires an agent to navigate to a goal indicated by an image in unfamiliar environments.
Existing methods utilizing diverse scene memories suffer from inefficient exploration since they use all historical observations for decision-making.
We present MemoNav, a novel memory model for image-goal navigation, which utilizes a working memory-inspired pipeline to improve navigation performance.
arXiv Detail & Related papers (2024-02-29T13:45:13Z)
- WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language [31.691159120136064]
We introduce the task of 3D visual grounding in large-scale dynamic scenes based on natural linguistic descriptions and online captured multi-modal visual data.
We present a novel method, dubbed WildRefer, for this task that fully utilizes the rich appearance information in images and the positional and geometric cues in point clouds.
Our datasets are significant for research on 3D visual grounding in the wild and have great potential to boost the development of autonomous driving and service robots.
arXiv Detail & Related papers (2023-04-12T06:48:26Z)
- Evaluating Long-Term Memory in 3D Mazes [10.224858246626171]
Memory Maze is a 3D domain of randomized mazes designed for evaluating long-term memory in agents.
Unlike existing benchmarks, Memory Maze measures long-term memory separately from confounding agent abilities.
We find that current algorithms benefit from training with truncated backpropagation through time and succeed on small mazes, but fall short of human performance on the large mazes.
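Truncated backpropagation through time, mentioned above, can be sketched with a generic recurrent training loop; this is an illustrative example, not code from the Memory Maze benchmark.

```python
# Generic sketch of truncated backpropagation through time (TBPTT) for a
# recurrent policy; not taken from the Memory Maze codebase.
import torch
import torch.nn as nn

rnn = nn.GRU(input_size=64, hidden_size=128, batch_first=True)
head = nn.Linear(128, 4)                      # e.g. 4 discrete actions
optim = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=1e-3)

def tbptt_update(obs_seq: torch.Tensor, target_seq: torch.Tensor, hidden=None, chunk: int = 50):
    """obs_seq: (B, T, 64); target_seq: (B, T) action targets. Gradients flow only within each chunk."""
    for t in range(0, obs_seq.size(1), chunk):
        out, hidden = rnn(obs_seq[:, t:t + chunk], hidden)
        loss = nn.functional.cross_entropy(head(out).flatten(0, 1), target_seq[:, t:t + chunk].flatten())
        optim.zero_grad()
        loss.backward()
        optim.step()
        hidden = hidden.detach()              # truncate: stop gradients across chunk boundaries
    return hidden
```

Detaching the hidden state at each chunk boundary is what makes the backpropagation "truncated": the recurrent state persists forward, but gradients do not flow back past the chunk.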
arXiv Detail & Related papers (2022-10-24T16:32:28Z)
- AGO-Net: Association-Guided 3D Point Cloud Object Detection Network [86.10213302724085]
We propose a novel 3D detection framework that associates intact features for objects via domain adaptation.
We achieve new state-of-the-art performance on the KITTI 3D detection benchmark in both accuracy and speed.
arXiv Detail & Related papers (2022-08-24T16:54:38Z)
- HyperDet3D: Learning a Scene-conditioned 3D Object Detector [154.84798451437032]
We propose HyperDet3D to explore scene-conditioned prior knowledge for 3D object detection.
Our HyperDet3D achieves state-of-the-art results on the 3D object detection benchmark of the ScanNet and SUN RGB-D datasets.
arXiv Detail & Related papers (2022-04-12T07:57:58Z)
- Hierarchical Representations and Explicit Memory: Learning Effective Navigation Policies on 3D Scene Graphs using Graph Neural Networks [16.19099481411921]
We present a reinforcement learning framework that leverages high-level hierarchical representations to learn navigation policies.
For each node in the scene graph, our method uses features that capture occupancy and semantic content, while explicitly retaining memory of the robot trajectory.
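A hypothetical sketch of such per-node features, combining occupancy, a semantic label, and a visited flag for trajectory memory, is shown below; the feature layout is an assumption, not the paper's definition.

```python
# Hypothetical per-node features for a navigation GNN: occupancy fraction,
# a one-hot semantic label, and a "visited" flag encoding trajectory memory.
import numpy as np

def node_feature(occupancy_fraction: float, semantic_label: int, visited: bool, num_classes: int = 20) -> np.ndarray:
    semantic = np.zeros(num_classes)
    semantic[semantic_label] = 1.0
    return np.concatenate([[occupancy_fraction], semantic, [float(visited)]])
```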
arXiv Detail & Related papers (2021-08-02T21:21:27Z)
- Structured Scene Memory for Vision-Language Navigation [155.63025602722712]
We propose a structured scene memory architecture for vision-language navigation (VLN).
It is compartmentalized enough to accurately memorize the percepts during navigation.
It also serves as a structured scene representation, which captures and disentangles visual and geometric cues in the environment.
arXiv Detail & Related papers (2021-03-05T03:41:00Z)
- End-to-End Egospheric Spatial Memory [32.42361470456194]
We propose a parameter-free module, Egospheric Spatial Memory (ESM), which encodes the memory in an ego-sphere around the agent.
ESM can be trained end-to-end via either imitation or reinforcement learning.
We show applications to semantic segmentation on the ScanNet dataset, where ESM naturally combines image-level and map-level inference modalities.
arXiv Detail & Related papers (2021-02-15T18:59:07Z)
- Occupancy Anticipation for Efficient Exploration and Navigation [97.17517060585875]
We propose occupancy anticipation, where the agent uses its egocentric RGB-D observations to infer the occupancy state beyond the visible regions.
By exploiting context in both the egocentric views and top-down maps our model successfully anticipates a broader map of the environment.
Our approach is the winning entry in the 2020 Habitat PointNav Challenge.
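A toy version of the anticipation idea, predicting occupancy beyond the visible region from a partial egocentric map, might look like the following; the small convolutional architecture is a placeholder, not the authors' model.

```python
# Toy occupancy-anticipation head: takes an egocentric, partially observed map
# and predicts a completed map extending beyond visibility. Placeholder network.
import torch
import torch.nn as nn

class OccupancyAnticipator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),   # input channels: [occupied, explored]
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, 3, padding=1),              # logits for [occupied, free] beyond view
        )

    def forward(self, ego_map: torch.Tensor) -> torch.Tensor:
        """ego_map: (B, 2, H, W) partial map -> (B, 2, H, W) anticipated map logits."""
        return self.net(ego_map)
```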
arXiv Detail & Related papers (2020-08-21T03:16:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.