SpatialMem: Unified 3D Memory with Metric Anchoring and Fast Retrieval
- URL: http://arxiv.org/abs/2601.14895v1
- Date: Wed, 21 Jan 2026 11:32:24 GMT
- Title: SpatialMem: Unified 3D Memory with Metric Anchoring and Fast Retrieval
- Authors: Xinyi Zheng, Yunze Liu, Chi-Hao Wu, Fan Zhang, Hao Zheng, Wenqi Zhou, Walterio W. Mayol-Cuevas, Junxiao Shen
- Abstract summary: SpatialMem is a memory-centric system that unifies 3D geometry, semantics, and language into a single representation. It reconstructs metrically scaled indoor environments, detects structural 3D anchors, and populates a hierarchical memory with open-vocabulary object nodes. It supports downstream tasks such as language-guided navigation and object retrieval without specialized sensors.
- Score: 19.68937683078205
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present SpatialMem, a memory-centric system that unifies 3D geometry, semantics, and language into a single, queryable representation. Starting from casually captured egocentric RGB video, SpatialMem reconstructs metrically scaled indoor environments, detects structural 3D anchors (walls, doors, windows) as the first-layer scaffold, and populates a hierarchical memory with open-vocabulary object nodes -- linking evidence patches, visual embeddings, and two-layer textual descriptions to 3D coordinates -- for compact storage and fast retrieval. This design enables interpretable reasoning over spatial relations (e.g., distance, direction, visibility) and supports downstream tasks such as language-guided navigation and object retrieval without specialized sensors. Experiments across three real-life indoor scenes demonstrate that SpatialMem maintains strong anchor-description-level navigation completion and hierarchical retrieval accuracy under increasing clutter and occlusion, offering an efficient and extensible framework for embodied spatial intelligence.
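The memory layout described in the abstract (structural anchors as the first layer, object nodes attached to them, each node linking a visual embedding and two levels of text to metric 3D coordinates) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class names, field names, toy embeddings, and the flat cosine-similarity search are all assumptions made for illustration only.

```python
import math
from dataclasses import dataclass, field

Vec3 = tuple[float, float, float]  # metric coordinates, assumed to be in metres


@dataclass
class ObjectNode:
    """Hypothetical second-layer node: one open-vocabulary object."""
    name: str
    position: Vec3
    embedding: list[float]   # stand-in for a visual embedding
    short_desc: str          # first text layer (brief label)
    long_desc: str           # second text layer (detailed description)


@dataclass
class Anchor:
    """Hypothetical first-layer node: a wall, door, or window."""
    kind: str
    position: Vec3
    objects: list[ObjectNode] = field(default_factory=list)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def distance(p: Vec3, q: Vec3) -> float:
    """Euclidean distance between two 3D points."""
    return math.dist(p, q)


def retrieve(anchors: list[Anchor], query: list[float]):
    """Return (score, anchor, object) for the best-matching object node."""
    best = None
    for anchor in anchors:
        for obj in anchor.objects:
            score = cosine(obj.embedding, query)
            if best is None or score > best[0]:
                best = (score, anchor, obj)
    return best


# Toy scene: two anchors, one object attached to each.
door = Anchor("door", (0.0, 0.0, 0.0))
window = Anchor("window", (4.0, 0.0, 1.0))
door.objects.append(ObjectNode("mug", (1.0, 0.0, 0.0), [1.0, 0.0],
                               "mug", "white mug on the shelf by the door"))
window.objects.append(ObjectNode("lamp", (4.0, 3.0, 1.0), [0.0, 1.0],
                                 "lamp", "floor lamp near the window"))
anchors = [door, window]

score, anchor, obj = retrieve(anchors, [0.9, 0.1])
print(obj.name, "near", anchor.kind,
      "at", distance(obj.position, anchor.position), "m")
```

Because every object node carries metric coordinates and a parent anchor, a query result can be reported relative to scene structure ("the mug, 1 m from the door"), which is the kind of interpretable spatial relation the abstract refers to.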
Related papers
- Task-Aware 3D Affordance Segmentation via 2D Guidance and Geometric Refinement [12.260126771415019]
We introduce Task-Aware 3D Scene-level Affordance segmentation (TASA). TASA is a novel geometry-optimized framework that jointly leverages 2D semantic cues and 3D geometric reasoning in a coarse-to-fine manner. To fully exploit 3D geometric information, a 3D affordance refinement module is proposed to integrate 2D semantic priors with local 3D geometry.
arXiv Detail & Related papers (2025-11-12T13:36:37Z)
- EAGLE: Episodic Appearance- and Geometry-aware Memory for Unified 2D-3D Visual Query Localization in Egocentric Vision [10.358197274014584]
We present a novel framework that leverages episodic appearance- and geometry-aware memory to achieve unified 2D-3D visual query localization in egocentric vision. Our method achieves state-of-the-art performance on the Ego4D-VQ benchmark.
arXiv Detail & Related papers (2025-11-11T09:11:21Z)
- Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos [69.21508595833623]
Ov3R is a framework for semantic 3D reconstruction from RGB video streams. CLIP3R predicts dense point maps from overlapping clips while embedding object-level semantics. 2D-3D OVS lifts 2D features into 3D by learning fused descriptors integrating spatial, geometric, and semantic cues.
arXiv Detail & Related papers (2025-07-29T17:55:58Z)
- SURPRISE3D: A Dataset for Spatial Understanding and Reasoning in Complex 3D Scenes [105.8644620467576]
We introduce Surprise3D, a novel dataset designed to evaluate language-guided spatial reasoning segmentation in complex 3D scenes. Surprise3D consists of more than 200k vision-language pairs across 900+ detailed indoor scenes from ScanNet++ v2. The dataset contains 89k+ human-annotated spatial queries deliberately crafted without object names.
arXiv Detail & Related papers (2025-07-10T14:01:24Z)
- A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding [78.99798110890157]
Open-vocabulary 3D visual grounding aims to localize target objects based on free-form language queries. Existing language field methods struggle to accurately localize instances using spatial relations in language queries. We propose SpatialReasoner, a novel neural representation-based framework with large language model (LLM)-driven spatial reasoning.
arXiv Detail & Related papers (2025-07-09T10:20:38Z)
- Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation [54.04601077224252]
Embodied scene understanding requires not only comprehending visual-spatial information but also determining where to explore next in the 3D physical world. 3D vision-language learning enables embodied agents to effectively explore and understand their environment. The model's versatility enables navigation using diverse input modalities, including categories, language descriptions, and reference images.
arXiv Detail & Related papers (2025-07-05T14:15:52Z)
- Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory [72.75478398447396]
We propose Point3R, an online framework targeting dense streaming 3D reconstruction. To be specific, we maintain an explicit spatial pointer memory directly associated with the 3D structure of the current scene. Our method achieves competitive or state-of-the-art performance on various tasks with low training costs.
arXiv Detail & Related papers (2025-07-03T17:59:56Z)
- RAZER: Robust Accelerated Zero-Shot 3D Open-Vocabulary Panoptic Reconstruction with Spatio-Temporal Aggregation [10.067978300536486]
We develop a zero-shot framework that seamlessly integrates GPU-accelerated geometric reconstruction with open-vocabulary vision-language models. Our training-free system achieves superior performance through incremental processing and unified geometric-semantic updates.
arXiv Detail & Related papers (2025-05-21T11:07:25Z)
- 3D-Mem: 3D Scene Memory for Embodied Exploration and Reasoning [65.40458559619303]
We propose 3D-Mem, a novel 3D scene memory framework for embodied agents. 3D-Mem employs informative multi-view images, termed Memory Snapshots, to represent the scene. It further integrates frontier-based exploration by introducing Frontier Snapshots (glimpses of unexplored areas), enabling agents to make informed decisions.
arXiv Detail & Related papers (2024-11-23T09:57:43Z)