Related papers: SpatialMem: Unified 3D Memory with Metric Anchoring and Fast Retrieval

URL: http://arxiv.org/abs/2601.14895v1
Date: Wed, 21 Jan 2026 11:32:24 GMT
Title: SpatialMem: Unified 3D Memory with Metric Anchoring and Fast Retrieval
Authors: Xinyi Zheng, Yunze Liu, Chi-Hao Wu, Fan Zhang, Hao Zheng, Wenqi Zhou, Walterio W. Mayol-Cuevas, Junxiao Shen
Abstract summary: SpatialMem is a memory-centric system that unifies 3D geometry, semantics, and language into a single representation. It reconstructs metrically scaled indoor environments, detects structural 3D anchors, and populates a hierarchical memory with open-vocabulary object nodes. It supports downstream tasks such as language-guided navigation and object retrieval without specialized sensors.
Score: 19.68937683078205
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: We present SpatialMem, a memory-centric system that unifies 3D geometry, semantics, and language into a single, queryable representation. Starting from casually captured egocentric RGB video, SpatialMem reconstructs metrically scaled indoor environments, detects structural 3D anchors (walls, doors, windows) as the first-layer scaffold, and populates a hierarchical memory with open-vocabulary object nodes -- linking evidence patches, visual embeddings, and two-layer textual descriptions to 3D coordinates -- for compact storage and fast retrieval. This design enables interpretable reasoning over spatial relations (e.g., distance, direction, visibility) and supports downstream tasks such as language-guided navigation and object retrieval without specialized sensors. Experiments across three real-life indoor scenes demonstrate that SpatialMem maintains strong anchor-description-level navigation completion and hierarchical retrieval accuracy under increasing clutter and occlusion, offering an efficient and extensible framework for embodied spatial intelligence.
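
To make the two-layer layout in the abstract concrete, here is a minimal Python sketch of an anchor-then-object memory with a distance-based lookup. It is an illustration only, not the authors' implementation: the class names `Anchor` and `ObjectNode`, the helper `nearest_object`, and the use of a single 3D point per node are all assumptions introduced for this example.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    """Hypothetical second-layer node: an object tied to a metric 3D position."""
    label: str
    position: tuple[float, float, float]  # assumed to be in metres
    description: str = ""


@dataclass
class Anchor:
    """Hypothetical first-layer node: a structural anchor (wall, door, window)."""
    kind: str
    position: tuple[float, float, float]
    objects: list[ObjectNode] = field(default_factory=list)


def nearest_object(anchors, label, query_pos):
    """Return (distance, anchor kind, node) for the closest object with the
    given label, or None if no stored object carries that label."""
    candidates = [
        (math.dist(obj.position, query_pos), anchor.kind, obj)
        for anchor in anchors
        for obj in anchor.objects
        if obj.label == label
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda c: c[0])


# Usage: one door anchor holding two mugs at different distances.
door = Anchor("door", (0.0, 0.0, 0.0))
door.objects.append(ObjectNode("mug", (3.0, 4.0, 0.0), "red mug on the shelf"))
door.objects.append(ObjectNode("mug", (1.0, 0.0, 0.0), "white mug by the door"))
result = nearest_object([door], "mug", (0.0, 0.0, 0.0))
```

Because every node carries metric coordinates, relations such as distance reduce to plain geometry on the stored positions; the paper's actual memory additionally links evidence patches, visual embeddings, and textual descriptions, which this sketch omits.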

Related papers

Task-Aware 3D Affordance Segmentation via 2D Guidance and Geometric Refinement [12.260126771415019]
We introduce Task-Aware 3D Scene-level Affordance segmentation (TASA). TASA is a novel geometry-optimized framework that jointly leverages 2D semantic cues and 3D geometric reasoning in a coarse-to-fine manner. To fully exploit 3D geometric information, a 3D affordance refinement module is proposed to integrate 2D semantic priors with local 3D geometry.
arXiv Detail & Related papers (2025-11-12T13:36:37Z)
EAGLE: Episodic Appearance- and Geometry-aware Memory for Unified 2D-3D Visual Query Localization in Egocentric Vision [10.358197274014584]
We present a novel framework that leverages episodic appearance- and geometry-aware memory to achieve unified 2D-3D visual query localization in egocentric vision. Our method achieves state-of-the-art performance on the Ego4D-VQ benchmark.
arXiv Detail & Related papers (2025-11-11T09:11:21Z)
Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos [69.21508595833623]
Ov3R is a framework for semantic 3D reconstruction from RGB video streams. CLIP3R predicts dense point maps from overlapping clips while embedding object-level semantics. 2D-3D OVS lifts 2D features into 3D by learning fused descriptors integrating spatial, geometric, and semantic cues.
arXiv Detail & Related papers (2025-07-29T17:55:58Z)
SURPRISE3D: A Dataset for Spatial Understanding and Reasoning in Complex 3D Scenes [105.8644620467576]
We introduce Surprise3D, a novel dataset designed to evaluate language-guided spatial reasoning segmentation in complex 3D scenes. Surprise3D consists of more than 200k vision-language pairs across 900+ detailed indoor scenes from ScanNet++ v2. The dataset contains 89k+ human-annotated spatial queries deliberately crafted without object names.
arXiv Detail & Related papers (2025-07-10T14:01:24Z)
A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding [78.99798110890157]
Open-vocabulary 3D visual grounding aims to localize target objects based on free-form language queries. Existing language field methods struggle to accurately localize instances using spatial relations in language queries. We propose SpatialReasoner, a novel neural representation-based framework with large language model (LLM)-driven spatial reasoning.
arXiv Detail & Related papers (2025-07-09T10:20:38Z)
Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation [54.04601077224252]
Embodied scene understanding requires not only comprehending visual-spatial information but also determining where to explore next in the 3D physical world. 3D vision-language learning enables embodied agents to effectively explore and understand their environment. The model's versatility enables navigation using diverse input modalities, including categories, language descriptions, and reference images.
arXiv Detail & Related papers (2025-07-05T14:15:52Z)
Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory [72.75478398447396]
We propose Point3R, an online framework targeting dense streaming 3D reconstruction. To be specific, we maintain an explicit spatial pointer memory directly associated with the 3D structure of the current scene. Our method achieves competitive or state-of-the-art performance on various tasks with low training costs.
arXiv Detail & Related papers (2025-07-03T17:59:56Z)
RAZER: Robust Accelerated Zero-Shot 3D Open-Vocabulary Panoptic Reconstruction with Spatio-Temporal Aggregation [10.067978300536486]
We develop a zero-shot framework that seamlessly integrates GPU-accelerated geometric reconstruction with open-vocabulary vision-language models. Our training-free system achieves superior performance through incremental processing and unified geometric-semantic updates.
arXiv Detail & Related papers (2025-05-21T11:07:25Z)
3D-Mem: 3D Scene Memory for Embodied Exploration and Reasoning [65.40458559619303]
We propose 3D-Mem, a novel 3D scene memory framework for embodied agents. 3D-Mem employs informative multi-view images, termed Memory Snapshots, to represent the scene. It further integrates frontier-based exploration by introducing Frontier Snapshots (glimpses of unexplored areas), enabling agents to make informed decisions.
arXiv Detail & Related papers (2024-11-23T09:57:43Z)

This list is automatically generated from the titles and abstracts of the papers on this site.