Related papers: Map2Thought: Explicit 3D Spatial Reasoning via Metric Cognitive Maps

URL: http://arxiv.org/abs/2601.11442v1
Date: Fri, 16 Jan 2026 17:02:46 GMT
Title: Map2Thought: Explicit 3D Spatial Reasoning via Metric Cognitive Maps
Authors: Xiangjun Gao, Zhensong Zhang, Dave Zhenyu Chen, Songcen Xu, Long Quan, Eduardo Pérez-Pellitero, Youngkyoon Jang
Abstract summary: Map2Thought is a framework that enables explicit and interpretable spatial reasoning for 3D VLMs. Metric Cognitive Map (Metric-CogMap) and Cognitive Chain-of-Thought (Cog-CoT) are key components of the framework. We show that Map2Thought enables explainable 3D understanding, achieving 59.9% accuracy using only half the supervision.
Score: 35.51348819617679
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose Map2Thought, a framework that enables explicit and interpretable spatial reasoning for 3D VLMs. The framework is grounded in two key components: Metric Cognitive Map (Metric-CogMap) and Cognitive Chain-of-Thought (Cog-CoT). Metric-CogMap provides a unified spatial representation by integrating a discrete grid for relational reasoning with a continuous, metric-scale representation for precise geometric understanding. Building upon the Metric-CogMap, Cog-CoT performs explicit geometric reasoning through deterministic operations, including vector operations, bounding-box distances, and occlusion-aware appearance order cues, producing interpretable inference traces grounded in 3D structure. Experimental results show that Map2Thought enables explainable 3D understanding, achieving 59.9% accuracy using only half the supervision, closely matching the 60.9% baseline trained with the full dataset. It consistently outperforms state-of-the-art methods by 5.3%, 4.8%, and 4.0% under 10%, 25%, and 50% training subsets, respectively, on VSI-Bench.
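To make the phrase "bounding-box distances" concrete, here is a minimal sketch of one deterministic geometric operation of that kind: the minimum Euclidean distance between two axis-aligned 3D boxes. It is an illustration only and is not taken from the paper; the function name `aabb_distance`, the min/max corner convention, and the choice of axis-aligned boxes are all assumptions.

```python
import math


def aabb_distance(min_a, max_a, min_b, max_b):
    """Minimum Euclidean distance between two axis-aligned 3D boxes.

    Hypothetical helper for illustration; each box is given by its
    minimum and maximum corner as (x, y, z). Returns 0.0 when the
    boxes touch or overlap.
    """
    gaps = []
    for lo_a, hi_a, lo_b, hi_b in zip(min_a, max_a, min_b, max_b):
        # Per-axis separation: positive only if the boxes are disjoint
        # along this axis, otherwise clamped to zero.
        gaps.append(max(0.0, lo_a - hi_b, lo_b - hi_a))
    return math.sqrt(sum(g * g for g in gaps))


# Two unit boxes separated by 3 m along x and 4 m along y.
d = aabb_distance((0, 0, 0), (1, 1, 1), (4, 5, 0), (5, 6, 1))
```

Because the computation is closed-form, the intermediate per-axis gaps can be written out as a step in a reasoning trace, which is the sense in which such operations are interpretable.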

Related papers

Volumetric Semantically Consistent 3D Panoptic Mapping [77.13446499924977]
We introduce an online 2D-to-3D semantic instance mapping algorithm aimed at generating semantic 3D maps suitable for autonomous agents in unstructured environments. It introduces novel ways of integrating semantic prediction confidence during mapping, producing semantic and instance-consistent 3D regions. The proposed method achieves accuracy superior to the state of the art on public large-scale datasets, improving on a number of widely used metrics.
arXiv Detail & Related papers (2023-09-26T08:03:10Z)
Neural Semantic Surface Maps [52.61017226479506]
We present an automated technique for computing a map between two genus-zero shapes, which matches semantically corresponding regions to one another. Our approach can generate semantic surface-to-surface maps, eliminating manual annotations or any 3D training data requirement.
arXiv Detail & Related papers (2023-09-09T16:21:56Z)
PointOcc: Cylindrical Tri-Perspective View for Point-based 3D Semantic Occupancy Prediction [72.75478398447396]
We propose a cylindrical tri-perspective view to represent point clouds effectively and comprehensively. Considering the distance distribution of LiDAR point clouds, we construct the tri-perspective view in the cylindrical coordinate system. We employ spatial group pooling to maintain structural details during projection and adopt 2D backbones to efficiently process each TPV plane.
arXiv Detail & Related papers (2023-08-31T17:57:17Z)
SATR: Zero-Shot Semantic Segmentation of 3D Shapes [74.08209893396271]
We explore the task of zero-shot semantic segmentation of 3D shapes by using large-scale off-the-shelf 2D image recognition models. We develop the Segmentation Assignment with Topological Reweighting (SATR) algorithm and evaluate it on ShapeNetPart and our proposed FAUST benchmarks. SATR achieves state-of-the-art performance and outperforms a baseline algorithm by 1.3% and 4% average mIoU.
arXiv Detail & Related papers (2023-04-11T00:43:16Z)
Contour Context: Abstract Structural Distribution for 3D LiDAR Loop Detection and Metric Pose Estimation [31.968749056155467]
This paper proposes a simple, effective, and efficient topological loop closure detection pipeline with accurate 3-DoF metric pose estimation. We interpret the Cartesian bird's-eye view (BEV) image projected from 3D LiDAR points as a layered distribution of structures. A retrieval key is designed to accelerate the search of a database indexed by layered KD-trees.
arXiv Detail & Related papers (2023-02-13T07:18:24Z)
SketchSampler: Sketch-based 3D Reconstruction via View-dependent Depth Sampling [75.957103837167]
Reconstructing a 3D shape based on a single sketch image is challenging due to the large domain gap between a sparse, irregular sketch and a regular, dense 3D shape. Existing works try to employ the global feature extracted from the sketch to directly predict the 3D coordinates, but they usually lose fine details, yielding results that are not faithful to the input sketch.
arXiv Detail & Related papers (2022-08-14T16:37:51Z)
BoxGraph: Semantic Place Recognition and Pose Estimation from 3D LiDAR [22.553026961366005]
We model 3D point clouds as fully-connected graphs of semantically identified components. Optimal association across graphs allows for full 6-Degree-of-Freedom (DoF) pose estimation and place recognition. This representation is very concise, condensing the size of maps by a factor of 25 against the state-of-the-art.
arXiv Detail & Related papers (2022-06-30T09:39:08Z)
Improving Lidar-Based Semantic Segmentation of Top-View Grid Maps by Learning Features in Complementary Representations [3.0413873719021995]
We introduce a novel way to predict semantic information from sparse, single-shot LiDAR measurements in the context of autonomous driving. The approach is aimed specifically at improving the semantic segmentation of top-view grid maps. For each representation a tailored deep learning architecture is developed to effectively extract semantic information.
arXiv Detail & Related papers (2022-03-02T14:49:51Z)
UACANet: Uncertainty Augmented Context Attention for Polyp Segmentation [12.089183640843416]
We construct a modified U-Net-shaped network with an additional encoder and decoder. In each prediction module, the previously predicted saliency map is used to compute foreground, background, and uncertain area maps. We achieve 76.6% mean Dice on the ETIS dataset, a 13.8% improvement over the previous state-of-the-art method.
arXiv Detail & Related papers (2021-07-06T03:11:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.