Pursuing Minimal Sufficiency in Spatial Reasoning
- URL: http://arxiv.org/abs/2510.16688v1
- Date: Sun, 19 Oct 2025 02:29:09 GMT
- Title: Pursuing Minimal Sufficiency in Spatial Reasoning
- Authors: Yejie Guo, Yunzhong Hou, Wufei Ma, Meng Tang, Ming-Hsuan Yang
- Abstract summary: Spatial reasoning, the ability to ground language in 3D understanding, remains a persistent challenge for Vision-Language Models (VLMs). We identify two fundamental bottlenecks: inadequate 3D understanding capabilities stemming from 2D-centric pre-training, and reasoning failures induced by redundant 3D information. We introduce MSSR (Minimal Sufficient Spatial Reasoner), a dual-agent framework that constructs a Minimal Sufficient Set (MSS) of information before answering a question.
- Score: 42.564463357503875
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Spatial reasoning, the ability to ground language in 3D understanding, remains a persistent challenge for Vision-Language Models (VLMs). We identify two fundamental bottlenecks: inadequate 3D understanding capabilities stemming from 2D-centric pre-training, and reasoning failures induced by redundant 3D information. To address these, we first construct a Minimal Sufficient Set (MSS) of information before answering a given question: a compact selection of 3D perception results from expert models. We introduce MSSR (Minimal Sufficient Spatial Reasoner), a dual-agent framework that implements this principle. A Perception Agent programmatically queries 3D scenes using a versatile perception toolbox to extract sufficient information, including a novel SOG (Situated Orientation Grounding) module that robustly extracts language-grounded directions. A Reasoning Agent then iteratively refines this information to pursue minimality, pruning redundant details and requesting missing ones in a closed loop until the MSS is curated. Extensive experiments demonstrate that our method, by explicitly pursuing both sufficiency and minimality, significantly improves accuracy and achieves state-of-the-art performance across two challenging benchmarks. Furthermore, our framework produces interpretable reasoning paths, offering a promising source of high-quality training data for future models. Source code is available at https://github.com/gyj155/mssr.
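
The closed loop described in the abstract (extract, prune redundant details, request missing ones, repeat) can be illustrated with a toy sketch. This is not the authors' implementation: `perceive`, `review`, `curate_mss`, and `Feedback` are hypothetical names, the scene is a plain dictionary rather than a 3D perception toolbox, and the `needed` set is an oracle standing in for the judgment that a Reasoning Agent would make with a model.

```python
from dataclasses import dataclass, field


@dataclass
class Feedback:
    """Hypothetical verdict returned by the reasoning step."""
    prune: set = field(default_factory=set)    # facts judged redundant
    request: set = field(default_factory=set)  # facts judged missing


def perceive(scene: dict, keys: set) -> dict:
    """Stand-in for a perception agent: look up requested facts in a toy scene."""
    return {k: scene[k] for k in keys if k in scene}


def review(info: dict, needed: set) -> Feedback:
    """Stand-in for a reasoning agent. `needed` is an oracle (an assumption):
    a real system would have to infer what the question requires."""
    held = set(info)
    return Feedback(prune=held - needed, request=needed - held)


def curate_mss(scene: dict, initial_keys: set, needed: set, max_rounds: int = 5) -> dict:
    """Iterate prune/request until nothing is redundant and nothing is missing."""
    info = perceive(scene, set(initial_keys))
    for _ in range(max_rounds):
        fb = review(info, needed)
        if not fb.prune and not fb.request:
            break  # sufficient and minimal under the toy criterion
        for key in fb.prune:
            del info[key]
        info.update(perceive(scene, fb.request))
    return info


# Toy scene with made-up facts, for illustration only.
scene = {
    "chair_position": (1.0, 0.0, 2.0),
    "table_position": (0.0, 0.0, 2.0),
    "camera_facing": (0.0, 0.0, 1.0),
    "lamp_color": "red",
}
mss = curate_mss(
    scene,
    initial_keys={"chair_position", "lamp_color"},
    needed={"chair_position", "table_position", "camera_facing"},
)
print(sorted(mss))
```

In this toy run the first round drops `lamp_color` and fetches the two missing facts, and the second round finds nothing left to change. The `max_rounds` cap is a design choice of the sketch so that a fact absent from the scene cannot make the loop run forever.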
Related papers
- Abstract 3D Perception for Spatial Intelligence in Vision-Language Models [100.13033631690114]
Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding. We introduce SandboxVLM, a framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLM. Our approach consistently improves spatial intelligence, achieving an 8.3% gain on SAT Real compared with baseline methods.
arXiv Detail & Related papers (2025-11-14T04:16:09Z)
- Advancing 3D Scene Understanding with MV-ScanQA Multi-View Reasoning Evaluation and TripAlign Pre-training Dataset [56.533371387182065]
MV-ScanQA is a novel 3D question answering dataset where 68% of questions explicitly require integrating information from multiple views. We present TripAlign, a large-scale and low-cost 2D-3D-language pre-training corpus containing 1M <2D view, set of 3D objects, text> triplets. We further develop LEGO, a baseline method for the multi-view reasoning challenge in MV-ScanQA, transferring knowledge from pre-trained 2D LVLMs to the 3D domain with TripAlign.
arXiv Detail & Related papers (2025-08-14T20:35:59Z)
- SORT3D: Spatial Object-centric Reasoning Toolbox for Zero-Shot 3D Grounding Using Large Language Models [9.279591094901152]
SORT3D is an approach that utilizes rich object attributes from 2D data and merges a heuristics-based spatial reasoning toolbox with the ability of large language models (LLMs) to perform sequential reasoning. We show that SORT3D achieves state-of-the-art zero-shot performance on complex view-dependent grounding tasks on two benchmarks. We also implement the pipeline to run in real time on two autonomous vehicles and demonstrate that our approach can be used for object-goal navigation in previously unseen real-world environments.
arXiv Detail & Related papers (2025-04-25T20:24:11Z)
- 3D-AffordanceLLM: Harnessing Large Language Models for Open-Vocabulary Affordance Detection in 3D Worlds [81.14476072159049]
3D Affordance detection is a challenging problem with broad applications on various robotic tasks. We reformulate the traditional affordance detection paradigm into an Instruction Reasoning Affordance Segmentation (IRAS) task. We propose 3D-ADLLM, a framework designed for reasoning affordance detection in 3D open scenes.
arXiv Detail & Related papers (2025-02-27T12:29:44Z)
- ZeroKey: Point-Level Reasoning and Zero-Shot 3D Keypoint Detection from Large Language Models [57.57832348655715]
We propose a novel zero-shot approach for keypoint detection on 3D shapes. Our method utilizes the rich knowledge embedded within Multi-Modal Large Language Models.
arXiv Detail & Related papers (2024-12-09T08:31:57Z)
- VEON: Vocabulary-Enhanced Occupancy Prediction [15.331332063879342]
We propose VEON for Vocabulary-Enhanced Occupancy predictioN by not only assembling but also adapting 2D foundation models.
VEON achieves 15.14 mIoU on Occ3D-nuScenes, and shows the capability of recognizing objects with open-vocabulary categories.
arXiv Detail & Related papers (2024-07-17T03:26:50Z)
- FILP-3D: Enhancing 3D Few-shot Class-incremental Learning with Pre-trained Vision-Language Models [59.13757801286343]
Few-shot class-incremental learning aims to mitigate the catastrophic forgetting issue when a model is incrementally trained on limited data. We introduce the FILP-3D framework with two novel components: the Redundant Feature Eliminator (RFE) for feature space misalignment and the Spatial Noise Compensator (SNC) for significant noise.
arXiv Detail & Related papers (2023-12-28T14:52:07Z)
- Learning Occupancy for Monocular 3D Object Detection [25.56336546513198]
We propose OccupancyM3D, a method of learning occupancy for monocular 3D detection.
It directly learns occupancy in frustum and 3D space, leading to more discriminative and informative 3D features and representations.
Experiments on KITTI and Waymo open datasets demonstrate that the proposed method achieves a new state of the art and surpasses other methods by a significant margin.
arXiv Detail & Related papers (2023-05-25T04:03:46Z)
- Attention-Based Depth Distillation with 3D-Aware Positional Encoding for Monocular 3D Object Detection [10.84784828447741]
ADD is an Attention-based Depth knowledge Distillation framework with 3D-aware positional encoding.
Thanks to our teacher design, our framework is seamless, free of domain gap, easy to implement, and compatible with object-wise ground-truth depth.
We implement our framework on three representative monocular detectors, and we achieve state-of-the-art performance with no additional inference computational cost.
arXiv Detail & Related papers (2022-11-30T06:39:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.