SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning
- URL: http://arxiv.org/abs/2505.12448v2
- Date: Wed, 21 May 2025 03:34:31 GMT
- Title: SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning
- Authors: Yang Liu, Ming Ma, Xiaomin Yu, Pengxiang Ding, Han Zhao, Mingyang Sun, Siteng Huang, Donglin Wang,
- Abstract summary: We propose a novel framework that transforms raw depth data into structured, interpretable textual rationales.<n>These textual rationales serve as meaningful intermediate representations to significantly enhance spatial reasoning capabilities.<n>We introduce a new dataset named SSR-CoT, a million-scale visual-language reasoning dataset enriched with intermediate spatial reasoning annotations.
- Score: 34.31268708448338
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite impressive advancements in Visual-Language Models (VLMs) for multi-modal tasks, their reliance on RGB inputs limits precise spatial understanding. Existing methods for integrating spatial cues, such as point clouds or depth, either require specialized sensors or fail to effectively exploit depth information for higher-order reasoning. To this end, we propose a novel Spatial Sense and Reasoning method, dubbed SSR, a novel framework that transforms raw depth data into structured, interpretable textual rationales. These textual rationales serve as meaningful intermediate representations to significantly enhance spatial reasoning capabilities. Additionally, we leverage knowledge distillation to compress the generated rationales into compact latent embeddings, which facilitate resource-efficient and plug-and-play integration into existing VLMs without retraining. To enable comprehensive evaluation, we introduce a new dataset named SSR-CoT, a million-scale visual-language reasoning dataset enriched with intermediate spatial reasoning annotations, and present SSRBench, a comprehensive multi-task benchmark. Extensive experiments on multiple benchmarks demonstrate SSR substantially improves depth utilization and enhances spatial reasoning, thereby advancing VLMs toward more human-like multi-modal understanding. Our project page is at https://yliu-cs.github.io/SSR.
Related papers
- SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks [53.611256895338585]
We introduce SIRI-Bench, a benchmark designed to evaluate Vision-Language Models' spatial intelligence through video-based reasoning tasks.<n> SIRI-Bench comprises nearly 1K video-question-answer triplets, where each problem is embedded in a realistic 3D scene and captured by video.<n>To facilitate large-scale data synthesis, we develop an Automatic Scene Creation Engine.
arXiv Detail & Related papers (2025-06-17T13:40:00Z) - Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning [78.17782197231325]
We propose a reasoning-guided reinforcement learning strategy that aligns the extractor's captioning behavior with the reasoning objective.<n> Experiments on multi-modal math and science benchmarks show that the proposed RACRO method achieves state-of-the-art average performance.
arXiv Detail & Related papers (2025-06-05T02:28:07Z) - SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding [64.15606979785355]
Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, yet their capabilities for spatial understanding are less explored.<n>This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities?
arXiv Detail & Related papers (2025-05-22T17:59:03Z) - Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models [14.442394137843923]
We present a detailed analysis that first delineates the core elements of spatial reasoning.<n>We then assesses the performance of these models in both synthetic and real-world images.
arXiv Detail & Related papers (2025-03-25T14:34:06Z) - EmbodiedVSR: Dynamic Scene Graph-Guided Chain-of-Thought Reasoning for Visual Spatial Tasks [24.41705039390567]
EmbodiedVSR (Embodied Visual Spatial Reasoning) is a novel framework that integrates dynamic scene graph-guided Chain-of-Thought (CoT) reasoning.<n>Our method enables zero-shot spatial reasoning without task-specific fine-tuning.<n>Experiments demonstrate that our framework significantly outperforms existing MLLM-based methods in accuracy and reasoning coherence.
arXiv Detail & Related papers (2025-03-14T05:06:07Z) - Leveraging Stable Diffusion for Monocular Depth Estimation via Image Semantic Encoding [1.0445560141983634]
We propose a novel image-based semantic embedding that extracts contextual information directly from visual features.<n>Our method achieves performance comparable to state-of-the-art models while addressing the shortcomings of CLIP embeddings in handling outdoor scenes.
arXiv Detail & Related papers (2025-02-01T15:37:22Z) - Enhanced Vision-Language Models for Diverse Sensor Understanding: Cost-Efficient Optimization and Benchmarking [37.98711638929805]
We introduce a novel, cost-efficient paradigm that significantly advances sensor image understanding.<n>We propose Sensor-Aware Attributes Fine-Tuning (SAFT) with the Diverse Negative Attributes (DNA) optimization.<n>We present VS-TDX-the first comprehensive, public benchmark designed to rigorously evaluate VLMs' sensor-specific understanding.
arXiv Detail & Related papers (2024-12-30T06:44:25Z) - Agent Journey Beyond RGB: Unveiling Hybrid Semantic-Spatial Environmental Representations for Vision-and-Language Navigation [15.302043040651368]
Navigating unseen environments based on natural language instructions remains difficult for egocentric agents.<n>We propose a versatile Semantic Understanding and Spatial Awareness architecture to encourage agents to ground environment from diverse perspectives.<n> Experiments demonstrate that SUSA's hybrid semantic-spatial representations effectively enhance navigation performance.
arXiv Detail & Related papers (2024-12-09T13:10:28Z) - DLaVA: Document Language and Vision Assistant for Answer Localization with Enhanced Interpretability and Trustworthiness [34.170341753045776]
Document Visual Question Answering (VQA) demands robust integration of text detection, recognition, and spatial reasoning.<n>We introduce DLaVA, a training-free pipeline that leverages Multimodal Large Language Models (MLLMs) for zero-shot answer localization.
arXiv Detail & Related papers (2024-11-29T06:17:11Z) - LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language Interpretation [21.91073335335992]
We introduce LHRS-Bot-Nova, an MLLM specialized in understanding remote sensing (RS) images.
LHRS-Bot-Nova features an enhanced vision encoder and a novel bridge layer, enabling efficient visual compression and better language-vision alignment.
Extensive experiments demonstrate superior performance of LHRS-Bot-Nova across various RS image understanding tasks.
arXiv Detail & Related papers (2024-11-14T09:23:40Z) - SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models [68.13636352687257]
We introduce Spatial Region GPT (SpatialRGPT) to enhance VLMs' spatial perception and reasoning capabilities.
During inference, when provided with user-specified region proposals, SpatialRGPT can accurately perceive their relative directions and distances.
Our results demonstrate that SpatialRGPT significantly enhances performance in spatial reasoning tasks, both with and without local region prompts.
arXiv Detail & Related papers (2024-06-03T17:59:06Z) - Large Language Models for Information Retrieval: A Survey [58.30439850203101]
Information retrieval has evolved from term-based methods to its integration with advanced neural models.
Recent research has sought to leverage large language models (LLMs) to improve IR systems.
We delve into the confluence of LLMs and IR systems, including crucial aspects such as query rewriters, retrievers, rerankers, and readers.
arXiv Detail & Related papers (2023-08-14T12:47:22Z) - Robust Saliency-Aware Distillation for Few-shot Fine-grained Visual
Recognition [57.08108545219043]
Recognizing novel sub-categories with scarce samples is an essential and challenging research topic in computer vision.
Existing literature addresses this challenge by employing local-based representation approaches.
This article proposes a novel model, Robust Saliency-aware Distillation (RSaD), for few-shot fine-grained visual recognition.
arXiv Detail & Related papers (2023-05-12T00:13:17Z) - A Threefold Review on Deep Semantic Segmentation: Efficiency-oriented,
Temporal and Depth-aware design [77.34726150561087]
We conduct a survey on the most relevant and recent advances in Deep Semantic in the context of vision for autonomous vehicles.
Our main objective is to provide a comprehensive discussion on the main methods, advantages, limitations, results and challenges faced from each perspective.
arXiv Detail & Related papers (2023-03-08T01:29:55Z) - High-resolution Depth Maps Imaging via Attention-based Hierarchical
Multi-modal Fusion [84.24973877109181]
We propose a novel attention-based hierarchical multi-modal fusion network for guided DSR.
We show that our approach outperforms state-of-the-art methods in terms of reconstruction accuracy, running speed and memory efficiency.
arXiv Detail & Related papers (2021-04-04T03:28:33Z) - Accurate RGB-D Salient Object Detection via Collaborative Learning [101.82654054191443]
RGB-D saliency detection shows impressive ability on some challenge scenarios.
We propose a novel collaborative learning framework where edge, depth and saliency are leveraged in a more efficient way.
arXiv Detail & Related papers (2020-07-23T04:33:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.