Bridging the Dynamic Perception Gap: Training-Free Draft Chain-of-Thought for Dynamic Multimodal Spatial Reasoning
- URL: http://arxiv.org/abs/2505.16579v1
- Date: Thu, 22 May 2025 12:14:23 GMT
- Title: Bridging the Dynamic Perception Gap: Training-Free Draft Chain-of-Thought for Dynamic Multimodal Spatial Reasoning
- Authors: Siqu Ou, Hongcheng Liu, Pingjie Wang, Yusheng Liao, Chuan Xuan, Yanfeng Wang, Yu Wang
- Abstract summary: We present a novel maze navigation benchmark designed to evaluate dynamic spatial reasoning. Experiments show that augmenting reasoning chains with dynamic visual drafts, overlaid on input images, significantly outperforms conventional approaches. We propose D2R (Dynamic Draft-Augmented Reasoning), a training-free framework that seamlessly integrates textual CoT with corresponding visual drafts into MLLMs.
- Score: 18.7712668000592
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While chains-of-thought (CoT) have advanced complex reasoning in multimodal large language models (MLLMs), existing methods remain confined to text or static visual domains, often faltering in dynamic spatial reasoning tasks. To bridge this gap, we present GRASSLAND, a novel maze navigation benchmark designed to evaluate dynamic spatial reasoning. Our experiments show that augmenting textual reasoning chains with dynamic visual drafts, overlaid on input images, significantly outperforms conventional approaches, offering new insights into spatial reasoning in evolving environments. To generalize this capability, we propose D2R (Dynamic Draft-Augmented Reasoning), a training-free framework that seamlessly integrates textual CoT with corresponding visual drafts into MLLMs. Extensive evaluations demonstrate that D2R consistently enhances performance across diverse tasks, establishing a robust baseline for dynamic spatial reasoning without requiring model fine-tuning. The project is available at https://github.com/Cratileo/D2R.
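As a rough illustration of the draft-overlay idea described in the abstract (a minimal sketch, not the authors' released D2R code), the snippet below draws a candidate navigation path onto a maze image and pairs it with a textual CoT prompt; the uniform grid size, the drawing style, and the prompt wording are all assumptions.

```python
# Minimal sketch of draft-augmented reasoning (assumptions: PIL drawing, a uniform grid
# maze, and a placeholder prompt -- this is not the official D2R implementation).
from PIL import Image, ImageDraw

def overlay_draft(maze: Image.Image, path_cells: list[tuple[int, int]], cell_px: int = 32) -> Image.Image:
    """Draw a candidate navigation path (list of (row, col) cells) as a visual draft."""
    img = maze.copy()
    draw = ImageDraw.Draw(img)
    centers = [((c + 0.5) * cell_px, (r + 0.5) * cell_px) for r, c in path_cells]
    draw.line(centers, fill=(255, 0, 0), width=4)          # the dynamic visual draft
    for x, y in centers:
        draw.ellipse([x - 4, y - 4, x + 4, y + 4], fill=(255, 0, 0))
    return img

def build_prompt(cot_so_far: str) -> str:
    """Textual CoT that refers to the overlaid draft; the wording is an assumption."""
    return ("The red line in the image is a draft of the moves taken so far.\n"
            f"Reasoning so far:\n{cot_so_far}\n"
            "Given the draft, what is the next move (up/down/left/right), and why?")

maze = Image.new("RGB", (4 * 32, 4 * 32), "white")          # stand-in for a maze frame
draft_frame = overlay_draft(maze, [(0, 0), (0, 1), (1, 1)])
prompt = build_prompt("Step 1: move right. Step 2: move down.")
# draft_frame and prompt would then be sent together to any off-the-shelf MLLM.
```

The key design point is that the model receives the updated draft image at each reasoning step, so the visual state evolves alongside the textual chain-of-thought rather than staying fixed at the original input.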
Related papers
- Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space [46.05748768260013]
We propose a test-time Dynamic Multimodal Latent Reasoning framework. It employs confidence-guided latent policy gradient optimization over latent think tokens for in-depth reasoning. Experiments across seven multimodal reasoning benchmarks and various model architectures demonstrate that DMLR significantly improves reasoning and perception performance.
arXiv Detail & Related papers (2025-12-14T10:07:45Z)
- Interleaved Latent Visual Reasoning with Selective Perceptual Modeling [42.93438443502933]
Interleaved reasoning paradigms enhance Multimodal Large Language Models (MLLMs) with visual feedback but are hindered by the prohibitive computational cost. A promising alternative, latent visual reasoning, circumvents this bottleneck yet currently forces a critical trade-off. We introduce Interleaved Latent Visual Reasoning (ILVR), a framework that unifies dynamic state evolution with precise perceptual modeling.
arXiv Detail & Related papers (2025-12-05T12:09:39Z)
- Monet: Reasoning in Latent Visual Space Beyond Images and Language [55.424507246294326]
"Thinking with images" has emerged as an effective paradigm for advancing visual reasoning.<n>Existing methods fall short of human-like abstract visual thinking.<n>We introduce Monet, a training framework that enables multimodal large language models to reason directly within the latent visual space.
arXiv Detail & Related papers (2025-11-26T13:46:39Z)
- OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment [79.98946571424607]
We present OmniBridge, a unified framework that supports vision-language understanding, generation, and retrieval within a unified architecture. To address the challenge of task interference, we propose a two-stage decoupled training strategy. Experiments demonstrate that OmniBridge achieves competitive or state-of-the-art performance in all three tasks.
arXiv Detail & Related papers (2025-09-23T13:57:55Z)
- See&Trek: Training-Free Spatial Prompting for Multimodal Large Language Model [33.18304419115947]
We introduce SEE&TREK, the first training-free prompting framework to enhance the spatial understanding of Multimodal Large Language Models (MLLMs) under vision-only constraints. We focus on increasing visual diversity and motion reconstruction. Our method is training- and GPU-free, requiring only a single forward pass, and can be seamlessly integrated into existing MLLMs.
arXiv Detail & Related papers (2025-09-19T15:30:26Z)
- Multimodal Chain of Continuous Thought for Latent-Space Reasoning in Vision-Language Models [1.9950682531209158]
We propose the Multimodal Chain of Continuous Thought (MCOUT), which enables reasoning directly in a joint latent space rather than in natural language. We show that MCOUT consistently improves multimodal reasoning, yielding up to 8.23% accuracy gains over strong baselines. These findings highlight latent continuous reasoning as a promising direction for advancing LMMs beyond language-bound CoT.
arXiv Detail & Related papers (2025-08-18T02:50:20Z)
- MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance across the MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z)
- Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing [62.447497430479174]
Drawing to reason in space is a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space. Our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks.
arXiv Detail & Related papers (2025-06-11T17:41:50Z)
- Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation [22.27973335431714]
We present v1, a lightweight extension to Multimodal Large Language Models (MLLMs). v1 introduces a simple point-and-copy mechanism that allows the model to dynamically retrieve relevant image regions throughout the reasoning process. Our results suggest that dynamic visual access is a promising direction for enhancing grounded multimodal reasoning (a rough sketch of such a point-and-copy step follows this entry).
arXiv Detail & Related papers (2025-05-24T19:30:47Z)
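The following sketch is an assumed reading of the point-and-copy idea in the entry above, not the v1 implementation: crop a region around a point the model refers to and append it to the visual context so the next reasoning turn can re-inspect that region.

```python
# Hypothetical point-and-copy step (an illustrative guess at the mechanism, not v1's code):
# crop a region around a model-referenced point and add it to the visual context.
from PIL import Image

def point_and_copy(image: Image.Image, point: tuple[int, int], box: int = 96) -> Image.Image:
    """Return a square crop of side `box` centered on `point`, clamped to the image bounds."""
    x, y = point
    left = max(0, min(x - box // 2, image.width - box))
    top = max(0, min(y - box // 2, image.height - box))
    return image.crop((left, top, left + box, top + box))

scene = Image.new("RGB", (640, 480), "gray")                 # stand-in for the input image
visual_context = [scene]                                     # images the MLLM sees this turn
visual_context.append(point_and_copy(scene, (320, 200)))     # model "points" at (320, 200)
```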
- V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models [84.27290155010533]
We introduce Vision-centric Multiple Abilities Game Evaluation (V-MAGE), a novel game-based evaluation framework. V-MAGE features five distinct video games comprising over 30 carefully constructed evaluation scenarios. We show that V-MAGE provides actionable insights for improving the visual and reasoning capabilities of MLLMs in dynamic, interactive settings.
arXiv Detail & Related papers (2025-04-08T15:43:01Z)
- EmbodiedVSR: Dynamic Scene Graph-Guided Chain-of-Thought Reasoning for Visual Spatial Tasks [24.41705039390567]
EmbodiedVSR (Embodied Visual Spatial Reasoning) is a novel framework that integrates dynamic scene graph-guided Chain-of-Thought (CoT) reasoning. Our method enables zero-shot spatial reasoning without task-specific fine-tuning. Experiments demonstrate that our framework significantly outperforms existing MLLM-based methods in accuracy and reasoning coherence (a rough sketch of scene graph-guided prompting follows this entry).
arXiv Detail & Related papers (2025-03-14T05:06:07Z)
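For illustration only, the sketch below serializes a hand-built scene graph into a chain-of-thought prompt; EmbodiedVSR's actual graph construction and prompting pipeline are not reproduced here, and the object and relation names are invented.

```python
# Illustrative only: serialize a hand-built dynamic scene graph into a CoT prompt.
# This is an assumption about the general shape of graph-guided prompting.
scene_graph = {
    "objects": ["red_block", "blue_block", "table"],
    "relations": [
        ("red_block", "on_top_of", "blue_block"),
        ("blue_block", "on", "table"),
    ],
}

def graph_to_prompt(graph: dict, question: str) -> str:
    facts = "\n".join(f"- {s} {r} {o}" for s, r, o in graph["relations"])
    return ("Scene facts:\n" + facts +
            "\n\nReason step by step about the spatial layout, then answer: " + question)

print(graph_to_prompt(scene_graph, "If the blue block is removed, what happens to the red block?"))
```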
- Zero-Shot Interactive Text-to-Image Retrieval via Diffusion-Augmented Representations [7.439049772394586]
Diffusion Augmented Retrieval (DAR) is a paradigm-shifting framework that bypasses MLLM fine-tuning entirely. DAR synergizes Large Language Model (LLM)-guided query refinement with Diffusion Model (DM)-based visual synthesis to create contextually enriched intermediate representations (a schematic sketch of such a pipeline follows this entry).
arXiv Detail & Related papers (2025-01-26T03:29:18Z)
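Read literally, the DAR summary describes a two-branch query pipeline; the schematic below is a sketch under that reading, with every component name, the fusion rule, and the embedding interfaces assumed rather than taken from the paper.

```python
# Schematic sketch of a diffusion-augmented retrieval query (all component names are
# placeholders, not DAR's actual interfaces): refine the text query with an LLM,
# synthesize an image with a diffusion model, embed both, and fuse before search.
import numpy as np

def diffusion_augmented_query(query, refine_with_llm, synthesize_image,
                              embed_text, embed_image, alpha=0.5):
    refined = refine_with_llm(query)            # LLM-guided query refinement
    synthetic = synthesize_image(refined)       # diffusion-based visual synthesis
    fused = alpha * embed_text(refined) + (1 - alpha) * embed_image(synthetic)
    return fused / np.linalg.norm(fused)        # enriched intermediate representation

def retrieve(fused_query, gallery_embeddings):
    """gallery_embeddings: (N, D) array of L2-normalized image embeddings."""
    scores = gallery_embeddings @ fused_query   # cosine similarities
    return int(np.argmax(scores))               # index of the best-matching image
```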
- Imagine while Reasoning in Space: Multimodal Visualization-of-Thought [70.74453180101365]
Chain-of-Thought (CoT) prompting has proven highly effective for enhancing complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). We propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT). It enables visual thinking in MLLMs by generating image visualizations of their reasoning traces.
arXiv Detail & Related papers (2025-01-13T18:23:57Z)
- REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models [67.55362046790512]
Vision-language models lack the ability to correctly reason over spatial relationships.
We develop the REVISION framework, which improves spatial fidelity in vision-language models.
Our results and findings indicate that utilizing rendering-based frameworks is an effective approach for developing spatially-aware models.
arXiv Detail & Related papers (2024-08-05T04:51:46Z)
- Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model [51.83436609094658]
We introduce Coarse Correspondences, a simple lightweight method that enhances MLLMs' spatial-temporal reasoning with 2D images as input.
Our method uses a lightweight tracking model to identify primary object correspondences between frames in a video or across different image viewpoints.
We demonstrate that this simple, training-free approach consistently brings substantial gains to GPT-4V/O across four benchmarks (a rough sketch of the frame-marking idea follows this entry).
arXiv Detail & Related papers (2024-08-01T17:57:12Z)
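As a hedged sketch of the kind of frame-marking the Coarse Correspondences summary describes (the tracking output format and drawing details are assumptions, not the paper's code), one could stamp a consistent ID next to each tracked object before handing the frames to an MLLM:

```python
# Illustrative frame-marking step: draw a consistent numeric ID at each tracked object's
# location so the MLLM can refer to "object 1" / "object 2" across frames or viewpoints.
# The track format below is an assumption, not the paper's interface.
from PIL import Image, ImageDraw

def mark_correspondences(frames: list, tracks: dict) -> list:
    """tracks maps object_id -> list of (x, y) centers, one per frame."""
    marked = []
    for t, frame in enumerate(frames):
        canvas = frame.copy()
        draw = ImageDraw.Draw(canvas)
        for obj_id, centers in tracks.items():
            x, y = centers[t]
            draw.ellipse([x - 10, y - 10, x + 10, y + 10], outline=(255, 0, 0), width=3)
            draw.text((x + 12, y - 12), str(obj_id), fill=(255, 0, 0))
        marked.append(canvas)
    return marked

frames = [Image.new("RGB", (320, 240), "white") for _ in range(2)]
marked = mark_correspondences(frames, {1: [(60, 80), (90, 85)], 2: [(200, 120), (210, 118)]})
```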
- Learn to Memorize and to Forget: A Continual Learning Perspective of Dynamic SLAM [17.661231232206028]
Simultaneous localization and mapping (SLAM) with implicit neural representations has received extensive attention.
We propose a novel SLAM framework for dynamic environments.
arXiv Detail & Related papers (2024-07-18T09:35:48Z)
- Improving Vision-and-Language Reasoning via Spatial Relations Modeling [30.477235227733928]
Visual commonsense reasoning (VCR) is a challenging multi-modal task.
The proposed method can guide the representations to maintain more spatial context.
We achieve state-of-the-art results on VCR and two other vision-and-language reasoning tasks, VQA and NLVR.
arXiv Detail & Related papers (2023-11-09T11:54:55Z)
- DynaVol: Unsupervised Learning for Dynamic Scenes through Object-Centric Voxelization [67.85434518679382]
We present DynaVol, a 3D scene generative model that unifies geometric structures and object-centric learning.
The key idea is to perform object-centric voxelization to capture the 3D nature of the scene.
Voxel features evolve over time through a canonical-space deformation function, forming the basis for global representation learning.
arXiv Detail & Related papers (2023-04-30T05:29:28Z)