V-Thinker: Interactive Thinking with Images
- URL: http://arxiv.org/abs/2511.04460v1
- Date: Thu, 06 Nov 2025 15:32:29 GMT
- Title: V-Thinker: Interactive Thinking with Images
- Authors: Runqi Qiao, Qiuna Tan, Minghan Yang, Guanting Dong, Peiqing Yang, Shiqiang Lang, Enhui Wan, Xiaowan Wang, Yida Xu, Lan Yang, Chong Sun, Chen Li, Honggang Zhang,
- Abstract summary: Recent advances in vision-centric reasoning explore a promising "Thinking with Images" paradigm for Large Multimodal Models (LMMs)<n>We present V-Thinker, a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning.<n>V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios.
- Score: 22.55079103487787
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Empowering Large Multimodal Models (LMMs) to deeply integrate image interaction with long-horizon reasoning capabilities remains a long-standing challenge in this field. Recent advances in vision-centric reasoning explore a promising "Thinking with Images" paradigm for LMMs, marking a shift from image-assisted reasoning to image-interactive thinking. While this milestone enables models to focus on fine-grained image regions, progress remains constrained by limited visual tool spaces and task-specific workflow designs. To bridge this gap, we present V-Thinker, a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning. V-Thinker comprises two key components: (1) a Data Evolution Flywheel that automatically synthesizes, evolves, and verifies interactive reasoning datasets across three dimensions-diversity, quality, and difficulty; and (2) a Visual Progressive Training Curriculum that first aligns perception via point-level supervision, then integrates interactive reasoning through a two-stage reinforcement learning framework. Furthermore, we introduce VTBench, an expert-verified benchmark targeting vision-centric interactive reasoning tasks. Extensive experiments demonstrate that V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios, providing valuable insights for advancing image-interactive reasoning applications.
Related papers
- Chatting with Images for Introspective Visual Thinking [50.7747647794877]
''Chatting with images'' is a new framework that reframes visual manipulation as language-guided feature modulation.<n>Under the guidance of expressive language prompts, the model dynamically performs joint re-encoding over multiple image regions.<n>ViLaVT achieves strong and consistent improvements on complex multi-image and video-based spatial reasoning tasks.
arXiv Detail & Related papers (2026-02-11T17:42:37Z) - VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization [87.26383908243878]
We show that vision encoders within Multimodal Large Language Models exhibit deficiencies in their dense feature representations.<n>We propose VersaViT, a well-rounded vision transformer that instantiates a novel multi-task framework for collaborative post-training.
arXiv Detail & Related papers (2026-02-10T16:08:19Z) - Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space [46.05748768260013]
We propose a test-time Dynamic Multimodal Latent Reasoning framework.<n>It employs confidence-guided latent policy gradient optimization to latent think tokens for in-depth reasoning.<n> Experiments across seven multimodal reasoning benchmarks and various model architectures demonstrate that DMLR significantly improves reasoning and perception performance.
arXiv Detail & Related papers (2025-12-14T10:07:45Z) - Guiding the Inner Eye: A Framework for Hierarchical and Flexible Visual Grounded Reasoning [6.800544911407401]
GRiP (Guided Reasoning and Perception) is a novel training framework for visual grounded reasoning.<n> GRiP cultivates robust and flexible visual grounded reasoning by explicitly guiding the model's perceptual focus and logical pathways.<n> GRiP achieves state-of-the-art results among open-source models on the highly challenging TreeBench and V* Bench.
arXiv Detail & Related papers (2025-11-27T07:18:25Z) - ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning [76.95203056566191]
Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought.<n>We build ThinkMorph, a unified model fine-tuned on approximately 24K high-quality interleaved reasoning traces spanning tasks with varying visual engagement.<n>ThinkMorph learns to generate progressive text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic.
arXiv Detail & Related papers (2025-10-30T17:51:38Z) - Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers [90.4459196223986]
A similar evolution is now unfolding in AI, marking a paradigm shift from models that merely think about images to those that can truly think with images.<n>This emerging paradigm is characterized by models leveraging visual information as intermediate steps in their thought process, transforming vision from a passive input into a dynamic, manipulable cognitive workspace.
arXiv Detail & Related papers (2025-06-30T14:48:35Z) - Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing [62.447497430479174]
Drawing to reason in space is a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space.<n>Our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks.
arXiv Detail & Related papers (2025-06-11T17:41:50Z) - Seeing Beyond the Scene: Enhancing Vision-Language Models with Interactional Reasoning [27.511627003202538]
Traditional scene graphs primarily focus on spatial relationships, limiting vision-language models' (VLMs) ability to reason about complex interactions in visual scenes.<n>This paper addresses two key challenges: (1) conventional detection-to-construction methods produce unfocused, contextually irrelevant relationship sets, and (2) existing approaches fail to form persistent memories for generalizing interaction reasoning to new scenes.<n>We propose Interaction-augmented Scene Graph Reasoning (ISGR), a framework that enhances VLMs' interactional reasoning through three complementary components.
arXiv Detail & Related papers (2025-05-14T04:04:23Z) - Weaving Context Across Images: Improving Vision-Language Models through Focus-Centric Visual Chains [31.828341309787042]
Vision-language models (VLMs) achieve remarkable success in single-image tasks.<n>Real-world scenarios often involve intricate multi-image inputs, leading to a notable performance decline.<n>We propose Focus-Centric Visual Chain, a novel paradigm that enhances VLMs'perception, comprehension, and reasoning abilities in multi-image scenarios.
arXiv Detail & Related papers (2025-04-28T19:02:18Z) - CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation [12.008690947774015]
We propose a multi-step reasoning framework that mimics human-like "slow thinking" for multi-image understanding.<n>The construction of interleaved multimodal multi-step reasoning chains, which utilize critical visual region tokens, as supervisory signals.<n>The introduction of a test-time memory augmentation module that expands the model reasoning capacity during inference.
arXiv Detail & Related papers (2025-03-07T09:13:17Z) - A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs [3.2228025627337864]
This paper introduces a structured evaluation framework to dissect the perception-reasoning interface in Vision-Language Models (VLMs)<n>We propose three distinct evaluation paradigms, mirroring human problem-solving strategies.<n>Applying this framework, we demonstrate that CA, leveraging powerful language models for reasoning over rich, independently generated descriptions, achieves new state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2025-01-23T12:42:42Z) - Question Aware Vision Transformer for Multimodal Reasoning [14.188369270753347]
We introduce QA-ViT, a Question Aware Vision Transformer approach for multimodal reasoning.
It embeds question awareness directly within the vision encoder.
This integration results in dynamic visual features focusing on relevant image aspects to the posed question.
arXiv Detail & Related papers (2024-02-08T08:03:39Z) - DiMBERT: Learning Vision-Language Grounded Representations with
Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.