ViThinker: Active Vision-Language Reasoning via Dynamic Perceptual Querying
- URL: http://arxiv.org/abs/2602.02873v1
- Date: Mon, 02 Feb 2026 22:29:57 GMT
- Title: ViThinker: Active Vision-Language Reasoning via Dynamic Perceptual Querying
- Authors: Weihang You, Qingchan Zhu, David Liu, Yi Pan, Geng Yuan, Hanqi Jiang,
- Abstract summary: ViThinker is a framework that enables vision-language models to autonomously generate decision tokens triggering the synthesis of expert-aligned visual features on demand. ViThinker internalizes vision-expert capabilities during training, performing generative mental simulation during inference without external tool calls.
- Score: 15.728211622542267
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Chain-of-Thought (CoT) reasoning excels in language models but struggles in vision-language models due to premature visual-to-text conversion that discards continuous information such as geometry and spatial layout. While recent methods enhance CoT through static enumeration or attention-based selection, they remain passive, i.e., they process pre-computed inputs rather than actively seeking task-relevant details. Inspired by human active perception, we introduce ViThinker, a framework that enables vision-language models to autonomously generate decision (query) tokens that trigger the synthesis of expert-aligned visual features on demand. ViThinker internalizes vision-expert capabilities during training, performing generative mental simulation during inference without external tool calls. Through a two-stage curriculum (first distilling frozen experts into model parameters, then learning task-driven querying via sparsity penalties), ViThinker discovers the minimal sufficient perception for each reasoning step. Evaluations across vision-centric benchmarks demonstrate consistent improvements, validating that active query generation outperforms passive approaches in both perceptual grounding and reasoning accuracy.
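The decision-token mechanism and two-stage curriculum described in the abstract can be pictured with a short PyTorch sketch. Everything below (module names, tensor shapes, the sigmoid gate standing in for the decision/query token, and the loss weighting) is a hypothetical illustration under assumptions, not the paper's actual implementation: a learned gate decides when to query, a small head stands in for the internalized vision expert, and the loss combines a distillation term toward a frozen expert (stage one) with an L1 sparsity penalty on the gate (stage two).

```python
# Hypothetical sketch of query-gated, on-demand visual feature synthesis.
# All names, shapes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QueryGatedSynthesizer(nn.Module):
    """Emits a soft query gate from the current reasoning state; when the gate
    is open, a small head synthesizes expert-aligned visual features from the
    cached image tokens instead of calling an external vision tool."""

    def __init__(self, d_model: int = 1024, d_expert: int = 512):
        super().__init__()
        self.query_gate = nn.Linear(d_model, 1)  # stands in for the decision (query) token
        self.synthesizer = nn.Sequential(        # stands in for the internalized expert
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_expert)
        )

    def forward(self, hidden: torch.Tensor, image_tokens: torch.Tensor):
        # hidden: (B, d_model) reasoning state; image_tokens: (B, N, d_model)
        gate = torch.sigmoid(self.query_gate(hidden))   # (B, 1), probability of querying
        pooled = image_tokens.mean(dim=1)               # crude visual summary
        feats = self.synthesizer(hidden + pooled)       # features synthesized on demand
        return gate * feats, gate


def training_loss(feats, gate, expert_feats, lambda_sparse: float = 0.01):
    """Stage 1: distill a frozen expert's features into the synthesizer.
    Stage 2: add an L1 sparsity penalty so queries are issued only when needed."""
    distill = F.mse_loss(feats, expert_feats)        # match the frozen expert
    sparsity = lambda_sparse * gate.abs().mean()     # discourage needless queries
    return distill + sparsity


if __name__ == "__main__":
    torch.manual_seed(0)
    module = QueryGatedSynthesizer()
    hidden = torch.randn(2, 1024)            # current reasoning state
    image_tokens = torch.randn(2, 49, 1024)  # cached visual tokens
    expert_feats = torch.randn(2, 512)       # frozen expert output (stand-in)
    feats, gate = module(hidden, image_tokens)
    loss = training_loss(feats, gate, expert_feats)
    loss.backward()
    print(f"gate={gate.squeeze(-1).tolist()}, loss={loss.item():.4f}")
```

Under this reading, the sparsity penalty is what pushes the model toward the "minimal sufficient perception" behavior claimed in the abstract: querying carries a cost, so the gate opens only when the synthesized features actually help the current reasoning step.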
Related papers
- Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs [60.93949629734977]
We propose Visual Contrastive Self-Taught Reasoner (VC-STaR) to mitigate hallucinations in model-generated rationales. We collect a diverse suite of VQA datasets, curate contrastive pairs according to multi-modal similarity, and generate rationales using VC-STaR. Extensive experiments show that VC-STaR not only outperforms existing self-improving approaches but also surpasses models finetuned on the SoTA visual reasoning datasets.
arXiv Detail & Related papers (2026-03-03T03:18:31Z) - LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning [25.571546214219747]
Student models frequently mimic a teacher's textual output while attending to fundamentally divergent visual regions. We propose LaViT, a framework that aligns latent visual thoughts rather than static embeddings. LaViT significantly enhances visual grounding, achieving up to +16.9% gains on complex reasoning tasks.
arXiv Detail & Related papers (2026-01-15T07:14:24Z) - Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space [46.05748768260013]
We propose a test-time Dynamic Multimodal Latent Reasoning framework. It applies confidence-guided latent policy gradient optimization to latent think tokens for in-depth reasoning. Experiments across seven multimodal reasoning benchmarks and various model architectures demonstrate that DMLR significantly improves reasoning and perception performance.
arXiv Detail & Related papers (2025-12-14T10:07:45Z) - ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model [61.29164681694533]
ViPER is a self-bootstrapping framework specifically designed to enable iterative evolution through self-critiquing and self-prediction. Qwen-Viper consistently demonstrates superior performance across different vision-language scenarios while maintaining generalizability.
arXiv Detail & Related papers (2025-10-28T10:42:57Z) - Think Twice to See More: Iterative Visual Reasoning in Medical VLMs [21.083636394814217]
We introduce ViTAR, a framework that emulates the iterative reasoning process of human experts through a cognitive chain of "think-act-rethink-answer". ViTAR treats medical images as interactive objects, enabling models to engage in multi-step visual reasoning.
arXiv Detail & Related papers (2025-10-11T06:39:57Z) - Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models [63.69856480318313]
AGILE formulates jigsaw solving as an interactive process, enabling the model to progressively engage with the environment. We show that AGILE substantially boosts performance on jigsaw tasks of varying complexity. We also demonstrate strong generalization across 9 general vision tasks, achieving an average improvement of 3.1%.
arXiv Detail & Related papers (2025-10-01T17:58:05Z) - Intention-Guided Cognitive Reasoning for Egocentric Long-Term Action Anticipation [52.6091162517921]
INSIGHT is a two-stage framework for egocentric action anticipation. In the first stage, INSIGHT focuses on extracting semantically rich features from hand-object interaction regions. In the second stage, it introduces a reinforcement learning-based module that simulates explicit cognitive reasoning.
arXiv Detail & Related papers (2025-08-03T12:52:27Z) - ViRAC: A Vision-Reasoning Agent Head Movement Control Framework in Arbitrary Virtual Environments [0.13654846342364302]
We propose ViRAC, which exploits the common-sense knowledge and reasoning capabilities of large-scale models. ViRAC produces more natural and context-aware head rotations than recent state-of-the-art techniques.
arXiv Detail & Related papers (2025-02-14T09:46:43Z) - Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement [102.22911097049953]
Large vision-language models (LVLMs) have achieved impressive results in visual question-answering and reasoning tasks. Existing methods often depend on external models or data, leading to uncontrollable and unstable alignment results. We propose SIMA, a self-improvement framework that enhances visual and language modality alignment without external dependencies.
arXiv Detail & Related papers (2024-05-24T23:09:27Z) - See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning [60.43585179885355]
We propose a novel framework named Interactive Prompting Visual Reasoner (IPVR) for few-shot knowledge-based visual reasoning.
IPVR contains three stages: see, think, and confirm.
We conduct experiments on a range of knowledge-based visual reasoning datasets.
arXiv Detail & Related papers (2023-01-12T18:59:50Z) - Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better than vision-only models at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)