LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning
- URL: http://arxiv.org/abs/2601.10129v1
- Date: Thu, 15 Jan 2026 07:14:24 GMT
- Title: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning
- Authors: Linquan Wu, Tianxiang Jiang, Yifei Dong, Haoyu Yang, Fengji Zhang, Shichaang Meng, Ai Xuan, Linqi Song, Jacky Keung,
- Abstract summary: Student models frequently mimic a teacher's textual output while attending to fundamentally divergent visual regions.<n>We propose LaViT, a framework that aligns latent visual thoughts rather than static embeddings.<n>LaViT significantly enhances visual grounding, achieving up to +16.9% gains on complex reasoning tasks.
- Score: 25.571546214219747
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current multimodal latent reasoning often relies on external supervision (e.g., auxiliary images), ignoring intrinsic visual attention dynamics. In this work, we identify a critical Perception Gap in distillation: student models frequently mimic a teacher's textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception. To bridge this, we propose LaViT, a framework that aligns latent visual thoughts rather than static embeddings. LaViT compels the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories prior to text generation, employing a curriculum sensory gating mechanism to prevent shortcut learning. Extensive experiments show that LaViT significantly enhances visual grounding, achieving up to +16.9% gains on complex reasoning tasks and enabling a compact 3B model to outperform larger open-source variants and proprietary models like GPT-4o.
Related papers
- Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs [60.93949629734977]
We propose Visual Contrastive Self-Taught Reasoner (VC-STaR) to mitigate hallucinations in model-generated rationales.<n>We collect a diverse suite of VQA datasets, curate contrastive pairs according to multi-modal similarity, and generate rationales using VC-STaR.<n>Extensive experiments show that VC-STaR not only outperforms existing self-improving approaches but also surpasses models finetuned on the SoTA visual reasoning datasets.
arXiv Detail & Related papers (2026-03-03T03:18:31Z) - ViThinker: Active Vision-Language Reasoning via Dynamic Perceptual Querying [15.728211622542267]
ViThinker is a framework that enables vision-language models to autonomously generate decision tokens triggering the synthesis of expert-aligned visual features on demand.<n>ViThinker internalizes vision-expert capabilities during training, performing generative mental simulation during inference without external tool calls.
arXiv Detail & Related papers (2026-02-02T22:29:57Z) - Thinking with Deltas: Incentivizing Reinforcement Learning via Differential Visual Reasoning Policy [75.66913260900726]
Reinforcement Learning with Verifiable Rewards has significantly advanced reasoning capabilities in Large Language Models.<n>Existing paradigms, driven by text-centric outcome rewards, encourage models to bypass visual perception.<n>We propose textbfThinking with Deltas, a framework driven by a textbfDifferential Visual Reasoning Policy.
arXiv Detail & Related papers (2026-01-11T08:25:34Z) - Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning [79.34909830834464]
Vision-Language Models (VLMs) have demonstrated remarkable success across diverse visual tasks, yet their performance degrades in complex visual environments.<n>We show that visual complexity strongly correlates with attention entropy, negatively impacting reasoning performance.<n>We propose Contrastive Attention Refinement for Visual Enhancement (CARVE), a training-free method that extracts task-relevant visual signals through attention contrasting at the pixel level.
arXiv Detail & Related papers (2025-09-08T09:20:04Z) - ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs [98.27348724529257]
We introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions.<n>Models trained with the ViCrit Task exhibit substantial gains across a variety of vision-language models benchmarks.
arXiv Detail & Related papers (2025-06-11T19:16:54Z) - Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought [83.89629325805505]
We introduce Argus to address limitations with a new visual attention grounding mechanism.<n>Our approach employs object-centric grounding as visual chain-of-thought signals, enabling more effective goal-conditioned visual attention.
arXiv Detail & Related papers (2025-05-29T17:59:56Z) - Visual Abstract Thinking Empowers Multimodal Reasoning [11.70318717106245]
Images usually convey richer detail than text, but often include redundant information which downgrades multimodal reasoning performance.<n>Inspired by this cognitive strategy, we introduce Visual Abstract Thinking (VAT)<n>VAT prompts Multimodal Large Language Models (MLLMs) with visual abstract instead of explicit verbal thoughts or elaborate guidance.<n> Experimental results show that VAT consistently empowers different models, and achieves an average gain of 17% over GPT-4o baseline.
arXiv Detail & Related papers (2025-05-26T16:06:35Z) - A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs [3.2228025627337864]
This paper introduces a structured evaluation framework to dissect the perception-reasoning interface in Vision-Language Models (VLMs)<n>We propose three distinct evaluation paradigms, mirroring human problem-solving strategies.<n>Applying this framework, we demonstrate that CA, leveraging powerful language models for reasoning over rich, independently generated descriptions, achieves new state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2025-01-23T12:42:42Z) - Retrieval-Based Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios [69.00444996464662]
We propose RIV-CoT, a Retrieval-Based Interleaved Visual Chain-of-Thought method that enables vision-language models to reason using visual crops corresponding to relevant entities.<n>Our experiments demonstrate that RIV-CoT improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over vanilla CoT prompting.
arXiv Detail & Related papers (2025-01-08T18:31:16Z) - Beyond the Doors of Perception: Vision Transformers Represent Relations Between Objects [30.09778169168547]
Vision transformers (ViTs) have achieved state-of-the-art performance in a variety of settings.
However, they exhibit surprising failures when performing tasks involving visual relations.
arXiv Detail & Related papers (2024-06-22T22:43:10Z) - Interpretable Visual Question Answering via Reasoning Supervision [4.76359068115052]
Transformer-based architectures have recently demonstrated remarkable performance in the Visual Question Answering (VQA) task.
We propose a novel architecture for visual question answering that leverages common sense reasoning as a supervisory signal.
We demonstrate both quantitatively and qualitatively that the proposed approach can boost the model's visual perception capability and lead to performance increase.
arXiv Detail & Related papers (2023-09-07T14:12:31Z) - DiMBERT: Learning Vision-Language Grounded Representations with
Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.