Imagination Helps Visual Reasoning, But Not Yet in Latent Space
- URL: http://arxiv.org/abs/2602.22766v1
- Date: Thu, 26 Feb 2026 08:56:23 GMT
- Title: Imagination Helps Visual Reasoning, But Not Yet in Latent Space
- Authors: You Li, Chi Chen, Yanghao Li, Fanhu Zeng, Kaiyu Huang, Jinan Xu, Maosong Sun
- Abstract summary: We investigate the validity of latent reasoning using Causal Mediation Analysis. We show that latent tokens encode limited visual information and exhibit high similarity. We propose a straightforward alternative named CapImagine, which teaches the model to explicitly imagine using text.
- Score: 65.80396132375571
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Latent visual reasoning aims to mimic the human imagination process by mediating through the hidden states of Multimodal Large Language Models. While recognized as a promising paradigm for visual reasoning, the underlying mechanisms driving its effectiveness remain unclear. Motivated to demystify the true source of its efficacy, we investigate the validity of latent reasoning using Causal Mediation Analysis. We model the process as a causal chain: the input as the treatment, the latent tokens as the mediator, and the final answer as the outcome. Our findings uncover two critical disconnections: (a) Input-Latent Disconnect: dramatic perturbations of the input result in negligible changes to the latent tokens, suggesting that latent tokens do not effectively attend to the input sequence. (b) Latent-Answer Disconnect: perturbations of the latent tokens yield minimal impact on the final answer, indicating the limited causal effect latent tokens impose on the outcome. Furthermore, extensive probing analysis reveals that latent tokens encode limited visual information and exhibit high similarity. Consequently, we challenge the necessity of latent reasoning and propose a straightforward alternative named CapImagine, which teaches the model to explicitly imagine using text. Experiments on vision-centric benchmarks show that CapImagine significantly outperforms complex latent-space baselines, highlighting the superior potential of visual reasoning through explicit imagination.
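The two perturbation probes described in the abstract can be sketched numerically. The sketch below is illustrative only: `encode` and `decode` are hypothetical stand-ins for the MLLM's forward pass (not the paper's code), deliberately wired so that the latents barely track the input and the answer barely tracks the latents, mirroring the two reported disconnects and the high pairwise similarity of latent tokens.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def relative_change(base, perturbed):
    """L2 norm of the difference, normalized by the norm of the baseline."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(base, perturbed)))
    norm = math.sqrt(sum(a * a for a in base))
    return diff / norm if norm else 0.0

# Hypothetical stand-ins for the model: encode() maps an input to four
# latent tokens, decode() maps latent tokens to answer scores. In the
# real analysis these would be forward passes through the MLLM.
def encode(x):
    # Latents depend only weakly on the input (Input-Latent Disconnect).
    return [[0.9 + 0.01 * xi for xi in x] for _ in range(4)]

def decode(latents):
    # The answer depends only weakly on the latents (Latent-Answer Disconnect).
    return [1.0 + 0.01 * sum(tok) for tok in latents]

random.seed(0)
x = [random.random() for _ in range(8)]
x_pert = [xi + 1.0 for xi in x]  # "dramatic" perturbation of the input

# Probe (a): how much do the latents move when the input moves?
lat, lat_pert = encode(x), encode(x_pert)
input_latent_effect = relative_change(lat[0], lat_pert[0])

# Probe (b): how much does the answer move when the latents are perturbed?
lat_noised = [[v + random.gauss(0, 0.5) for v in tok] for tok in lat]
latent_answer_effect = relative_change(decode(lat), decode(lat_noised))

# Probing analysis: pairwise cosine similarity among latent tokens.
pairwise = [cosine(lat[i], lat[j]) for i in range(4) for j in range(i + 1, 4)]

print(f"input->latent effect:  {input_latent_effect:.3f}")
print(f"latent->answer effect: {latent_answer_effect:.3f}")
print(f"min pairwise cosine:   {min(pairwise):.3f}")
```

In this toy setup both effect sizes come out near zero while the minimum pairwise cosine is near one; the paper reports the analogous pattern when measuring real latent tokens, which is what motivates replacing latent reasoning with explicit textual imagination.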
Related papers
- How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision? [45.11635323173876]
We conduct a comprehensive analysis of latent reasoning methods to better understand the role and behavior of latent representations in the process. We find that while latent representations can encode multiple possibilities, the reasoning process does not faithfully implement structured search. Our findings reveal a trade-off associated with supervision strength: stronger supervision mitigates shortcut behavior but restricts the ability of latent representations to maintain diverse hypotheses.
arXiv Detail & Related papers (2026-02-25T22:00:59Z) - Seeing Through the Chain: Mitigate Hallucination in Multimodal Reasoning Models via CoT Compression and Contrastive Preference Optimization [78.94590726578014]
Multimodal reasoning models (MLRMs) remain prone to hallucinations, and effective solutions are still underexplored. We propose C3PO, a training-based mitigation framework comprising Compression and Preference Optimization.
arXiv Detail & Related papers (2026-02-03T11:00:55Z) - Forest Before Trees: Latent Superposition for Efficient Visual Reasoning [61.29300723302152]
Laser is a novel paradigm that reformulates visual deduction via Dynamic Windowed Alignment Learning (DWAL). Laser achieves state-of-the-art performance among latent reasoning methods, surpassing the strong baseline Monet by 5.03% on average.
arXiv Detail & Related papers (2026-01-11T08:30:49Z) - Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts [74.47786985522762]
We identify a critical failure mode termed textual inertia, where models tend to blindly adhere to erroneous text while neglecting conflicting visual evidence. We propose the LogicGraph Perturbation Protocol, which structurally injects perturbations into the reasoning chains of diverse LMMs. Results reveal that models successfully self-correct in less than 10% of cases and predominantly succumb to blind textual error propagation.
arXiv Detail & Related papers (2026-01-07T16:39:34Z) - On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models [27.228426342808486]
We argue that uncertain visual tokens within the vision encoder (VE) are a key factor contributing to object hallucination. We propose a simple yet effective strategy to mitigate object hallucination by modifying the VE only.
arXiv Detail & Related papers (2025-10-10T05:12:52Z) - Hacking Hallucinations of MLLMs with Causal Sufficiency and Necessity [25.725999088297392]
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across vision-language tasks. However, they may suffer from hallucinations: generating outputs that are semantically inconsistent with the input image or text. We propose a novel reinforcement learning framework guided by causal completeness.
arXiv Detail & Related papers (2025-08-06T08:09:12Z) - A Survey on Latent Reasoning [100.54120559169735]
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities. However, Chain-of-Thought (CoT) reasoning, which verbalizes intermediate steps, limits the model's expressive bandwidth. Latent reasoning tackles this bottleneck by performing multi-step inference entirely in the model's continuous hidden state.
arXiv Detail & Related papers (2025-07-08T17:29:07Z) - I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data? [76.15163242945813]
Large language models (LLMs) have led many to conclude that they exhibit a form of intelligence. We introduce a novel generative model that generates tokens on the basis of human-interpretable concepts represented as latent discrete variables.
arXiv Detail & Related papers (2025-03-12T01:21:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.