From Illusion to Intention: Visual Rationale Learning for Vision-Language Reasoning
- URL: http://arxiv.org/abs/2511.23031v1
- Date: Fri, 28 Nov 2025 09:52:56 GMT
- Title: From Illusion to Intention: Visual Rationale Learning for Vision-Language Reasoning
- Authors: Changpeng Wang, Haozhe Wang, Xi Chen, Junhan Liu, Taofeng Xue, Chong Peng, Donglian Qi, Fangzhen Lin, Yunfeng Yan,
- Abstract summary: We propose Visual Rationale Learning (ViRL), an end-to-end paradigm that grounds training in the visual rationale itself.<n>ViRL integrates (1) Process Supervision with ground-truth rationales, (2) Objective Alignment via step-level reward shaping, and (3) Fine-Grained Credit Assignment to distinguish correct, redundant, and erroneous actions.<n>This work establishes visual rationalization as a task-agnostic, process-grounded paradigm for building transparent, verifiable, and trustworthy vision-language models.
- Score: 19.84653798433995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in vision-language reasoning underscore the importance of thinking with images, where models actively ground their reasoning in visual evidence. Yet, prevailing frameworks treat visual actions as optional tools, boosting metrics but leaving reasoning ungrounded and crops ineffective. This gap gives rise to the illusion of thinking with images: models seem visually grounded but rely on context-agnostic actions that neither refine perception nor guide reasoning toward correct answers. We address this problem by reframing visual actions as core reasoning primitives rather than optional tools, which we term visual rationalization, the visual analogue of textual Chain-of-Thought. Building on this insight, we propose Visual Rationale Learning (ViRL), an end-to-end paradigm that grounds training in the visual rationale itself. ViRL integrates (1) Process Supervision with ground-truth rationales, (2) Objective Alignment via step-level reward shaping, and (3) Fine-Grained Credit Assignment to distinguish correct, redundant, and erroneous actions. By ensuring each action contributes meaningfully to the reasoning chain, ViRL enables models to "get the right answer for the right visual reason". Trained purely with end-to-end RL, ViRL achieves state-of-the-art results across benchmarks spanning perception, hallucination, and reasoning. This work establishes visual rationalization as a task-agnostic, process-grounded paradigm for building transparent, verifiable, and trustworthy vision-language models.
Related papers
- Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs [60.93949629734977]
We propose Visual Contrastive Self-Taught Reasoner (VC-STaR) to mitigate hallucinations in model-generated rationales.<n>We collect a diverse suite of VQA datasets, curate contrastive pairs according to multi-modal similarity, and generate rationales using VC-STaR.<n>Extensive experiments show that VC-STaR not only outperforms existing self-improving approaches but also surpasses models finetuned on the SoTA visual reasoning datasets.
arXiv Detail & Related papers (2026-03-03T03:18:31Z) - Think Visually, Reason Textually: Vision-Language Synergy in ARC [94.15522924153264]
ARC-AGI is a rigorous testbed for conceptual rule induction and transfer to novel tasks.<n>Most existing methods treat ARC-AGI as a purely textual reasoning task, overlooking the fact that humans rely heavily on visual abstraction.<n>We introduce two synergistic strategies: Vision-Language Synergy Reasoning (VLSR) and Modality-Switch Self-Correction (MSSC)<n>Our findings suggest that unifying visual abstraction with linguistic reasoning is a crucial step toward achieving generalizable, human-like intelligence.
arXiv Detail & Related papers (2025-11-19T18:59:04Z) - BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception [67.89135437537179]
We introduce BLINK-Twice, a vision-centric reasoning benchmark grounded in challenging perceptual tasks.<n>Instead of relying on external knowledge, our tasks require models to reason from visual content alone.<n>Compared to prior perception benchmarks, it moves beyond shallow perception and requires fine-grained observation and analytical reasoning.
arXiv Detail & Related papers (2025-10-10T13:14:13Z) - Self-Rewarding Vision-Language Model via Reasoning Decomposition [49.784411666601905]
Vision-Language Models (VLMs) often suffer from visual hallucinations, saying things that are not actually in the image, and language shortcuts.<n>We introduce Vision-SR1, a self-rewarding method that improves visual reasoning without relying on external visual supervisions.<n>Our experiments demonstrate that Vision-SR1 improves visual reasoning, mitigates visual hallucinations, and reduces reliance on language shortcuts.
arXiv Detail & Related papers (2025-08-27T08:01:03Z) - MiCo: Multi-image Contrast for Reinforcement Visual Reasoning [72.81576836419373]
Chain-of-Thought (CoT) reasoning can be used to link visual cues across multiple images.<n>We adapt rule-based reinforcement learning for Vision-Language Models (VLMs)<n>Our method achieves significant improvements on multi-image reasoning benchmarks and shows strong performance on general vision tasks.
arXiv Detail & Related papers (2025-06-27T17:59:27Z) - Seeing is Not Reasoning: MVPBench for Graph-based Evaluation of Multi-path Visual Physical CoT [24.085953089267772]
We show how OpenAI o3 and GPT-4o fail to grasp basic physical laws, spatial interactions, and causal effects in complex scenes.<n>We introduce MVPBench, a benchmark designed to rigorously evaluate visual physical reasoning through the lens of visual chain-of-thought (CoT)<n> Experimental results reveal a concerning trend: even cutting-edge MLLMs exhibit poor visual reasoning accuracy and weak image-text alignment in physical domains.
arXiv Detail & Related papers (2025-05-30T03:48:59Z) - Grounded Reinforcement Learning for Visual Reasoning [51.94871616778874]
We introduce ViGoRL (Visually Grounded Reinforcement Learning), a vision-language model trained with reinforcement learning.<n>Inspired by human visual decision-making, ViGoRL learns to produce spatially grounded reasoning traces.<n>Our results show that visually grounded RL is a strong paradigm for imbuing models with general-purpose visual reasoning.
arXiv Detail & Related papers (2025-05-29T17:20:26Z) - DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning [11.242852367476015]
DeepEyes is a model with "thinking with images" capabilities incentivized through end-to-end reinforcement learning.<n>We propose a tool-use-oriented data selection mechanism and a reward strategy to encourage successful tool-assisted reasoning trajectories.<n>DeepEyes achieves significant performance gains on fine-grained perception and reasoning benchmarks.
arXiv Detail & Related papers (2025-05-20T13:48:11Z) - A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs [3.2228025627337864]
This paper introduces a structured evaluation framework to dissect the perception-reasoning interface in Vision-Language Models (VLMs)<n>We propose three distinct evaluation paradigms, mirroring human problem-solving strategies.<n>Applying this framework, we demonstrate that CA, leveraging powerful language models for reasoning over rich, independently generated descriptions, achieves new state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2025-01-23T12:42:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.