Thinking with Deltas: Incentivizing Reinforcement Learning via Differential Visual Reasoning Policy
- URL: http://arxiv.org/abs/2601.06801v1
- Date: Sun, 11 Jan 2026 08:25:34 GMT
- Title: Thinking with Deltas: Incentivizing Reinforcement Learning via Differential Visual Reasoning Policy
- Authors: Shujian Gao, Yuan Wang, Jiangtao Yan, Zuxuan Wu, Yu-Gang Jiang
- Abstract summary: Reinforcement Learning with Verifiable Rewards has significantly advanced reasoning capabilities in Large Language Models. Existing paradigms, driven by text-centric outcome rewards, encourage models to bypass visual perception. We propose \textbf{Thinking with Deltas}, a framework driven by a \textbf{Differential Visual Reasoning Policy}.
- Score: 75.66913260900726
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced reasoning capabilities in Large Language Models. However, adapting RLVR to multimodal domains suffers from a critical \textit{perception-reasoning decoupling}. Existing paradigms, driven by text-centric outcome rewards and reasoning purely in the language medium, inadvertently encourage models to bypass visual perception. We empirically validate this through blind experiments: state-of-the-art policies maintain, or surprisingly improve, performance even when visual inputs are entirely removed. This reveals that these models degenerate into \textit{blind reasoners}, exploiting linguistic priors to generate plausible answers instead of attending to visual evidence. In response, we propose \textbf{Thinking with Deltas}, a framework driven by a \textbf{Differential Visual Reasoning Policy (DVRP)}. DVRP introduces intrinsic supervision via visual triplets comprising original, masked, and perturbed inputs. It optimizes the model to maximize reasoning divergence from masked inputs (enforcing \textit{visual sensitivity}) while minimizing divergence from perturbed inputs (ensuring \textit{visual robustness}). By aligning reasoning variations strictly with the \textit{Delta} of visual information, DVRP inherently bolsters visual understanding capabilities and significantly outperforms state-of-the-art methods on both general and medical benchmarks, without requiring external annotations or auxiliary tools.
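The triplet objective described in the abstract can be sketched as a simple divergence-based loss. This is a minimal illustration, not the paper's implementation: the function name `dvrp_loss`, the use of KL divergence as the divergence measure, and the weights `alpha`/`beta` are all assumptions for exposition.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-9):
    # KL(p || q) between two token-level probability distributions.
    p, q = np.asarray(p, dtype=float) + eps, np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def dvrp_loss(probs_original, probs_masked, probs_perturbed, alpha=1.0, beta=1.0):
    """Sketch of a DVRP-style objective over reasoning distributions.

    Maximize divergence from the masked view (visual sensitivity)
    while minimizing divergence from the perturbed view (visual
    robustness); alpha and beta are hypothetical trade-off weights.
    """
    sensitivity = kl_divergence(probs_original, probs_masked)
    robustness = kl_divergence(probs_original, probs_perturbed)
    # Loss decreases when reasoning changes once the image is masked,
    # but stays stable under benign perturbations of the image.
    return -alpha * sensitivity + beta * robustness
```

Under this sketch, a "blind reasoner" whose output distribution is unchanged when the image is masked incurs a strictly higher loss than a policy whose reasoning shifts with the visual Delta.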
Related papers
- Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs [60.93949629734977]
We propose Visual Contrastive Self-Taught Reasoner (VC-STaR) to mitigate hallucinations in model-generated rationales. We collect a diverse suite of VQA datasets, curate contrastive pairs according to multi-modal similarity, and generate rationales using VC-STaR. Extensive experiments show that VC-STaR not only outperforms existing self-improving approaches but also surpasses models finetuned on the SoTA visual reasoning datasets.
arXiv Detail & Related papers (2026-03-03T03:18:31Z) - Seeing Through the Chain: Mitigate Hallucination in Multimodal Reasoning Models via CoT Compression and Contrastive Preference Optimization [78.94590726578014]
Multimodal reasoning models (MLRMs) remain prone to hallucinations, and effective solutions are still underexplored. We propose C3PO, a training-based mitigation framework comprising CoT Compression and Contrastive Preference Optimization.
arXiv Detail & Related papers (2026-02-03T11:00:55Z) - Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts [74.47786985522762]
We identify a critical failure mode termed textual inertia, where models tend to blindly adhere to erroneous text while neglecting conflicting visual evidence. We propose the LogicGraph Perturbation Protocol, which structurally injects perturbations into the reasoning chains of diverse LMMs. Results reveal that models successfully self-correct in less than 10% of cases and predominantly succumb to blind textual error propagation.
arXiv Detail & Related papers (2026-01-07T16:39:34Z) - Stable Language Guidance for Vision-Language-Action Models [62.80963701282789]
Residual Semantic Steering (RSS) is a probabilistic framework that disentangles physical affordance from semantic execution. RSS achieves state-of-the-art robustness, maintaining performance even under adversarial linguistic perturbations.
arXiv Detail & Related papers (2026-01-07T16:16:10Z) - Perceptual-Evidence Anchored Reinforced Learning for Multimodal Reasoning [29.78411369746505]
PEARL is a dual-branch, perception-reasoning synergistic framework that strengthens multimodal reasoning by explicitly anchoring it to verified visual evidence. PEARL achieves substantial gains on multimodal reasoning benchmarks, e.g., a +9.7% improvement over the baseline and +6.6% over GRPO on MathVerse.
arXiv Detail & Related papers (2025-11-23T13:15:58Z) - Latent Visual Reasoning [40.347006722601975]
We introduce Latent Visual Reasoning (LVR), a new paradigm that enables autoregressive reasoning directly in the visual embedding space. We show that LVR substantially improves fine-grained visual understanding and perception, achieving 71.67% on MMVP compared to 66.67% with Qwen2.5-VL.
arXiv Detail & Related papers (2025-09-29T03:52:01Z) - Self-Rewarding Vision-Language Model via Reasoning Decomposition [49.784411666601905]
Vision-Language Models (VLMs) often suffer from visual hallucinations, saying things that are not actually in the image, and from language shortcuts. We introduce Vision-SR1, a self-rewarding method that improves visual reasoning without relying on external visual supervision. Our experiments demonstrate that Vision-SR1 improves visual reasoning, mitigates visual hallucinations, and reduces reliance on language shortcuts.
arXiv Detail & Related papers (2025-08-27T08:01:03Z) - D-Attn: Decomposed Attention for Large Vision-and-Language Models [29.611769371733672]
We propose Decomposed Attention (D-Attn), a more flexible attention architecture for large vision-and-language models (LVLMs). D-Attn decomposes the 1-D causal self-attention of LVLMs into visual-to-visual, textual-to-visual, and textual-to-textual attentions. Experiments and analysis validate the effectiveness of D-Attn, demonstrating significant improvements on multiple image benchmarks.
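The decomposition described above can be pictured as a block-structured attention mask over a [visual | textual] token sequence. This is only an illustrative sketch: the specific masking rules (bidirectional visual attention, causal text attention, no visual-to-textual path) are assumptions inferred from the summary, not details confirmed by the paper.

```python
import numpy as np

def decomposed_attention_mask(n_vis, n_txt):
    """Sketch of a D-Attn-style mask (True = may attend).

    Keeps the three attention paths named in the summary and drops
    visual-to-textual attention; all structural choices here are
    hypothetical.
    """
    n = n_vis + n_txt
    mask = np.zeros((n, n), dtype=bool)
    # visual-to-visual: bidirectional among visual tokens (assumed)
    mask[:n_vis, :n_vis] = True
    # textual-to-visual: every text token sees all visual tokens
    mask[n_vis:, :n_vis] = True
    # textual-to-textual: causal within the text segment
    mask[n_vis:, n_vis:] = np.tril(np.ones((n_txt, n_txt), dtype=bool))
    return mask
```

In an actual LVLM this boolean mask would gate the attention logits before the softmax; the sketch only makes the block structure of the decomposition concrete.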
arXiv Detail & Related papers (2025-02-04T00:46:11Z) - Retrieval-Based Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios [69.00444996464662]
We propose RIV-CoT, a Retrieval-Based Interleaved Visual Chain-of-Thought method that enables vision-language models to reason using visual crops corresponding to relevant entities. Our experiments demonstrate that RIV-CoT improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over vanilla CoT prompting.
arXiv Detail & Related papers (2025-01-08T18:31:16Z) - Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models [36.119299938503936]
Large vision-language models (LVLMs) have shown promising performance on a variety of vision-language tasks.
They remain susceptible to hallucinations, generating outputs misaligned with visual content or instructions.
We propose reflective instruction tuning, which integrates rationale learning into visual instruction tuning.
arXiv Detail & Related papers (2024-07-16T06:32:45Z) - Interpretable Visual Question Answering via Reasoning Supervision [4.76359068115052]
Transformer-based architectures have recently demonstrated remarkable performance in the Visual Question Answering (VQA) task.
We propose a novel architecture for visual question answering that leverages common sense reasoning as a supervisory signal.
We demonstrate both quantitatively and qualitatively that the proposed approach can boost the model's visual perception capability and lead to performance increase.
arXiv Detail & Related papers (2023-09-07T14:12:31Z) - Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings [61.04460792203266]
We introduce VCoT, a novel method that leverages chain-of-thought prompting with vision-language grounding to bridge the logical gaps within sequential data.
Our method uses visual guidance to generate synthetic multimodal infillings that add consistent and novel information to reduce the logical gaps for downstream tasks.
arXiv Detail & Related papers (2023-05-03T17:58:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.