See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs
- URL: http://arxiv.org/abs/2602.21497v1
- Date: Wed, 25 Feb 2026 02:13:59 GMT
- Title: See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs
- Authors: Yongchang Zhang, Xianzheng Ma, Tianyi Liu, Guangquan Zhou, Yang Chen,
- Abstract summary: We present an iterative, training-free, plug-and-play framework for visually-grounded multimodal reasoning. Our key idea is to supervise each reasoning step at test time with visual evidence. Our method achieves 16.5%-29.5% improvements on TreeBench and 13.7% RH-AUC gains on RH-Bench.
- Score: 24.90876091319589
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent large vision-language models (LVLMs) have demonstrated impressive reasoning ability by generating long chain-of-thought (CoT) responses. However, CoT reasoning in multimodal contexts is highly vulnerable to visual hallucination propagation: once an intermediate reasoning step becomes inconsistent with the visual evidence, subsequent steps, even if logically valid, can still lead to incorrect final answers. Existing solutions attempt to mitigate this issue by training models to "think with images" via reinforcement learning (RL). While effective, these methods are costly, model-specific, and difficult to generalize across architectures. In contrast, we present a lightweight method that bypasses RL training and provides an iterative, training-free, plug-and-play framework for visually-grounded multimodal reasoning. Our key idea is to supervise each reasoning step at test time with visual evidence, ensuring that every decoded token is justified by corresponding visual cues. Concretely, we construct a textual visual-evidence pool that guides the model's reasoning generation. When the existing evidence is insufficient, a visual decider module dynamically extracts additional relevant evidence from the image based on the ongoing reasoning context, expanding the pool until the model achieves sufficient visual certainty to terminate reasoning and produce the final answer. Extensive experiments on multiple LVLM backbones and benchmarks demonstrate the effectiveness of our approach. Our method achieves 16.5%-29.5% improvements on TreeBench and 13.7% RH-AUC gains on RH-Bench, substantially reducing hallucination rates while improving reasoning accuracy without additional training.
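The loop the abstract describes (a growing textual evidence pool checked by a visual decider) can be read as the minimal sketch below. This is an interpretation of the abstract, not the authors' code; `lvlm_generate`, `visual_decider_sufficient`, and `extract_evidence` are hypothetical stand-ins for calls into whatever LVLM backbone is wrapped.

```python
# Minimal sketch of the test-time loop, assuming three hypothetical helpers
# that stand in for LVLM backbone calls (not the paper's actual API).

def lvlm_generate(image, prompt: str) -> str:
    return "step: the object left of the cup is red"  # placeholder decode

def visual_decider_sufficient(image, question, pool, reasoning) -> bool:
    return len(pool) >= 2  # placeholder sufficiency judgment

def extract_evidence(image, reasoning: str) -> str:
    return "a red block sits left of the cup"  # placeholder textual cue

def visually_grounded_reason(image, question: str, max_rounds: int = 5) -> str:
    pool: list[str] = []  # textual visual-evidence pool
    prompt = question
    for _ in range(max_rounds):
        # Condition each reasoning step on the accumulated evidence.
        prompt = (
            f"Question: {question}\nVisual evidence:\n"
            + "\n".join(f"- {e}" for e in pool)
            + "\nReason step by step, citing only the evidence above."
        )
        reasoning = lvlm_generate(image, prompt)
        # Terminate once the decider judges the pooled evidence sufficient.
        if visual_decider_sufficient(image, question, pool, reasoning):
            break
        # Otherwise expand the pool with a cue tied to the current context.
        pool.append(extract_evidence(image, reasoning))
    return lvlm_generate(image, prompt + "\nFinal answer:")

print(visually_grounded_reason(None, "What color is the block left of the cup?"))
```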
Related papers
- Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs [55.61018839017648]
Chain-of-thought (CoT) reasoning has substantially improved multimodal large language models (MLLMs) on complex reasoning tasks. Existing approaches largely rely on long textual reasoning trajectories and provide limited mechanisms for learning stable visual attention policies. We propose SAYO, a visual reasoning model trained with a reinforcement learning framework that introduces a region-level visual attention-based reward.
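The summary above does not spell out the reward; one plausible shape for a region-level visual attention reward, blending answer correctness with the overlap between the attended region and an annotated evidence box, is sketched below. The IoU form and the weight `alpha` are assumptions, not SAYO's published formulation.

```python
# Hypothetical region-level attention reward: task correctness plus IoU
# between the model's attended region and an annotated evidence box.

def iou(a: tuple, b: tuple) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def attention_reward(correct: bool, attended_box, evidence_box,
                     alpha: float = 0.5) -> float:
    # Reward correct answers, and additionally reward attending to the
    # right region of the image.
    return float(correct) + alpha * iou(attended_box, evidence_box)

# e.g. attention_reward(True, (10, 10, 50, 50), (20, 20, 60, 60))
```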
arXiv Detail & Related papers (2026-02-09T03:33:23Z)
- CASHEW: Stabilizing Multimodal Reasoning via Iterative Trajectory Aggregation [6.356820150960838]
We introduce two complementary approaches inspired by test-time scaling to stabilize vision-language models. CASHEW is an inference-time framework that stabilizes reasoning by iteratively aggregating multiple candidate trajectories into higher-quality reasoning traces. CASHEW-RL is trained using Group Sequence Policy Optimization (GSPO) with a composite reward that encourages correct answers grounded in minimal yet sufficient visual evidence.
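A minimal sketch of the inference-time aggregation idea, assuming "aggregation" means repeatedly fusing sampled candidate traces into one higher-quality trace; `sample_trace` and `fuse` are hypothetical placeholders, not CASHEW's interface.

```python
import random

# Hypothetical stand-ins for model calls; not CASHEW's actual interface.
def sample_trace(model, image, question, seed=None) -> str:
    return f"candidate trace {random.random():.3f}"  # placeholder decode

def fuse(model, traces: list[str]) -> str:
    # e.g. prompt the model to merge the candidates into one consistent
    # trace; here a trivial placeholder just picks one.
    return max(traces)

def aggregate_reasoning(model, image, question, k: int = 4, rounds: int = 3) -> str:
    best = None
    for _ in range(rounds):
        # Sample k fresh candidate trajectories, optionally seeded by the
        # current best trace, then aggregate them into a higher-quality one.
        candidates = [sample_trace(model, image, question, best) for _ in range(k)]
        best = fuse(model, candidates + ([best] if best else []))
    return best
```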
arXiv Detail & Related papers (2026-01-12T21:24:45Z)
- Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts [74.47786985522762]
We identify a critical failure mode termed textual inertia, where models tend to blindly adhere to erroneous text while neglecting conflicting visual evidence. We propose the LogicGraph Perturbation Protocol, which structurally injects perturbations into the reasoning chains of diverse LMMs. Results reveal that models successfully self-correct in fewer than 10% of cases and predominantly succumb to blind textual error propagation.
arXiv Detail & Related papers (2026-01-07T16:39:34Z)
- Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images [53.373427633330515]
We propose DRIM, a model that enables deep but reliable multi-turn reasoning when thinking with images in its multimodal CoT. Based on a high-resolution image dataset, we construct high-difficulty, verifiable visual question-answer pairs. In the SFT stage, we collect tool trajectories as cold-start data to guide a multi-turn reasoning pattern. In the RL stage, we introduce redundancy-penalized policy optimization, which incentivizes the model to develop a self-reflective reasoning pattern.
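The summary leaves the redundancy penalty undefined; one plausible reading, subtracting a penalty for duplicated tool calls from a correctness reward, is sketched below as an assumption rather than DRIM's actual objective.

```python
def redundancy_penalized_reward(correct: bool, tool_calls: list[str],
                                penalty: float = 0.1) -> float:
    # Count tool calls that exactly repeat an earlier call.
    redundant = len(tool_calls) - len(set(tool_calls))
    return float(correct) - penalty * redundant

# e.g. redundancy_penalized_reward(True, ["crop(a)", "crop(a)", "zoom(b)"])
# -> 0.9: the answer is correct, but one duplicated crop is penalized.
```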
arXiv Detail & Related papers (2025-12-19T07:44:43Z)
- See, Think, Learn: A Self-Taught Multimodal Reasoner [3.443084677278651]
We propose a simple yet effective self-training framework called See-Think-Learn (STL). At its core, STL introduces a structured reasoning template that encourages the model to see before thinking. We augment the training data with negative rationales to enhance the model's ability to distinguish between correct and misleading responses.
arXiv Detail & Related papers (2025-12-02T06:30:10Z)
- Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models [56.851611990473174]
Reasoning over dynamic visual content remains a central challenge for multimodal language models. We propose a reinforcement learning approach that enhances both temporal precision and reasoning consistency. The resulting model, Video-R2, achieves consistently higher TAC, VAS, and accuracy across multiple benchmarks.
arXiv Detail & Related papers (2025-11-28T18:59:58Z)
- DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning [11.952788515297913]
DeFacto is a counterfactual reasoning framework that jointly enforces accurate answering and faithful reasoning. We develop a pipeline that automatically localizes question-relevant evidence and constructs positive, counterfactual, and random variants. Experiments on diverse benchmarks demonstrate that DeFacto substantially improves both answer accuracy and reasoning faithfulness.
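The variant construction can be pictured with a short Pillow sketch, under the assumption that "counterfactual" masks the question-relevant evidence region and "random" masks an unrelated region; the paper's actual pipeline may differ.

```python
import random
from PIL import Image, ImageDraw

def mask_region(img: Image.Image, box) -> Image.Image:
    out = img.copy()
    ImageDraw.Draw(out).rectangle(box, fill="black")  # occlude the region
    return out

def make_variants(img: Image.Image, evidence_box):
    w, h = img.size
    x1, y1 = random.randint(0, w // 2), random.randint(0, h // 2)
    random_box = (x1, y1, min(w, x1 + w // 4), min(h, y1 + h // 4))
    return {
        "positive": img.copy(),                            # evidence intact
        "counterfactual": mask_region(img, evidence_box),  # evidence removed
        "random": mask_region(img, random_box),            # control masking
    }

# e.g. make_variants(Image.open("example.jpg"), evidence_box=(80, 40, 220, 160))
```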
arXiv Detail & Related papers (2025-09-25T08:58:10Z)
- Reasoning-Aligned Perception Decoupling for Scalable Multi-modal Reasoning [95.44766931218896]
Multi-modal large language models (MLLMs) still lag behind text-only models in reasoning ability. We introduce Perception-Reasoning Decoupling, which modularizes the MLLM's reasoning component and makes it easily replaceable. We propose a novel reinforcement learning algorithm called Visual Perception Optimization (VPO) to align the MLLM's perceptual output with the final reasoning task.
arXiv Detail & Related papers (2025-06-05T02:28:07Z)
- Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT). Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z)
- OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles [91.88062410741833]
We introduce OpenVLThinker, one of the first open-source large vision-language models (LVLMs) to exhibit sophisticated chain-of-thought reasoning. We show that OpenVLThinker-7B consistently advances performance across six benchmarks demanding mathematical and general reasoning.
arXiv Detail & Related papers (2025-03-21T17:52:43Z)
- Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning [53.790502697674754]
We propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages. TVC helps the model retain attention to the visual components throughout the reasoning process. Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks.
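One way to picture "take-along" conditioning is a prompt builder that re-presents the image at intervals along the reasoning trace. The chat-message schema and the fixed `reinject_every` interval below are generic assumptions, not TVC's implementation.

```python
def build_messages(image_ref: str, question: str, steps: list[str],
                   reinject_every: int = 3) -> list[dict]:
    """Build a chat history that re-presents the image between reasoning
    stages so late steps stay visually grounded."""
    messages = [{"role": "user", "content": [
        {"type": "image", "image": image_ref},
        {"type": "text", "text": question},
    ]}]
    for i, step in enumerate(steps, start=1):
        messages.append({"role": "assistant",
                         "content": [{"type": "text", "text": step}]})
        if i % reinject_every == 0:
            # Take the image along at this critical stage.
            messages.append({"role": "user", "content": [
                {"type": "image", "image": image_ref},
                {"type": "text", "text": "Continue, grounding each step in the image."},
            ]})
    return messages
```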
arXiv Detail & Related papers (2025-03-17T16:45:12Z)
- Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models [36.119299938503936]
Large vision-language models (LVLMs) have shown promising performance on a variety of vision-language tasks.
They remain susceptible to hallucinations, generating outputs misaligned with visual content or instructions.
We propose reflective instruction tuning, which integrates rationale learning into visual instruction tuning.
arXiv Detail & Related papers (2024-07-16T06:32:45Z)