VisRAG 2.0: Evidence-Guided Multi-Image Reasoning in Visual Retrieval-Augmented Generation
- URL: http://arxiv.org/abs/2510.09733v1
- Date: Fri, 10 Oct 2025 13:34:23 GMT
- Title: VisRAG 2.0: Evidence-Guided Multi-Image Reasoning in Visual Retrieval-Augmented Generation
- Authors: Yubo Sun, Chunyi Peng, Yukun Yan, Shi Yu, Zhenghao Liu, Chi Chen, Zhiyuan Liu, Maosong Sun
- Abstract summary: Visual retrieval-augmented generation (VRAG) augments vision-language models (VLMs) with external visual knowledge to ground reasoning and reduce hallucinations. Yet current VRAG systems often fail to reliably perceive and integrate evidence across multiple images, leading to weak grounding and erroneous conclusions. We propose EVisRAG, an end-to-end framework that learns evidence-guided multi-image reasoning to address this issue.
- Score: 64.82775032985485
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual retrieval-augmented generation (VRAG) augments vision-language models (VLMs) with external visual knowledge to ground reasoning and reduce hallucinations. Yet current VRAG systems often fail to reliably perceive and integrate evidence across multiple images, leading to weak grounding and erroneous conclusions. In this paper, we propose EVisRAG, an end-to-end framework that learns evidence-guided multi-image reasoning to address this issue. The model first observes the retrieved images and records per-image evidence, then derives the final answer from the aggregated evidence. To train EVisRAG effectively, we introduce Reward-Scoped Group Relative Policy Optimization (RS-GRPO), which binds fine-grained rewards to scope-specific tokens to jointly optimize the visual perception and reasoning abilities of VLMs. Experimental results on multiple visual question answering benchmarks demonstrate that EVisRAG delivers substantial end-to-end gains over the backbone VLM, with a 27% improvement on average. Further analysis shows that, powered by RS-GRPO, EVisRAG improves answer accuracy by precisely perceiving and localizing question-relevant evidence across multiple images and deriving the final answer from that evidence, much like a real detective.
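The abstract describes RS-GRPO only at a high level, so the exact objective is not reproduced here. As a rough illustration of the stated idea of binding separate reward signals to scope-specific tokens, the following sketch (an assumption on my part, using hypothetical "evidence" and "answer" token scopes and standard GRPO-style group normalization) computes per-token advantages in which each reward only credits the tokens inside its own scope.

```python
import numpy as np

def rs_grpo_advantages(scope_rewards, scope_masks):
    """Assign group-relative advantages to scope-specific tokens.

    scope_rewards: dict mapping a scope name (e.g. "evidence", "answer")
        to an array of shape (group_size,) with that scope's reward per rollout.
    scope_masks: dict mapping the same scope names to boolean arrays of shape
        (group_size, seq_len) marking which tokens belong to that scope.
    Returns a (group_size, seq_len) array of per-token advantages.
    """
    group_size, seq_len = next(iter(scope_masks.values())).shape
    advantages = np.zeros((group_size, seq_len), dtype=np.float32)
    for name, rewards in scope_rewards.items():
        # Group-relative normalization as in GRPO: compare each rollout's
        # reward against the group mean/std for this reward signal.
        norm = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
        # Bind the normalized advantage only to tokens inside this scope.
        advantages += norm[:, None] * scope_masks[name].astype(np.float32)
    return advantages

# Toy usage: 4 rollouts of 8 tokens; the first half of each sequence is treated
# as the evidence scope and the second half as the answer scope (hypothetical).
masks = {
    "evidence": np.tile([True] * 4 + [False] * 4, (4, 1)),
    "answer":   np.tile([False] * 4 + [True] * 4, (4, 1)),
}
rewards = {
    "evidence": np.array([1.0, 0.0, 1.0, 0.0]),  # e.g. per-image evidence correctness
    "answer":   np.array([1.0, 1.0, 0.0, 0.0]),  # e.g. final-answer exact match
}
print(rs_grpo_advantages(rewards, masks).shape)  # (4, 8)
```

In this reading, perception-oriented rewards shape only the evidence-recording tokens while answer rewards shape only the answer tokens, which is one plausible way to "jointly optimize visual perception and reasoning" without letting one signal wash out the other.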
Related papers
- DeepScan: A Training-Free Framework for Visually Grounded Reasoning in Large Vision-Language Models [17.001413023262675]
We propose DeepScan, a training-free framework that combines Hierarchical Scanning, Refocusing, and Evidence-Enhanced Reasoning. Experimental results demonstrate that DeepScan significantly boosts LVLMs in diverse visual tasks. It achieves 90.6% overall accuracy on V* when integrated with Qwen2.5-VL-7B.
arXiv Detail & Related papers (2026-03-04T09:06:47Z) - ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering [54.72902502486611]
ReAG is a Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages. ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence.
arXiv Detail & Related papers (2025-11-27T19:01:02Z) - Perceptual-Evidence Anchored Reinforced Learning for Multimodal Reasoning [29.78411369746505]
PEARL is a dual-branch, perception-reasoning synergistic framework that strengthens multimodal reasoning by explicitly anchoring it to verified visual evidence. PEARL achieves substantial gains on multimodal reasoning benchmarks, e.g., a +9.7% improvement over the baseline and +6.6% over GRPO on MathVerse.
arXiv Detail & Related papers (2025-11-23T13:15:58Z) - Look As You Think: Unifying Reasoning and Visual Evidence Attribution for Verifiable Document RAG via Reinforcement Learning [55.232400251303794]
Look As You Think (LAT) is a reinforcement learning framework that trains models to produce verifiable reasoning paths with consistent attribution. LAT consistently improves the vanilla model in both single- and multi-image settings, yielding average gains of 8.23% in soft exact match (EM) and 47.0% in IoU@0.5.
arXiv Detail & Related papers (2025-11-15T02:50:23Z) - VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning [49.610569478718226]
Multimodal reward models (RMs) have substantially improved post-training for visual generative models. VideoReward Thinker (VR-Thinker) is a thinking-with-image framework that equips the RM with visual reasoning operations and a visual memory window. Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks.
arXiv Detail & Related papers (2025-10-12T09:29:50Z) - CoRGI: Verified Chain-of-Thought Reasoning with Post-hoc Visual Grounding [1.6257248483123767]
We present CoRGI (Chain of Reasoning with Grounded Insights), a framework that enhances reasoning reliability through post-hoc verification of chain-of-thought outputs.
arXiv Detail & Related papers (2025-08-01T07:17:12Z) - Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs [69.10441885629787]
Retrieval-Augmented Generation (RAG) lifts the factuality of Large Language Models (LLMs) by injecting external knowledge. Yet it falls short on problems that demand multi-step inference; conversely, purely reasoning-oriented approaches often hallucinate or mis-ground facts. This survey synthesizes both strands under a unified reasoning-retrieval perspective.
arXiv Detail & Related papers (2025-07-13T03:29:41Z) - Visual-RFT: Visual Reinforcement Fine-Tuning [75.20572976629646]
Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers. Visual-RFT further extends the application areas of RFT to visual tasks.
arXiv Detail & Related papers (2025-03-03T18:16:32Z) - ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents [27.90338725230132]
ViDoSeek is a dataset designed to evaluate RAG performance on visually rich documents requiring complex reasoning. We propose ViDoRAG, a novel multi-agent RAG framework tailored for complex reasoning across visual documents. Notably, ViDoRAG outperforms existing methods by over 10% on the competitive ViDoSeek benchmark.
arXiv Detail & Related papers (2025-02-25T09:26:12Z) - Visual-RAG: Benchmarking Text-to-Image Retrieval Augmented Generation for Visual Knowledge Intensive Queries [30.692007887121278]
Retrieval-augmented generation (RAG) augments large language models with external knowledge to tackle knowledge-intensive questions. Visual-RAG is a question-answering benchmark that targets visually grounded, knowledge-intensive questions. We evaluate 5 open-source and 3 proprietary MLLMs, showcasing that images provide strong evidence in augmented generation.
arXiv Detail & Related papers (2025-02-23T16:23:50Z) - UniRAG: Universal Retrieval Augmentation for Large Vision Language Models [76.30799731147589]
We introduce UniRAG, a plug-and-play technique that adds relevant retrieved information to prompts as few-shot examples during inference. Unlike the common belief that Retrieval Augmentation (RA) mainly improves generation or understanding of uncommon entities, our evaluation results on the MSCOCO dataset with common entities show that both proprietary models and smaller open-source models significantly enhance their generation quality.
arXiv Detail & Related papers (2024-05-16T17:58:45Z)
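The UniRAG entry above describes a plug-and-play pattern: retrieved image-text pairs are prepended to the prompt as few-shot examples at inference time. Based only on that summary, a minimal, framework-agnostic sketch of the pattern could look as follows; the retriever interface, chat-message layout, and the "Describe this image." instruction are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RetrievedExample:
    image_path: str   # path or handle to the retrieved image
    caption: str      # its associated text (e.g. a caption)

def build_augmented_prompt(query: str,
                           examples: List[RetrievedExample]) -> List[dict]:
    """Format retrieved image-text pairs as few-shot demonstrations
    placed before the actual query, in a generic chat-message layout."""
    messages = []
    for ex in examples:
        messages.append({"role": "user",
                         "content": [{"type": "image", "path": ex.image_path},
                                     {"type": "text", "text": "Describe this image."}]})
        messages.append({"role": "assistant",
                         "content": [{"type": "text", "text": ex.caption}]})
    # The real query comes last, conditioned on the demonstrations above.
    messages.append({"role": "user",
                     "content": [{"type": "text", "text": query}]})
    return messages

def retrieval_augmented_inference(query: str,
                                  retrieve: Callable[[str, int], List[RetrievedExample]],
                                  generate: Callable[[List[dict]], str],
                                  k: int = 2) -> str:
    """Plug-and-play: any retriever and any chat-style VLM can be slotted in."""
    examples = retrieve(query, k)
    prompt = build_augmented_prompt(query, examples)
    return generate(prompt)
```

Because the augmentation happens entirely in the prompt, no model weights change, which is what makes this kind of approach usable with both proprietary and open-source VLMs.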