CoRGI: Verified Chain-of-Thought Reasoning with Post-hoc Visual Grounding
- URL: http://arxiv.org/abs/2508.00378v2
- Date: Sat, 11 Oct 2025 09:17:00 GMT
- Title: CoRGI: Verified Chain-of-Thought Reasoning with Post-hoc Visual Grounding
- Authors: Shixin Yi, Lin Shang
- Abstract summary: We present CoRGI (Chain of Reasoning with Grounded Insights), a framework that enhances reasoning reliability through post-hoc verification of chain-of-thought outputs.
- Score: 1.6257248483123767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal reasoning with vision-language models (VLMs) often suffers from hallucinations, as models tend to generate explanations after only a superficial inspection of the image. We present CoRGI (Chain of Reasoning with Grounded Insights), a framework that enhances reasoning reliability through post-hoc verification of chain-of-thought outputs. Given a VLM-generated rationale, CoRGI decomposes it into step-wise statements, grounds each step in visual evidence, and filters or corrects unsupported claims before producing the final answer. Experiments on five challenging benchmarks (VCR, ScienceQA, MMMU, MathVista, and HallusionBench) demonstrate that CoRGI consistently improves both answer accuracy and explanation faithfulness across multiple VLM backbones, including Qwen-2.5VL, LLaVA-1.6, and Gemma3-12B. Beyond quantitative gains, qualitative analyses further illustrate how the verification process reduces hallucination and strengthens interpretability, suggesting that post-hoc visual grounding is a promising direction for building more trustworthy and transparent multimodal reasoning systems.
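As a rough illustration of the pipeline described in the abstract, the sketch below shows one way such a post-hoc verification loop could be wired around a generic VLM. The `vlm` callable, the prompts, and the line-based step splitting are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a CoRGI-style post-hoc verification loop.
# Assumption: `vlm` is any callable mapping (image, text prompt) -> text response.
# Prompts and helper logic below are hypothetical, not the paper's code.

def corgi_style_answer(vlm, image, question):
    # 1. Obtain an initial chain-of-thought rationale from the VLM.
    rationale = vlm(image, f"Question: {question}\nThink step by step.")

    # 2. Decompose the rationale into step-wise statements (one per line here).
    steps = [s.strip() for s in rationale.splitlines() if s.strip()]

    verified_steps = []
    for step in steps:
        # 3. Ground the step: ask for visual evidence supporting the claim.
        evidence = vlm(image, "What visual evidence supports this claim: "
                              f"'{step}'? Reply NONE if the image does not support it.")
        if evidence.strip().upper().startswith("NONE"):
            # 4. Filter or correct unsupported claims before answering.
            step = vlm(image, f"Rewrite this claim so it matches the image: '{step}'")
        verified_steps.append(step)

    # 5. Produce the final answer conditioned only on the verified steps.
    context = "\n".join(verified_steps)
    return vlm(image, f"Verified reasoning:\n{context}\nQuestion: {question}\nAnswer:")
```

In this reading, verification is entirely post hoc: the rationale is generated first and only then checked and repaired step by step before the answer is produced.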
Related papers
- See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs [24.90876091319589]
We present an iterative, training-free, plug-and-play framework for visually-grounded multimodal reasoning. Our key idea is to supervise each reasoning step at test time with visual evidence. Our method achieves 16.5%-29.5% improvements on TreeBench and 13.7% RH-AUC gains on RH-Bench.
arXiv Detail & Related papers (2026-02-25T02:13:59Z) - Seeing Through the Chain: Mitigate Hallucination in Multimodal Reasoning Models via CoT Compression and Contrastive Preference Optimization [78.94590726578014]
Multimodal reasoning models (MLRMs) remain prone to hallucinations, and effective solutions are still underexplored. We propose C3PO, a training-based mitigation framework comprising CoT Compression and Contrastive Preference Optimization.
arXiv Detail & Related papers (2026-02-03T11:00:55Z) - LogicGaze: Benchmarking Causal Consistency in Visual Narratives via Counterfactual Verification [41.99844472131922]
We introduce LogicGaze, a novel benchmark framework designed to rigorously interrogate whether Vision-Language Models can validate sequential causal chains against visual inputs. Our tripartite evaluation protocol exposes significant vulnerabilities in state-of-the-art VLMs such as Qwen2.5-VL-72B. LogicGaze advocates for robust, trustworthy multimodal reasoning, with all resources publicly available in an anonymized repository.
arXiv Detail & Related papers (2026-01-30T20:28:01Z) - CASHEW: Stabilizing Multimodal Reasoning via Iterative Trajectory Aggregation [6.356820150960838]
We introduce two complementary approaches inspired by test-time scaling to stabilize vision-language models. CASHEW is an inference-time framework that stabilizes reasoning by iteratively aggregating multiple candidate trajectories into higher-quality reasoning traces. CASHEW-RL is trained using Group Sequence Policy Optimization (GSPO) with a composite reward that encourages correct answers grounded in minimal yet sufficient visual evidence.
arXiv Detail & Related papers (2026-01-12T21:24:45Z) - Look As You Think: Unifying Reasoning and Visual Evidence Attribution for Verifiable Document RAG via Reinforcement Learning [55.232400251303794]
Look As You Think (LAT) is a reinforcement learning framework that trains models to produce verifiable reasoning paths with consistent attribution. LAT consistently improves the vanilla model in both single- and multi-image settings, yielding average gains of 8.23% in soft exact match (EM) and 47.0% in IoU@0.5.
arXiv Detail & Related papers (2025-11-15T02:50:23Z) - S-Chain: Structured Visual Chain-of-Thought For Medicine [81.97605645734741]
We introduce S-Chain, the first large-scale dataset of 12,000 expert-annotated medical images with bounding boxes and structured visual CoT (SV-CoT). The dataset further supports 16 languages, totaling over 700k VQA pairs for broad multilingual applicability. S-Chain establishes a new benchmark for grounded medical reasoning and paves the way toward more trustworthy and explainable medical vision-language models.
arXiv Detail & Related papers (2025-10-26T15:57:14Z) - VisRAG 2.0: Evidence-Guided Multi-Image Reasoning in Visual Retrieval-Augmented Generation [64.82775032985485]
Visual retrieval-augmented generation (VRAG) augments vision-language models (VLMs) with external visual knowledge to ground reasoning and reduce hallucinations. Yet current VRAG systems often fail to reliably perceive and integrate evidence across multiple images, leading to weak grounding and erroneous conclusions. We propose EVisRAG, an end-to-end framework that learns to reason with evidence-guided multi-image inputs to address this issue.
arXiv Detail & Related papers (2025-10-10T13:34:23Z) - CoFFT: Chain of Foresight-Focus Thought for Visual Language Models [61.34272727005052]
Chain of Foresight-Focus Thought (CoFFT) is a training-free approach that enhances visual reasoning by emulating human visual cognition. These stages function iteratively, creating an interdependent cycle where reasoning guides visual focus and visual focus informs subsequent reasoning. Empirical results across multiple benchmarks using Qwen2.5-VL, InternVL-2.5, and Llava-Next demonstrate consistent performance improvements of 3.1-5.8% with a controllable increase in computational overhead.
arXiv Detail & Related papers (2025-09-26T07:46:30Z) - A Survey on Latent Reasoning [100.54120559169735]
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities. CoT reasoning that verbalizes intermediate steps limits the model's expressive bandwidth. Latent reasoning tackles this bottleneck by performing multi-step inference entirely in the model's continuous hidden state.
arXiv Detail & Related papers (2025-07-08T17:29:07Z) - CLATTER: Comprehensive Entailment Reasoning for Hallucination Detection [60.98964268961243]
We propose that guiding models to perform a systematic and comprehensive reasoning process allows models to execute much finer-grained and accurate entailment decisions. We define a 3-step reasoning process, consisting of (i) claim decomposition, (ii) sub-claim attribution and entailment classification, and (iii) aggregated classification, showing that such guided reasoning indeed yields improved hallucination detection (a minimal sketch of this three-step check appears after this list).
arXiv Detail & Related papers (2025-06-05T17:02:52Z) - RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought [6.037123011622866]
RSVP is a framework that unifies multi-step multimodal reasoning with grounded visual understanding. RSVP exploits MLLMs' inherent localization capabilities, enabling the models to not only reason about objects but also generate structured visual representations. Our experiments demonstrate RSVP's state-of-the-art performance: it surpasses prior methods by up to +6.5 gIoU on ReasonSeg and achieves 49.7 mAP on SegInW under zero-shot settings.
arXiv Detail & Related papers (2025-06-04T02:07:40Z) - Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning [96.01617809845396]
Ground-R1 is a reinforcement learning framework that enables grounded visual reasoning without requiring explicit evidence or rationale annotations. Ground-R1 achieves superior performance and exhibits emergent cognitive behaviors such as uncertainty awareness, spatial perception, and iterative refinement.
arXiv Detail & Related papers (2025-05-26T17:51:47Z) - Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps [56.76175383189738]
We introduce ReasonMap, a benchmark designed to assess the fine-grained visual understanding and spatial reasoning abilities of MLLMs. ReasonMap encompasses high-resolution transit maps from 30 cities across 13 countries and includes 1,008 question-answer pairs spanning two question types and three templates. Comprehensive evaluations of 15 popular MLLMs, including both base and reasoning variants, reveal a counterintuitive pattern.
arXiv Detail & Related papers (2025-05-24T12:33:52Z) - GRIT: Teaching MLLMs to Think with Images [22.74533687444133]
Grounded Reasoning with Images and Texts (GRIT) is a novel method for training MLLMs to think with images. GRIT generates reasoning chains that interleave natural language and explicit bounding box coordinates. GRIT achieves exceptional data efficiency, requiring as few as 20 image-question-answer triplets from existing datasets.
arXiv Detail & Related papers (2025-05-21T17:54:49Z) - Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning [19.28434717501445]
Visual reasoning abilities play a crucial role in understanding complex multimodal data. Existing methods improve VLM reasoning via Chain-of-Thought supervised fine-tuning. We propose Reason-RFT, a novel reinforcement fine-tuning framework.
arXiv Detail & Related papers (2025-03-26T17:38:06Z) - Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning [53.790502697674754]
We propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages. TVC helps the model retain attention to the visual components throughout the reasoning. Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-03-17T16:45:12Z) - Retrieval-Based Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios [69.00444996464662]
We propose RIV-CoT, a Retrieval-Based Interleaved Visual Chain-of-Thought method that enables vision-language models to reason using visual crops corresponding to relevant entities. Our experiments demonstrate that RIV-CoT improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over vanilla CoT prompting.
arXiv Detail & Related papers (2025-01-08T18:31:16Z) - ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom [41.369481426130186]
We introduce a novel visual reasoning framework named ProReason. ProReason features decoupled vision-reasoning capabilities and multi-run proactive perception. Our experiments demonstrate that ProReason outperforms existing multi-step reasoning frameworks on various benchmarks.
arXiv Detail & Related papers (2024-10-18T03:22:06Z) - See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning [60.43585179885355]
We propose a novel framework named Interactive Prompting Visual Reasoner (IPVR) for few-shot knowledge-based visual reasoning.
IPVR contains three stages: see, think, and confirm.
We conduct experiments on a range of knowledge-based visual reasoning datasets.
arXiv Detail & Related papers (2023-01-12T18:59:50Z)
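As referenced in the CLATTER entry above, the following minimal sketch illustrates its three-step guided entailment check (claim decomposition, per-claim entailment classification, aggregated classification). The `decompose` and `nli` callables, the label names, and the 0.5 support threshold are illustrative assumptions rather than the paper's implementation.

```python
# Illustrative sketch of a CLATTER-style hallucination check (not the authors' code).
# Assumptions: `decompose` maps text -> list of atomic sub-claims, and `nli` maps
# (premise, hypothesis) -> one of {"entailment", "neutral", "contradiction"}.

def clatter_style_check(decompose, nli, source_text, answer_text):
    # (i) Claim decomposition: split the generated answer into atomic sub-claims.
    sub_claims = decompose(answer_text)

    # (ii) Sub-claim attribution and entailment classification against the source.
    labels = [nli(source_text, claim) for claim in sub_claims]

    # (iii) Aggregated classification: flag the answer if any sub-claim is
    # contradicted, or if too few sub-claims are supported (0.5 is an arbitrary
    # threshold chosen for this sketch).
    contradicted = any(lbl == "contradiction" for lbl in labels)
    support_rate = sum(lbl == "entailment" for lbl in labels) / max(len(labels), 1)
    return {
        "hallucinated": contradicted or support_rate < 0.5,
        "support_rate": support_rate,
        "per_claim": list(zip(sub_claims, labels)),
    }
```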
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.