LogicGaze: Benchmarking Causal Consistency in Visual Narratives via Counterfactual Verification
- URL: http://arxiv.org/abs/2602.00292v1
- Date: Fri, 30 Jan 2026 20:28:01 GMT
- Title: LogicGaze: Benchmarking Causal Consistency in Visual Narratives via Counterfactual Verification
- Authors: Rory Driscoll, Alexandros Christoforos, Chadbourne Davis,
- Abstract summary: We introduce LogicGaze, a novel benchmark framework designed to rigorously interrogate whether Vision-Language Models can validate sequential causal chains against visual inputs. Our tripartite evaluation protocol exposes significant vulnerabilities in state-of-the-art VLMs such as Qwen2.5-VL-72B. LogicGaze advocates for robust, trustworthy multimodal reasoning, with all resources publicly available in an anonymized repository.
- Score: 41.99844472131922
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While sequential reasoning enhances the capability of Vision-Language Models (VLMs) to execute complex multimodal tasks, their reliability in grounding these reasoning chains within actual visual evidence remains insufficiently explored. We introduce LogicGaze, a novel benchmark framework designed to rigorously interrogate whether VLMs can validate sequential causal chains against visual inputs, specifically targeting the pervasive issue of hallucination. Curated from 40,000 video segments from ShareGPT4Video and a subset of Flickr30k imagery, LogicGaze integrates causal sequences with visually contradictory yet linguistically plausible perturbations, compelling models to verify the authenticity of each reasoning step. Our tripartite evaluation protocol - Causal Validation, Grounded Narrative Synthesis, and Perturbation Rejection - exposes significant vulnerabilities in state-of-the-art VLMs such as Qwen2.5-VL-72B. LogicGaze advocates for robust, trustworthy multimodal reasoning, with all resources publicly available in an anonymized repository.
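The abstract names the three evaluation tasks but does not spell out how each is scored. As a rough illustration only, the sketch below shows one way a Perturbation Rejection pass could be scored against a VLM judge; the data classes, field names, and the `vlm_judge` callable are assumptions made for illustration, not the authors' released interface.

```python
# Hypothetical sketch of the Perturbation Rejection task described in the
# abstract: each reasoning step is either faithful to the visual input or a
# linguistically plausible but visually contradictory perturbation, and the
# model must reject the perturbed steps. All names are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ReasoningStep:
    text: str            # one step of the causal chain
    is_perturbed: bool   # ground truth: visually contradicted or not

@dataclass
class Example:
    video_path: str
    steps: List[ReasoningStep]

def perturbation_rejection_accuracy(
    examples: List[Example],
    vlm_judge: Callable[[str, str], bool],  # (video_path, step_text) -> True if step is visually supported
) -> float:
    """Fraction of steps where the model's accept/reject decision matches the ground truth."""
    correct, total = 0, 0
    for ex in examples:
        for step in ex.steps:
            supported = vlm_judge(ex.video_path, step.text)
            # A perturbed step should be rejected (supported == False),
            # an unperturbed step should be accepted (supported == True).
            correct += int(supported != step.is_perturbed)
            total += 1
    return correct / max(total, 1)
```

Under this reading, the other two tasks would plug into the same harness: Causal Validation would score whole chains rather than single steps, and Grounded Narrative Synthesis would compare generated narratives against the visual evidence; the actual protocol details are in the paper's resources.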
Related papers
- TraceGuard: Process-Guided Firewall against Reasoning Backdoors in Large Language Models [19.148124494194317]
We propose TraceGuard, a process-guided security framework that transforms small-scale models into robust reasoning firewalls. Our approach treats the reasoning trace as an untrusted payload and establishes a defense-in-depth strategy. We demonstrate robustness against adaptive adversaries in a grey-box setting, establishing TraceGuard as a viable, low-latency security primitive.
arXiv Detail & Related papers (2026-03-02T22:19:13Z) - VIRO: Robust and Efficient Neuro-Symbolic Reasoning with Verification for Referring Expression Comprehension [51.76841625486355]
Referring Expression Comprehension (REC) aims to localize the image region corresponding to a natural-language query. Recent neuro-symbolic REC approaches leverage large language models (LLMs) and vision-language models (VLMs) to perform compositional reasoning. We introduce VIRO, a neuro-symbolic framework that embeds lightweight operator-level verifiers within reasoning steps.
arXiv Detail & Related papers (2026-01-19T07:21:19Z) - Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts [74.47786985522762]
We identify a critical failure mode termed textual inertia, where models tend to blindly adhere to the erroneous text while neglecting conflicting visual evidence. We propose the LogicGraph Perturbation Protocol that structurally injects perturbations into the reasoning chains of diverse LMMs. Results reveal that models successfully self-correct in less than 10% of cases and predominantly succumb to blind textual error propagation.
arXiv Detail & Related papers (2026-01-07T16:39:34Z) - More Than the Final Answer: Improving Visual Extraction and Logical Consistency in Vision-Language Models [74.10138874771852]
We propose PeRL-VL (Perception and Reasoning Learning for Vision-Language Models), a decoupled framework that separately improves visual perception and textual reasoning on top of RLVR. For perception, PeRL-VL introduces a VLM-based description reward that scores the model's self-generated image descriptions for faithfulness and sufficiency. For reasoning, PeRL-VL adds a text-only Reasoning SFT stage on logic-rich chain-of-thought data, enhancing coherence and logical consistency independently of vision.
arXiv Detail & Related papers (2025-12-13T23:06:18Z) - CrossCheck-Bench: Diagnosing Compositional Failures in Multimodal Conflict Resolution [20.823419395675412]
CrossCheck-Bench is a diagnostic benchmark for evaluating contradiction detection in multimodal inputs. We evaluate 13 state-of-the-art vision-language models and observe a consistent performance drop as tasks shift from perceptual matching to logical contradiction detection.
arXiv Detail & Related papers (2025-11-19T12:17:15Z) - Look As You Think: Unifying Reasoning and Visual Evidence Attribution for Verifiable Document RAG via Reinforcement Learning [55.232400251303794]
Look As You Think (LAT) is a reinforcement learning framework that trains models to produce verifiable reasoning paths with consistent attribution. LAT consistently improves the vanilla model in both single- and multi-image settings, yielding average gains of 8.23% in soft exact match (EM) and 47.0% in IoU@0.5 (a minimal sketch of the IoU@0.5 metric appears after this list).
arXiv Detail & Related papers (2025-11-15T02:50:23Z) - CoRGI: Verified Chain-of-Thought Reasoning with Post-hoc Visual Grounding [1.6257248483123767]
We present CoRGI (Chain of Reasoning with Grounded Insights), a framework that enhances reasoning reliability through post-hoc verification of chain-of-thought outputs.
arXiv Detail & Related papers (2025-08-01T07:17:12Z) - ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning in LLMs [54.154593699263074]
ProtoReasoning is a framework that enhances the reasoning ability of Large Reasoning Models by transforming problems into corresponding prototype representations. It achieves a 4.7% improvement over baseline models on logical reasoning.
arXiv Detail & Related papers (2025-06-18T07:44:09Z) - Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models [26.17300490736624]
Multimodal Large Language Models (MLLMs) are predominantly trained and tested on consistent visual-textual inputs. We propose the Multimodal Inconsistency Reasoning benchmark to assess MLLMs' ability to detect and reason about semantic mismatches. We evaluate six state-of-the-art MLLMs, showing that models with dedicated multimodal reasoning capabilities, such as o1, substantially outperform their counterparts.
arXiv Detail & Related papers (2025-02-22T01:52:37Z)
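As referenced in the Look As You Think entry above, IoU@0.5 scores predicted visual-evidence regions against ground-truth regions. The sketch below assumes axis-aligned boxes in (x1, y1, x2, y2) pixel coordinates and one ground-truth region paired with each prediction; the actual region format and matching rule used by LAT are not specified in this listing.

```python
# Minimal sketch of IoU@0.5 for visual-evidence attribution.
# Box format (x1, y1, x2, y2) and one-to-one prediction/ground-truth pairing
# are assumptions for illustration.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def iou_at_05(preds: List[Box], gts: List[Box]) -> float:
    """Fraction of predicted evidence regions whose IoU with the paired ground truth is at least 0.5."""
    assert len(preds) == len(gts)
    hits = sum(iou(p, g) >= 0.5 for p, g in zip(preds, gts))
    return hits / max(len(preds), 1)
```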