MM-CoT: A Benchmark for Probing Visual Chain-of-Thought Reasoning in Multimodal Models
- URL: http://arxiv.org/abs/2512.08228v1
- Date: Tue, 09 Dec 2025 04:13:31 GMT
- Title: MM-CoT: A Benchmark for Probing Visual Chain-of-Thought Reasoning in Multimodal Models
- Authors: Jusheng Zhang, Kaitong Cai, Xiaoyang Guo, Sidi Liu, Qinhan Lv, Ruiqi Chen, Jing Yang, Yijia Fan, Xiaofei Sun, Jian Wang, Ziliang Chen, Liang Lin, Keze Wang
- Abstract summary: We introduce MM-CoT, a diagnostic benchmark designed to probe the visual grounding and logical coherence of CoT reasoning in multimodal models. We evaluate leading vision-language models on MM-CoT and find that even the most advanced systems struggle, revealing a sharp discrepancy between generative fluency and true reasoning fidelity.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability to perform Chain-of-Thought (CoT) reasoning marks a major milestone for multimodal models (MMs), enabling them to solve complex visual reasoning problems. Yet a critical question remains: is such reasoning genuinely grounded in visual evidence and logically coherent? Existing benchmarks emphasize generation but neglect verification, i.e., the capacity to assess whether a reasoning chain is both visually consistent and logically valid. To fill this gap, we introduce MM-CoT, a diagnostic benchmark specifically designed to probe the visual grounding and logical coherence of CoT reasoning in MMs. Instead of generating free-form explanations, models must select the sole event chain that satisfies two orthogonal constraints: (i) visual consistency, ensuring all steps are anchored in observable evidence, and (ii) logical coherence, ensuring causal and commonsense validity. Adversarial distractors are engineered to violate one of these constraints, exposing distinct reasoning failures. We evaluate leading vision-language models on MM-CoT and find that even the most advanced systems struggle, revealing a sharp discrepancy between generative fluency and true reasoning fidelity. MM-CoT shows low correlation with existing benchmarks, confirming that it measures a unique combination of visual grounding and logical reasoning. This benchmark provides a foundation for developing future models that reason not just plausibly, but faithfully and coherently within the visual world.
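The selection protocol described in the abstract reduces to a multiple-choice evaluation in which exactly one candidate chain satisfies both constraints. Below is a minimal sketch of such a harness; the item layout, field names, and the `select_chain` callable are illustrative assumptions, not the released MM-CoT schema.

```python
from collections import Counter

# Hypothetical item layout for an MM-CoT-style selection task
# (field names are assumptions, not the released benchmark schema).
EXAMPLE_ITEM = {
    "image": "scene_0001.jpg",
    "chains": ["chain A ...", "chain B ...", "chain C ...", "chain D ..."],
    "answer": 2,  # index of the sole chain satisfying both constraints
    # Which constraint each distractor violates: "visual" (not anchored in
    # observable evidence) or "logical" (causally/commonsense invalid).
    "distractor_types": {0: "visual", 1: "logical", 3: "visual"},
}

def evaluate(select_chain, items):
    """Accuracy plus a breakdown of errors by violated constraint.

    select_chain(image, chains) -> int stands in for any VLM wrapper
    that returns the index of the chain the model judges valid.
    """
    correct, errors = 0, Counter()
    for item in items:
        pred = select_chain(item["image"], item["chains"])
        if pred == item["answer"]:
            correct += 1
        else:
            # Attribute the failure to the constraint the chosen
            # distractor was engineered to violate.
            errors[item["distractor_types"].get(pred, "unknown")] += 1
    return correct / len(items), dict(errors)
```

Splitting errors by distractor type mirrors the paper's diagnostic aim: it separates failures of visual grounding from failures of logical coherence rather than reporting a single accuracy number.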
Related papers
- To Think or Not To Think, That is The Question for Large Reasoning Models in Theory of Mind Tasks [56.11584171938381]
Theory of Mind (ToM) assesses whether models can infer hidden mental states such as beliefs, desires, and intentions. Recent progress in Large Reasoning Models (LRMs) has boosted step-by-step inference in mathematics and coding. We present a systematic study of nine advanced Large Language Models (LLMs), comparing reasoning models with non-reasoning models.
arXiv Detail & Related papers (2026-02-11T08:16:13Z)
- Adversarial Yet Cooperative: Multi-Perspective Reasoning in Retrieved-Augmented Language Models [72.4149653187766]
We propose a Reasoner-Verifier framework named Adversarial Reasoning RAG (ARR). The Reasoner and Verifier reason over retrieved evidence and critique each other's logic while being guided by a process-aware advantage. Experiments on multiple benchmarks demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2026-01-08T06:57:03Z)
- Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts [74.47786985522762]
We identify a critical failure mode termed textual inertia, where models tend to blindly adhere to erroneous text while neglecting conflicting visual evidence. We propose the LogicGraph Perturbation Protocol, which structurally injects perturbations into the reasoning chains of diverse LMMs. Results reveal that models successfully self-correct in less than 10% of cases and predominantly succumb to blind textual error propagation.
arXiv Detail & Related papers (2026-01-07T16:39:34Z)
- Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models [56.851611990473174]
Reasoning over dynamic visual content remains a central challenge for large language models. We propose a reinforcement learning approach that enhances both temporal precision and reasoning consistency. The resulting model, Video-R2, achieves consistently higher TAC, VAS, and accuracy across multiple benchmarks.
arXiv Detail & Related papers (2025-11-28T18:59:58Z)
- ChainV: Atomic Visual Hints Make Multimodal Reasoning Shorter and Better [59.29940512530982]
We propose ChainV, a framework that dynamically integrates visual hints into the reasoning process. Our method significantly improves reasoning accuracy and efficiency, especially on math-intensive benchmarks.
arXiv Detail & Related papers (2025-11-21T10:11:17Z)
- ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation [79.17352367219736]
ROVER is a human-annotated benchmark that explicitly targets reciprocal cross-modal reasoning. It tests the use of one modality to guide, verify, or refine outputs in the other.
arXiv Detail & Related papers (2025-11-03T02:27:46Z)
- PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection [33.98107848014898]
We introduce PRISM-Bench, a benchmark of puzzle-based visual challenges designed to evaluate multimodal large language models (MLLMs). Unlike prior evaluations that measure only final-answer accuracy, PRISM-Bench introduces a diagnostic task: given a visual puzzle and a step-by-step chain-of-thought containing exactly one error, models must identify the first incorrect step (a minimal scoring sketch appears after this list). The puzzles in PRISM-Bench require multi-step symbolic, geometric, and analogical reasoning, resisting shortcuts based on superficial pattern matching.
arXiv Detail & Related papers (2025-10-27T17:57:52Z)
- Hallucination Detection via Internal States and Structured Reasoning Consistency in Large Language Models [7.18947815679122]
Internal State Probing and Chain-of-Thought Verification are used to detect hallucinations in large language models. We develop a unified framework that bridges the gap between the two methods. Our framework consistently and significantly outperforms strong baselines.
arXiv Detail & Related papers (2025-10-13T15:31:21Z)
- Deep Hidden Cognition Facilitates Reliable Chain-of-Thought Reasoning [33.30315111732609]
Chain-of-Thought (CoT) reasoning has demonstrated remarkable deep reasoning capabilities. However, its reliability is often undermined by the accumulation of errors in intermediate steps. This paper introduces an approach to calibrate CoT reasoning accuracy by leveraging the model's intrinsic veracity encoding.
arXiv Detail & Related papers (2025-07-14T07:41:35Z)
- A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models [58.32070787537946]
Chain-of-thought (CoT) reasoning enhances the performance of large language models. We present the first comprehensive study of CoT faithfulness in large vision-language models.
arXiv Detail & Related papers (2025-05-29T18:55:05Z)
- Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models [26.17300490736624]
Multimodal Large Language Models (MLLMs) are predominantly trained and tested on consistent visual-textual inputs. We propose the Multimodal Inconsistency Reasoning benchmark to assess MLLMs' ability to detect and reason about semantic mismatches. We evaluate six state-of-the-art MLLMs, showing that models with dedicated multimodal reasoning capabilities, such as o1, substantially outperform their counterparts.
arXiv Detail & Related papers (2025-02-22T01:52:37Z)
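In contrast to MM-CoT's chain-selection format, the PRISM-Bench entry above asks models to localize the first incorrect step in a given chain-of-thought. A minimal exact-match scoring sketch for that task follows; the per-puzzle step-index layout is an assumption, not the published format.

```python
def first_error_accuracy(predicted_steps, gold_steps):
    """Exact-match accuracy for locating the first incorrect CoT step.

    Both arguments are sequences of step indices, one per puzzle; a
    prediction counts only if it names exactly the gold first-error step.
    """
    if len(predicted_steps) != len(gold_steps):
        raise ValueError("one prediction per puzzle is required")
    hits = sum(p == g for p, g in zip(predicted_steps, gold_steps))
    return hits / len(gold_steps)
```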