PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection
- URL: http://arxiv.org/abs/2510.23594v2
- Date: Tue, 28 Oct 2025 02:07:50 GMT
- Title: PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection
- Authors: Yusu Qian, Cheng Wan, Chao Jia, Yinfei Yang, Qingyu Zhao, Zhe Gan,
- Abstract summary: We introduce PRISM-Bench, a benchmark of puzzle-based visual challenges designed to evaluate multimodal large language models (MLLMs). Unlike prior evaluations that measure only final-answer accuracy, PRISM-Bench introduces a diagnostic task: given a visual puzzle and a step-by-step chain-of-thought containing exactly one error, models must identify the first incorrect step. The puzzles in PRISM-Bench require multi-step symbolic, geometric, and analogical reasoning, resisting shortcuts based on superficial pattern matching.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress on vision-language tasks, yet their reasoning processes sometimes remain unreliable. We introduce PRISM-Bench, a benchmark of puzzle-based visual challenges designed to evaluate not only whether models can solve problems, but how their reasoning unfolds. Unlike prior evaluations that measure only final-answer accuracy, PRISM-Bench introduces a diagnostic task: given a visual puzzle and a step-by-step chain-of-thought (CoT) containing exactly one error, models must identify the first incorrect step. This setting enables fine-grained assessment of logical consistency, error detection, and visual reasoning. The puzzles in PRISM-Bench require multi-step symbolic, geometric, and analogical reasoning, resisting shortcuts based on superficial pattern matching. Evaluations across state-of-the-art MLLMs reveal a persistent gap between fluent generation and faithful reasoning: models that produce plausible CoTs often fail to locate simple logical faults. By disentangling answer generation from reasoning verification, PRISM-Bench offers a sharper lens on multimodal reasoning competence and underscores the need for diagnostic evaluation protocols in the development of trustworthy MLLMs.
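The first-error-identification task implies a strict exact-match scoring rule. Below is a minimal sketch of one plausible item format and scorer; the schema and field names are illustrative assumptions, not PRISM-Bench's published format.

```python
# Hypothetical item schema and exact-match scorer for first-error
# detection, in the style PRISM-Bench describes. All field names are
# assumptions for illustration, not the benchmark's actual format.
from dataclasses import dataclass

@dataclass
class PuzzleItem:
    image_path: str        # the visual puzzle
    cot_steps: list[str]   # step-by-step CoT containing exactly one error
    first_error: int       # 1-indexed ground-truth index of the faulty step

def is_correct(item: PuzzleItem, predicted_step: int) -> bool:
    """Credit only an exact match on the FIRST incorrect step; later
    steps may also be wrong, but only as downstream consequences."""
    return predicted_step == item.first_error

def accuracy(items: list[PuzzleItem], preds: list[int]) -> float:
    return sum(map(is_correct, items, preds)) / len(items)
```

Because each CoT contains exactly one error by construction, exact match is unambiguous, which is what lets the benchmark disentangle reasoning verification from answer generation.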
Related papers
- On the Out-of-Distribution Generalization of Reasoning in Multimodal LLMs for Simple Visual Planning Tasks
We evaluate how well chain-of-thought approaches generalize on a simple planning task. We find that reasoning traces which combine multiple text formats yield the best (and non-trivial) OOD generalization. Purely text-based models consistently outperform those utilizing image-based inputs.
arXiv Detail & Related papers (2026-02-17T09:51:40Z)
- CoT-Seg: Rethinking Segmentation with Chain-of-Thought Reasoning and Self-Correction
This paper explores a system that can think step-by-step, look up information if needed, generate results, self-evaluate its own results, and refine them. We introduce CoT-Seg, a training-free framework that rethinks reasoning segmentation by combining chain-of-thought reasoning with self-correction; a generic sketch of such a loop follows this entry.
arXiv Detail & Related papers (2026-01-24T11:41:54Z)
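The CoT-Seg summary above outlines a generate, self-evaluate, and refine control flow. The sketch below shows that loop generically; `generate`, `evaluate`, and `refine` are placeholders for model calls, and none of these names come from the paper.

```python
# Generic self-correction loop in the spirit of the CoT-Seg summary.
# The three callables are stand-ins for model calls, not CoT-Seg's API.
from typing import Callable, TypeVar

T = TypeVar("T")

def self_correcting(
    generate: Callable[[], T],
    evaluate: Callable[[T], float],     # higher score = better result
    refine: Callable[[T, float], T],
    threshold: float = 0.9,
    max_rounds: int = 3,
) -> T:
    result = generate()
    for _ in range(max_rounds):
        score = evaluate(result)
        if score >= threshold:          # self-evaluation passes: stop
            break
        result = refine(result, score)  # otherwise revise and re-check
    return result
```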
- MM-CoT: A Benchmark for Probing Visual Chain-of-Thought Reasoning in Multimodal Models
We introduce MM-CoT, a diagnostic benchmark designed to probe the visual grounding and logical coherence of CoT reasoning in multimodal models. We evaluate leading vision-language models on MM-CoT and find that even the most advanced systems struggle, revealing a sharp discrepancy between generative fluency and true reasoning fidelity.
arXiv Detail & Related papers (2025-12-09T04:13:31Z)
- CrossCheck-Bench: Diagnosing Compositional Failures in Multimodal Conflict Resolution
CrossCheck-Bench is a diagnostic benchmark for evaluating contradiction detection in multimodal inputs. We evaluate 13 state-of-the-art vision-language models and observe a consistent performance drop as tasks shift from perceptual matching to logical contradiction detection.
arXiv Detail & Related papers (2025-11-19T12:17:15Z)
- PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies
We introduce PRISMM-Bench, the first benchmark grounded in real reviewer-flagged inconsistencies in scientific papers. We design three tasks, namely inconsistency identification, remedy, and pair matching, which assess a model's capacity to detect, correct, and reason over inconsistencies. We benchmark 21 leading LMMs, including large open-weight models (GLM-4.5V 106B, InternVL3 78B) and proprietary models (Gemini 2.5 Pro, GPT-5 with high reasoning).
arXiv Detail & Related papers (2025-10-18T13:46:26Z)
- seqBench: A Tunable Benchmark to Quantify Sequential Reasoning Limits of LLMs
We introduce seqBench, a benchmark for probing sequential reasoning limits in Large Language Models (LLMs). We find that even top-performing models systematically fail on seqBench's structured reasoning tasks despite minimal search complexity.
arXiv Detail & Related papers (2025-09-21T01:32:13Z)
- Frontier LLMs Still Struggle with Simple Reasoning Tasks
This work studies the performance of frontier language models on a broad set of "easy" reasoning problems. We create a suite of procedurally generated simple reasoning tasks, including counting, first-order logic, proof trees, and travel planning. We show that even state-of-the-art thinking models consistently fail on such problems, and for similar reasons.
arXiv Detail & Related papers (2025-07-09T22:22:49Z)
- PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts
We introduce PuzzleWorld, a large-scale benchmark of 667 puzzlehunt-style problems designed to assess step-by-step, open-ended, and creative multimodal reasoning. Most state-of-the-art models achieve only 1-2% final-answer accuracy, with the best model solving only 14% of puzzles and reaching 40% stepwise accuracy. Our error analysis reveals that current models exhibit myopic reasoning, are bottlenecked by the limitations of language-based inference, and lack the sketching capabilities crucial for visual and spatial reasoning.
arXiv Detail & Related papers (2025-06-06T16:17:09Z)
- TurnBench-MS: A Benchmark for Evaluating Multi-Turn, Multi-Step Reasoning in Large Language Models
We introduce TurnBench, a novel benchmark that evaluates multi-turn, multi-step reasoning through an interactive code-breaking task. In each episode, a model must uncover hidden logical or arithmetic rules by making sequential guesses, receiving structured feedback, and integrating clues across multiple rounds. TurnBench includes two modes: Classic, which tests standard reasoning, and Nightmare, which introduces increased complexity and requires robust inferential chains. A sketch of one such guess-and-feedback round follows this entry.
arXiv Detail & Related papers (2025-06-02T05:47:50Z)
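As referenced in the TurnBench summary above, each round pairs a guess with structured feedback. A minimal Mastermind-style sketch of such a feedback function follows, assuming a hidden numeric code; the rule and feedback format are illustrative assumptions, not TurnBench's actual protocol.

```python
# Mastermind-style feedback: an illustrative stand-in for the structured
# feedback TurnBench describes, not the benchmark's actual protocol.
def feedback(secret: list[int], guess: list[int]) -> tuple[int, int]:
    """Return (exact, partial): positions guessed exactly right, and
    digits that occur in the secret but sit in the wrong position."""
    exact = sum(s == g for s, g in zip(secret, guess))
    common = sum(min(secret.count(d), guess.count(d)) for d in set(guess))
    return exact, common - exact

# One round of an episode: the model would integrate this signal
# across turns until the hidden code is uncovered.
print(feedback([1, 2, 3, 4], [1, 3, 5, 4]))  # -> (2, 1)
```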
- THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models
We introduce approximate measures of problem-level difficulty and demonstrate that a clear relationship exists between problem difficulty and optimal token spend. We find that, in general, reasoning models are poorly calibrated, particularly on easy problems. We introduce THOUGHTTERMINATOR, a training-free black-box decoding technique that significantly improves reasoning model calibration.
arXiv Detail & Related papers (2025-04-17T22:16:30Z)
- PuzzleBench: A Fully Dynamic Evaluation Framework for Large Multimodal Models on Puzzle Solving
We propose a fully dynamic multimodal evaluation framework, named Open-ended Visual Puzzle Generation (OVPG), which aims to generate fresh, diverse, and verifiable evaluation data automatically for puzzle-solving tasks. Built upon OVPG, we construct PuzzleBench, a dynamic and scalable benchmark comprising 11,840 VQA samples; a toy generate-and-verify sketch follows this entry.
arXiv Detail & Related papers (2025-04-15T05:29:31Z)
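As noted in the PuzzleBench summary above, the point of OVPG-style generation is that every sampled item is verifiable by construction, so fresh evaluation data never goes stale. Below is a toy text-only analogue; OVPG itself generates visual puzzles, and these names are invented for illustration.

```python
# Toy generate-and-verify pair: puzzles can be sampled endlessly, and
# each carries a programmatic verifier. Purely illustrative.
import random

def make_puzzle(rng: random.Random) -> tuple[list[int], int]:
    start, step = rng.randint(1, 9), rng.randint(2, 9)
    seq = [start + i * step for i in range(4)]
    return seq, seq[-1] + step              # (shown sequence, hidden answer)

def verify(seq: list[int], proposed: int) -> bool:
    return proposed == seq[-1] + (seq[1] - seq[0])

rng = random.Random(0)
seq, answer = make_puzzle(rng)
assert verify(seq, answer)                  # correct by construction
print(f"What number continues {seq}?", "->", answer)
```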
- FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving
FINEREASON is a logic-puzzle benchmark for evaluating large language models' reasoning capabilities. We introduce two tasks, state checking and state transition, for a comprehensive evaluation of how models assess the current situation and plan the next move; a toy sketch of both tasks follows this entry. We show that models trained on our state-checking and state-transition data demonstrate gains in math reasoning of up to 5.1% on GSM8K.
arXiv Detail & Related papers (2025-02-27T16:23:25Z)
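The FINEREASON entry above distinguishes state checking from state transition. The toy sliding-token game below (invented here for brevity; not one of FINEREASON's actual puzzles) shows the shape of the two queries.

```python
# "State checking": is this position well formed? "State transition":
# what legal moves are available? Toy 1-D sliding game, 0 is the blank.
State = tuple[int, ...]

def check_state(state: State) -> bool:
    """State checking: exactly one blank and no duplicate tokens."""
    return state.count(0) == 1 and len(set(state)) == len(state)

def transitions(state: State) -> list[State]:
    """State transition: slide either neighbor into the blank."""
    i = state.index(0)
    out = []
    for j in (i - 1, i + 1):
        if 0 <= j < len(state):
            s = list(state)
            s[i], s[j] = s[j], s[i]
            out.append(tuple(s))
    return out

print(check_state((1, 0, 2)))   # True
print(transitions((1, 0, 2)))   # [(0, 1, 2), (1, 2, 0)]
```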
- Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models
Multimodal Large Language Models (MLLMs) are predominantly trained and tested on consistent visual-textual inputs. We propose the Multimodal Inconsistency Reasoning benchmark to assess MLLMs' ability to detect and reason about semantic mismatches. We evaluate six state-of-the-art MLLMs, showing that models with dedicated multimodal reasoning capabilities, such as o1, substantially outperform their counterparts.
arXiv Detail & Related papers (2025-02-22T01:52:37Z)
- WILT: A Multi-Turn, Memorization-Robust Inductive Logic Benchmark for LLMs
We introduce the Wason Inductive Logic Test (WILT), a simple yet challenging multi-turn reasoning benchmark designed to resist memorization.
Our findings reveal that LLMs struggle with this task, exhibiting distinct strengths and weaknesses.
Despite these variations, the best-performing model achieves only 28% accuracy, highlighting a significant gap in LLM performance on complex multi-turn reasoning tasks.
arXiv Detail & Related papers (2024-10-14T18:29:13Z)
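WILT builds on the Wason 2-4-6 paradigm: over multiple turns the model proposes test cases, observes boolean feedback, and must induce the hidden rule, a format that naturally resists memorization. A minimal sketch, assuming the classic "strictly increasing" rule (WILT's actual rules and interface are not specified here):

```python
# Hidden-rule oracle in the Wason 2-4-6 style; illustrative only.
def hidden_rule(triple: tuple[int, int, int]) -> bool:
    a, b, c = triple
    return a < b < c            # the classic "strictly increasing" rule

# A strong solver probes with potential counterexamples rather than
# only confirming its first hypothesis:
for probe in [(2, 4, 6), (1, 2, 3), (6, 4, 2), (2, 4, 100)]:
    print(probe, "->", hidden_rule(probe))
```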
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.