PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies
- URL: http://arxiv.org/abs/2510.16505v2
- Date: Tue, 21 Oct 2025 12:52:54 GMT
- Title: PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies
- Authors: Lukas Selch, Yufang Hou, M. Jehanzeb Mirza, Sivan Doveh, James Glass, Rogerio Feris, Wei Lin
- Abstract summary: We introduce PRISMM-Bench, the first benchmark grounded in real reviewer-flagged inconsistencies in scientific papers. We design three tasks, namely inconsistency identification, remedy and pair matching, which assess a model's capacity to detect, correct, and reason over inconsistencies. We benchmark 21 leading LMMs, including large open-weight models (GLM-4.5V 106B, InternVL3 78B) and proprietary models (Gemini 2.5 Pro, GPT-5 with high reasoning).
- Score: 16.537126902822127
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Multimodal Models (LMMs) are increasingly applied to scientific research, yet it remains unclear whether they can reliably understand and reason over the multimodal complexity of papers. A central challenge lies in detecting and resolving inconsistencies across text, figures, tables, and equations, issues that are often subtle, domain-specific, and ultimately undermine clarity, reproducibility, and trust. Existing benchmarks overlook this issue, either isolating single modalities or relying on synthetic errors that fail to capture real-world complexity. We introduce PRISMM-Bench (Peer-Review-sourced Inconsistency Set for Multimodal Models), the first benchmark grounded in real reviewer-flagged inconsistencies in scientific papers. Through a multi-stage pipeline of review mining, LLM-assisted filtering and human verification, we curate 262 inconsistencies from 242 papers. Based on this set, we design three tasks, namely inconsistency identification, remedy and pair matching, which assess a model's capacity to detect, correct, and reason over inconsistencies across different modalities. Furthermore, to address the notorious problem of choice-only shortcuts in multiple-choice evaluation, where models exploit answer patterns without truly understanding the question, we introduce structured JSON-based answer representations that minimize linguistic biases by reducing reliance on superficial stylistic cues. We benchmark 21 leading LMMs, including large open-weight models (GLM-4.5V 106B, InternVL3 78B) and proprietary models (Gemini 2.5 Pro, GPT-5 with high reasoning). Results reveal strikingly low performance (26.1-54.2%), underscoring the challenge of multimodal scientific reasoning and motivating progress towards trustworthy scientific assistants.
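The abstract does not spell out the JSON answer format, so the following is a minimal Python sketch of the general idea: each answer option is expressed as a typed JSON object and validated for required fields before scoring, so that options differ only in content rather than in prose style. All field names here (inconsistency_type, source_element, conflicting_element, explanation) are assumptions for illustration, not the benchmark's actual schema.

```python
import json

# Illustrative candidate answer for the inconsistency-identification task.
# Field names and values are hypothetical, not PRISMM-Bench's real schema.
candidate_answer = {
    "inconsistency_type": "figure_text_mismatch",
    "source_element": "Figure 3",
    "conflicting_element": "Section 4.2, paragraph 2",
    "explanation": "Accuracy reported in the text does not match the plotted value.",
}

REQUIRED_FIELDS = {"inconsistency_type", "source_element",
                   "conflicting_element", "explanation"}

def is_valid_answer(raw: str) -> bool:
    """Return True if the model output parses as JSON and carries all
    required fields, so scoring compares structured content rather than
    superficial stylistic cues such as answer length or phrasing."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and REQUIRED_FIELDS <= parsed.keys()

print(is_valid_answer(json.dumps(candidate_answer)))  # True
```

Funneling every option through one fixed schema like this is one plausible way to blunt choice-only shortcuts: a model can no longer pick the correct option by exploiting surface patterns, because all options share the same form.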
Related papers
- Multimodal Fact-Level Attribution for Verifiable Reasoning [80.60864342985748]
Multimodal large language models (MLLMs) are increasingly used for real-world tasks involving multi-step reasoning and long-form generation. Existing multimodal grounding benchmarks and evaluation methods fail to assess attribution in complex multimodal reasoning. We introduce MuRGAt, a benchmark for evaluating fact-level multimodal attribution in settings that require reasoning beyond direct observation.
arXiv Detail & Related papers (2026-02-12T03:10:02Z)
- PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection [33.98107848014898]
We introduce PRISM-Bench, a benchmark of puzzle-based visual challenges designed to evaluate multimodal large language models (MLLMs). Unlike prior evaluations that measure only final-answer accuracy, PRISM-Bench introduces a diagnostic task: given a visual puzzle and a step-by-step chain-of-thought containing exactly one error, models must identify the first incorrect step. The puzzles in PRISM-Bench require multi-step symbolic, geometric, and analogical reasoning, resisting shortcuts based on superficial pattern matching.
arXiv Detail & Related papers (2025-10-27T17:57:52Z)
- MR$^2$-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval [86.35779264575154]
Multimodal retrieval is becoming a crucial component of modern AI applications, yet its evaluation lags behind the demands of more realistic and challenging scenarios. We introduce MR$^2$-Bench, a reasoning-intensive benchmark for multimodal retrieval.
arXiv Detail & Related papers (2025-09-30T15:09:14Z) - Curse of Knowledge: When Complex Evaluation Context Benefits yet Biases LLM Judges [72.3356133063925]
The paradigm of large language models (LLMs) as judges has emerged as a scalable solution, yet prior work primarily focuses on simple settings. Our in-depth analysis offers crucial insights for improving the accuracy and verifiability of evaluation signals.
arXiv Detail & Related papers (2025-09-03T15:48:33Z)
We introduce the Generative Multimodal Process Reward Model (GM-PRM). Instead of a simple scalar score, GM-PRM provides a fine-grained, interpretable analysis of each reasoning step. We show that GM-PRM achieves state-of-the-art results on multiple multimodal math benchmarks.
arXiv Detail & Related papers (2025-08-06T05:10:29Z)
- CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources, augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
- MMLU-Reason: Benchmarking Multi-Task Multi-modal Language Understanding and Reasoning [40.55833679660528]
We introduce MMLU-Reason, a new benchmark designed to rigorously evaluate multi-modal reasoning with explicit thinking. MMLU-Reason comprises 1) a high-difficulty dataset of 1,083 questions spanning six diverse reasoning types with symbolic depth and multi-hop demands. Overall, MMLU-Reason offers a scalable foundation for evaluating, comparing, and improving the next generation of multi-modal reasoning systems.
arXiv Detail & Related papers (2025-05-22T09:41:55Z)
- Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models [26.17300490736624]
Multimodal Large Language Models (MLLMs) are predominantly trained and tested on consistent visual-textual inputs. We propose the Multimodal Inconsistency Reasoning (MMIR) benchmark to assess MLLMs' ability to detect and reason about semantic mismatches. We evaluate six state-of-the-art MLLMs, showing that models with dedicated multimodal reasoning capabilities, such as o1, substantially outperform their counterparts.
arXiv Detail & Related papers (2025-02-22T01:52:37Z)
- Benchmarking Gaslighting Negation Attacks Against Multimodal Large Language Models [45.63440666848143]
Multimodal Large Language Models (MLLMs) have exhibited remarkable advancements in integrating different modalities. Despite their success, MLLMs remain vulnerable to conversational adversarial inputs. We study gaslighting negation attacks: a phenomenon where models, despite initially providing correct answers, are persuaded by user-provided negations to reverse their outputs.
arXiv Detail & Related papers (2025-01-31T10:37:48Z)
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present MR-Ben, a process-based benchmark that demands meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
- Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs [78.31625291513589]
We argue that self-consistency is an important criterion for valid multi-step reasoning in tasks where the solution is composed of the answers to multiple sub-steps.
We propose two types of self-consistency that are particularly important for multi-step reasoning -- hypothetical consistency and compositional consistency.
We demonstrate that multiple variants of the GPT-3/-4 models exhibit poor consistency rates across both types of consistency on a variety of tasks.
arXiv Detail & Related papers (2023-05-23T17:25:59Z)