Fugu-MT Paper Translation (Abstract): PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies

Paper summary: PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies

arxiv url: http://arxiv.org/abs/2510.16505v2
Date: Tue, 21 Oct 2025 12:52:54 GMT
Status: Translation complete
Last updated in system: 2025-10-25 03:08:11.826089
Title: PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies
Title (reference translation): PRISMM-Bench: A Benchmark of Multimodal Inconsistencies Based on Peer Review
Authors: Lukas Selch, Yufang Hou, M. Jehanzeb Mirza, Sivan Doveh, James Glass, Rogerio Feris, Wei Lin
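Abstract summary: PRISMM-Bench is the first benchmark grounded in real reviewer-flagged inconsistencies in scientific papers. It defines three tasks, namely inconsistency identification, remedy, and pair matching, to assess a model's ability to detect, correct, and reason over inconsistencies. The authors benchmark 21 LMMs, including large open-weight models (GLM-4.5V 106B, InternVL3 78B) and proprietary models (Gemini 2.5 Pro, GPT-5).
Reference score (independently computed attention score): 16.537126902822127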
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Multimodal Models (LMMs) are increasingly applied to scientific research, yet it remains unclear whether they can reliably understand and reason over the multimodal complexity of papers. A central challenge lies in detecting and resolving inconsistencies across text, figures, tables, and equations, issues that are often subtle, domain-specific, and ultimately undermine clarity, reproducibility, and trust. Existing benchmarks overlook this issue, either isolating single modalities or relying on synthetic errors that fail to capture real-world complexity. We introduce PRISMM-Bench (Peer-Review-sourced Inconsistency Set for Multimodal Models), the first benchmark grounded in real reviewer-flagged inconsistencies in scientific papers. Through a multi-stage pipeline of review mining, LLM-assisted filtering and human verification, we curate 262 inconsistencies from 242 papers. Based on this set, we design three tasks, namely inconsistency identification, remedy and pair matching, which assess a model's capacity to detect, correct, and reason over inconsistencies across different modalities. Furthermore, to address the notorious problem of choice-only shortcuts in multiple-choice evaluation, where models exploit answer patterns without truly understanding the question, we further introduce structured JSON-based answer representations that minimize linguistic biases by reducing reliance on superficial stylistic cues. We benchmark 21 leading LMMs, including large open-weight models (GLM-4.5V 106B, InternVL3 78B) and proprietary models (Gemini 2.5 Pro, GPT-5 with high reasoning). Results reveal strikingly low performance (26.1-54.2%), underscoring the challenge of multimodal scientific reasoning and motivating progress towards trustworthy scientific assistants.
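Abstract (reference translation): Large multimodal models (LMMs) are increasingly applied to scientific research, but it is unclear whether they can reliably understand and reason over the multimodal complexity of papers. The central challenge is detecting and resolving inconsistencies across text, figures, tables, and equations; these issues are often subtle and domain-specific, and they ultimately undermine clarity, reproducibility, and trust. Existing benchmarks overlook this problem, either isolating single modalities or relying on synthetic errors that fail to capture real-world complexity. This paper introduces PRISMM-Bench (Peer-Review-sourced Inconsistency Set for Multimodal Models). Through a multi-stage pipeline of review mining, LLM-assisted filtering, and human verification, 262 inconsistencies were curated from 242 papers. Based on this set, three tasks are designed, namely inconsistency identification, remedy, and pair matching, which assess a model's ability to detect, correct, and reason over inconsistencies across different modalities. In addition, to address the notorious problem of choice-only shortcuts in multiple-choice evaluation, where models exploit answer patterns without truly understanding the question, the paper introduces structured JSON-based answer representations that minimize linguistic bias by reducing reliance on superficial stylistic cues. In total, 21 LMMs were benchmarked, including large open-weight models (GLM-4.5V 106B, InternVL3 78B) and proprietary models (Gemini 2.5 Pro, GPT-5). The results show strikingly low performance (26.1-54.2%), highlighting the challenge of multimodal scientific reasoning and motivating progress toward trustworthy scientific assistants.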

Related papers