Fugu-MT Paper Translation (Abstract): PRISM: A Multi-Dimensional Benchmark for Evaluating LLM Peer Reviewers

Paper summary: PRISM: A Multi-Dimensional Benchmark for Evaluating LLM Peer Reviewers

arxiv url: http://arxiv.org/abs/2605.26730v1
Date: Tue, 26 May 2026 09:06:27 GMT
Status: Translation complete
System update date: 2026-05-27 17:51:41.775906
Title: PRISM: A Multi-Dimensional Benchmark for Evaluating LLM Peer Reviewers
Title (reference translation): PRISM: A Multi-Dimensional Benchmark for Evaluating LLM Peer Reviewers
Authors: Ngoc Phan Phuoc Loc, Toan Huynh La Viet, Thanh Tran Khanh, Duy A Nguyen, Tuan Anh Nguyen Pham, Thanh Nguyen, Nitesh V. Chawla, Wray Buntine, Kok-Seng Wong, Khoa D. Doan, Binh T. Nguyen
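Abstract summary: PRISM is a benchmarking framework that evaluates review quality across four dimensions. We benchmark leading automated reviewer systems and human reviewers on a stratified corpus of reviews from ICLR, ICML, and NeurIPS. No single system matches the balanced performance of the human baseline across all dimensions at once.
Reference score (independently computed attention score): 30.106132038073138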
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The rapid growth in submissions to machine learning venues has strained the scientific peer-review system and intensified interest in LLM-based automated peer reviewers. However, how good these systems actually are, especially compared to human reviewers at catching scientific gaps, remains poorly understood. In this work, we introduce PRISM (Peer Review Intelligence via Structured Multi-dimensional assessment), a benchmarking framework that evaluates review quality across four dimensions: Depth of Analysis, Novelty Assessment, Flaw Identification & Major Issues Prioritization, and Multi-dimensional Constructiveness. Unlike most existing evaluations based on surface-level metrics like ROUGE and BLEU, or unconstrained LLM-as-a-judge prompting that conflates fluency with rigor, PRISM grounds each dimension in argument mining, retrieval-augmented verification, and consensus-based scoring. We apply PRISM to benchmark five leading automated reviewer systems and human reviewers on a stratified corpus of reviews from ICLR, ICML, and NeurIPS. The results reveal that LLMs can match or beat human reviewers on individual dimensions: comparable depth of analysis, stronger novelty verification, and highly accurate critique prioritization. However, no single system consistently matches the balanced performance of the human baseline across all dimensions at once. Each exhibits a distinct specialization profile with characteristic blind spots -- failure modes that aggregate metrics miss entirely. The implication is that LLM reviewers are best understood as targeted supplements to human review, effective within specific dimensions, but unreliable as standalone replacements. Our demo and key results can be found at https://khanhthanhdev.github.io/prism-page/.
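Abstract (reference translation): Submissions to machine learning venues have grown rapidly, straining the scientific peer-review system and raising interest in LLM-based automated peer-review systems. However, how good these systems actually are, especially compared with human reviewers at catching scientific gaps, is still not well understood. In this work, we introduce PRISM (Peer Review Intelligence via Structured Multi-dimensional Assessment), a benchmarking framework that evaluates review quality across four dimensions. Unlike existing evaluations based on surface-level metrics such as ROUGE and BLEU, or unconstrained LLM-as-a-judge prompting that conflates fluency with rigor, PRISM grounds each dimension in argument mining, retrieval-augmented verification, and consensus-based scoring. We apply PRISM to benchmark five leading automated reviewer systems and human reviewers on a stratified corpus of reviews from ICLR, ICML, and NeurIPS. The results show that LLMs can match or beat human reviewers on individual dimensions: comparable depth of analysis, stronger novelty verification, and highly accurate critique prioritization. However, no single system consistently matches the balanced performance of the human baseline across all dimensions. Each exhibits a distinct specialization profile with characteristic blind spots, which are failure modes that aggregate metrics miss entirely. LLM reviewers are best understood as targeted supplements to human review: effective within specific dimensions, but unreliable as standalone replacements. Our demo and key results can be found at https://khanhthanhdev.github.io/prism-page/.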

Related papers