Fugu-MT paper translation (abstract): PRAIB: Peer Review AI Benchmark of Behaviour of LLM-Assisted Reviewing

Paper overview: PRAIB: Peer Review AI Benchmark of Behaviour of LLM-Assisted Reviewing

arxiv url: http://arxiv.org/abs/2605.29815v1
Date: Thu, 28 May 2026 11:59:54 GMT
Status: translation complete
Last updated in system: 2026-05-30 02:45:56.216778
Title: PRAIB: Peer Review AI Benchmark of Behaviour of LLM-Assisted Reviewing
Authors: Krzysztof Żurawicki, Julia Farganus, Arkadiusz Gaweł, Mateusz Bystroński, Tomasz Jan Kajdanowicz
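Abstract summary: The Peer Review AI Benchmark (PRAIB) is a framework of thoroughly defined metrics that measure review specificity, style, and engagement behavior. The authors conduct a large-scale empirical study using a dataset of 11,000 reviews generated by five proprietary and open-source models for 1,000 ICLR and NeurIPS papers. The analysis finds that the generated reviews diverge significantly from the feedback of human reviewers.
Reference score (site's own attention metric): 0.0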
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The growing number of submitted papers has motivated the exploration of Large Language Models (LLMs) as a means to support and augment the peer review process, particularly in terms of improving its speed and scalability. Yet, it remains unknown whether LLMs engage with scientific manuscripts in the same manner as human reviewers, or whether they merely produce review-looking text. To address this, we introduce the Peer Review AI Benchmark (PRAIB), a novel framework comprising thoroughly defined metrics that measure review specificity, style, and behavior of engagement. To complement the PRAIB framework, we conduct a large-scale empirical study leveraging a dataset of 11,000 reviews generated by five proprietary and open-source models for 1,000 ICLR and NeurIPS papers. Spanning the 2021-2025 period, these machine-generated reviews are compared against original human feedback across diverse prompting strategies to identify systematic behavioral divergences. Our analysis reveals that the generated reviews diverge significantly from feedback provided by human reviewers: LLM ratings are less variable, positively biased, and overconfident, and their cross-reference patterns are model-dependent and distinct from human norms. Furthermore, when evaluated through PRAIB, we observe that LLMs tend to generate longer, more complex reviews, yet frequently overlook the atomic weaknesses noted by human reviewers. By characterizing where and how LLM reviewing behavior departs from human norms, PRAIB provides the community with a diagnostic tool for identifying which aspects of the review process LLMs can reliably support today and which require further development before deployment.
