Fugu-MT paper translation (abstract): PeerPrism: Peer Evaluation Expertise vs Review-writing AI

Paper summary: PeerPrism: Peer Evaluation Expertise vs Review-writing AI

arxiv url: http://arxiv.org/abs/2604.14513v1
Date: Thu, 16 Apr 2026 00:59:55 GMT
Status: Translation complete
System update date: 2026-04-17 21:29:31.664144
Title: PeerPrism: Peer Evaluation Expertise vs Review-writing AI
Authors: Soroush Sadeghian, Alireza Daqiq, Radin Cheraghi, Sajad Ebrahimi, Negar Arabzadeh, Ebrahim Bagheri
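Abstract summary: We introduce PeerPrism, a benchmark of 20,690 peer reviews. We benchmark state-of-the-art LLM text detection methods on PeerPrism. Our results show that current detection methods conflate surface realization with intellectual contribution.
Reference score (independently computed attention score): 12.533035088439975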
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Large Language Models (LLMs) are increasingly used in scientific peer review, assisting with drafting, rewriting, expansion, and refinement. However, existing peer-review LLM detection methods largely treat authorship as a binary problem (human vs. AI) without accounting for the hybrid nature of modern review workflows. In practice, evaluative ideas and surface realization may originate from different sources, creating a spectrum of human-AI collaboration. In this work, we introduce PeerPrism, a large-scale benchmark of 20,690 peer reviews explicitly designed to disentangle idea provenance from text provenance. We construct controlled generation regimes spanning fully human, fully synthetic, and multiple hybrid transformations. This design enables systematic evaluation of whether detectors identify the origin of the surface text or the origin of the evaluative reasoning. We benchmark state-of-the-art LLM text detection methods on PeerPrism. While several methods achieve high accuracy on the standard binary task (human vs. fully synthetic), their predictions diverge sharply under hybrid regimes. In particular, when ideas originate from humans but the surface text is AI-generated, detectors frequently disagree and produce contradictory classifications. Accompanied by stylometric and semantic analyses, our results show that current detection methods conflate surface realization with intellectual contribution. Overall, we demonstrate that LLM detection in peer review cannot be reduced to a binary attribution problem. Instead, authorship must be modeled as a multidimensional construct spanning semantic reasoning and stylistic realization. PeerPrism is the first benchmark evaluating human-AI collaboration in these settings. We release all code, data, prompts, and evaluation scripts to facilitate reproducible research at https://github.com/Reviewerly-Inc/PeerPrism.
