Fugu-MT Paper Translation (Abstract): NovBench: Evaluating Large Language Models on Academic Paper Novelty Assessment

Paper overview: NovBench: Evaluating Large Language Models on Academic Paper Novelty Assessment

arxiv url: http://arxiv.org/abs/2604.11543v1
Date: Mon, 13 Apr 2026 14:35:17 GMT
Status: translation complete
Last updated in system: 2026-04-14 20:13:16.611993
Title: NovBench: Evaluating Large Language Models on Academic Paper Novelty Assessment
Title (reference translation): NovBench: Evaluating Large Language Models through Academic Paper Novelty Assessment
Authors: Wenqing Wu, Yi Zhao, Yuzhuo Wang, Siyou Li, Juexi Shao, Yunfei Long, Chengzhi Zhang
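Abstract summary: NovBench is the first large-scale benchmark designed to evaluate the ability of large language models to generate novelty evaluations. NovBench comprises 1,684 paper-review pairs from a leading NLP conference, including novelty descriptions extracted from paper introductions and the corresponding expert-written novelty evaluations. A four-dimensional evaluation framework is proposed to assess the quality of LLM-generated novelty evaluations.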
Reference score (independently computed attention measure): 17.02516373665209
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Novelty is a core requirement in academic publishing and a central focus of peer review, yet the growing volume of submissions has placed increasing pressure on human reviewers. While large language models (LLMs), including those fine-tuned on peer review data, have shown promise in generating review comments, the absence of a dedicated benchmark has limited systematic evaluation of their ability to assess research novelty. To address this gap, we introduce NovBench, the first large-scale benchmark designed to evaluate LLMs' capability to generate novelty evaluations in support of human peer review. NovBench comprises 1,684 paper-review pairs from a leading NLP conference, including novelty descriptions extracted from paper introductions and corresponding expert-written novelty evaluations. We focus on both sources because the introduction provides a standardized and explicit articulation of novelty claims, while expert-written novelty evaluations constitute one of the current gold standards of human judgment. Furthermore, we propose a four-dimensional evaluation framework (including Relevance, Correctness, Coverage, and Clarity) to assess the quality of LLM-generated novelty evaluations. Extensive experiments on both general and specialized LLMs under different prompting strategies reveal that current models exhibit limited understanding of scientific novelty, and that fine-tuned models often suffer from instruction-following deficiencies. These findings underscore the need for targeted fine-tuning strategies that jointly improve novelty comprehension and instruction adherence.
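Abstract (reference translation): Novelty is a core requirement in academic publishing and a central focus of peer review, yet the growing volume of submissions has put increasing pressure on human reviewers. Large language models (LLMs), including those fine-tuned on peer review data, have shown promise in generating review comments, but the lack of a dedicated benchmark has limited systematic evaluation of their ability to assess research novelty. To address this gap, we introduce NovBench, the first large-scale benchmark designed to evaluate LLMs' ability to generate novelty evaluations that support human peer review. NovBench consists of 1,684 paper-review pairs from a leading NLP conference, including novelty descriptions extracted from paper introductions and the corresponding expert-written novelty evaluations. We focus on both sources because the introduction provides a standardized, explicit statement of novelty claims, while expert-written novelty evaluations are one of the current gold standards of human judgment. In addition, we propose a four-dimensional evaluation framework (Relevance, Correctness, Coverage, and Clarity) to assess the quality of LLM-generated novelty evaluations. Extensive experiments on general and specialized LLMs under different prompting strategies show that current models have a limited understanding of scientific novelty and that fine-tuned models often suffer from instruction-following deficiencies. These findings highlight the need for fine-tuning strategies that jointly improve novelty comprehension and instruction adherence.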
