Fugu-MT Paper Translation (Abstract): DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis

Paper overview: DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis

arxiv url: http://arxiv.org/abs/2508.20033v1
Date: Wed, 27 Aug 2025 16:36:34 GMT
Status: Translation complete
Last updated in system: 2025-08-28 19:07:41.710051
Title: DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis
Title (reference translation): DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis
Authors: Liana Patel, Negar Arabzadeh, Harshit Gupta, Ankita Sundar, Ion Stoica, Matei Zaharia, Carlos Guestrin
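Abstract summary: This paper introduces DeepScholar-bench, a live benchmark and holistic, automated evaluation framework. DeepScholar-bench draws queries from recent, high-quality ArXiv papers and focuses on a real research synthesis task. The authors also develop DeepScholar-base, a reference pipeline implemented efficiently using the LOTUS API.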
Reference score (independently computed attention measure): 52.636738269442766
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The ability to research and synthesize knowledge is central to human expertise and progress. An emerging class of systems promises these exciting capabilities through generative research synthesis, performing retrieval over the live web and synthesizing discovered sources into long-form, cited summaries. However, evaluating such systems remains an open challenge: existing question-answering benchmarks focus on short-form factual responses, while expert-curated datasets risk staleness and data contamination. Both fail to capture the complexity and evolving nature of real research synthesis tasks. In this work, we introduce DeepScholar-bench, a live benchmark and holistic, automated evaluation framework designed to evaluate generative research synthesis. DeepScholar-bench draws queries from recent, high-quality ArXiv papers and focuses on a real research synthesis task: generating the related work sections of a paper by retrieving, synthesizing, and citing prior research. Our evaluation framework holistically assesses performance across three key dimensions: knowledge synthesis, retrieval quality, and verifiability. We also develop DeepScholar-base, a reference pipeline implemented efficiently using the LOTUS API. Using the DeepScholar-bench framework, we perform a systematic evaluation of prior open-source systems, search AIs, OpenAI's DeepResearch, and DeepScholar-base. We find that DeepScholar-base establishes a strong baseline, attaining performance competitive with or higher than each other method. We also find that DeepScholar-bench remains far from saturated, with no system exceeding a score of 19% across all metrics. These results underscore the difficulty of DeepScholar-bench, as well as its importance for progress towards AI systems capable of generative research synthesis. We make our code available at https://github.com/guestrin-lab/deepscholar-bench.
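Abstract (reference translation): The ability to research and synthesize knowledge is central to human expertise and progress. A new class of systems promises these exciting capabilities through generative research synthesis: retrieving over the live web and synthesizing the discovered sources into long-form, cited summaries. However, evaluating such systems remains an open challenge: existing question-answering benchmarks focus on short-form factual responses, while expert-curated datasets risk staleness and data contamination. Neither captures the complexity and evolving nature of real research synthesis tasks. This paper introduces DeepScholar-bench, a live benchmark and holistic, automated evaluation framework. DeepScholar-bench draws queries from recent, high-quality ArXiv papers and focuses on a real research synthesis task: generating a paper's related work section by retrieving, synthesizing, and citing prior research. The evaluation framework holistically assesses three key dimensions: knowledge synthesis, retrieval quality, and verifiability. The authors also develop DeepScholar-base, a reference pipeline implemented efficiently using the LOTUS API. Using the DeepScholar-bench framework, they systematically evaluate prior open-source systems, search AIs, OpenAI's DeepResearch, and DeepScholar-base. DeepScholar-base establishes a strong baseline, attaining performance competitive with or higher than each other method. DeepScholar-bench also remains far from saturated, with no system exceeding a score of 19% across all metrics. These results underscore the difficulty of DeepScholar-bench and its importance for progress toward AI systems capable of generative research synthesis. The code is available at https://github.com/guestrin-lab/deepscholar-bench.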

Related papers