Fugu-MT Paper Translation (Abstract): Understanding DeepResearch via Reports

Paper overview: Understanding DeepResearch via Reports

arxiv url: http://arxiv.org/abs/2510.07861v1
Date: Thu, 09 Oct 2025 07:03:43 GMT
Status: Translation complete
Last updated in system: 2025-10-10 17:54:14.919144
Title: Understanding DeepResearch via Reports
Title (reference translation): Understanding DeepResearch through Reports
Authors: Tianyu Fan, Xinyao Niu, Yuxiang Zheng, Fengji Zhang, Chengen Huang, Bei Chen, Junyang Lin, Chao Huang
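Abstract summary: DeepResearch agents represent a transformative AI paradigm that conducts expert-level research through sophisticated reasoning and multi-tool integration. Evaluating these systems remains critically challenging because research scenarios are open-ended and existing benchmarks focus on isolated capabilities. DeepResearch-ReportEval is a comprehensive framework for assessing DeepResearch systems through their most representative outputs, research reports.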
Reference score (independently computed attention level): 41.60038455664918
License: http://creativecommons.org/licenses/by/4.0/
Abstract: DeepResearch agents represent a transformative AI paradigm, conducting expert-level research through sophisticated reasoning and multi-tool integration. However, evaluating these systems remains critically challenging due to open-ended research scenarios and existing benchmarks that focus on isolated capabilities rather than holistic performance. Unlike traditional LLM tasks, DeepResearch systems must synthesize diverse sources, generate insights, and present coherent findings, which are capabilities that resist simple verification. To address this gap, we introduce DeepResearch-ReportEval, a comprehensive framework designed to assess DeepResearch systems through their most representative outputs: research reports. Our approach systematically measures three dimensions: quality, redundancy, and factuality, using an innovative LLM-as-a-Judge methodology achieving strong expert concordance. We contribute a standardized benchmark of 100 curated queries spanning 12 real-world categories, enabling systematic capability comparison. Our evaluation of four leading commercial systems reveals distinct design philosophies and performance trade-offs, establishing foundational insights as DeepResearch evolves from information assistants toward intelligent research partners. Source code and data are available at: https://github.com/HKUDS/DeepResearch-Eval.
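Abstract (reference translation): DeepResearch agents represent a transformative AI paradigm that carries out expert-level research through sophisticated reasoning and multi-tool integration. Evaluating these systems nevertheless remains a critical challenge, both because research scenarios are open-ended and because existing benchmarks focus on isolated capabilities rather than holistic performance. Unlike traditional LLM tasks, DeepResearch systems must synthesize diverse sources, generate insights, and present coherent findings, all capabilities that resist simple verification. To address this gap, the authors introduce DeepResearch-ReportEval, a comprehensive framework that assesses DeepResearch systems through their most representative outputs: research reports. The approach systematically measures three dimensions (quality, redundancy, and factuality) using an LLM-as-a-Judge methodology that achieves strong concordance with experts. The authors contribute a standardized benchmark of 100 curated queries spanning 12 real-world categories, enabling systematic capability comparison. Their evaluation of four leading commercial systems reveals distinct design philosophies and performance trade-offs, providing foundational insights as DeepResearch evolves from information assistants toward intelligent research partners. Source code and data are available at https://github.com/HKUDS/DeepResearch-Eval.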

Related papers