Fugu-MT Paper Translation (Abstract): Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

Paper summary: Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

arxiv url: http://arxiv.org/abs/2605.17554v1
Date: Sun, 17 May 2026 17:32:52 GMT
Status: Translation complete
Last updated in system: 2026-05-19 17:57:48.188072
Title: Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps
Authors: Tanmay Asthana, Aman Saksena, Divyansh Sahu
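Abstract summary: Frontier deep research agents (DRAs) plan a research task, synthesize across documents, and return a structured deliverable on demand. Existing benchmarks measure factual recall, single-hop QA, or generic agentic skill. The paper grades three frontier agents, namely Claude Opus 4.6 with web search, OpenAI o3-deep-research, and Google Gemini 3.1 Pro deep-research, on 42 prompts authored by subject-matter experts (SMEs).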
Reference score (independently computed attention measure): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Frontier deep research agents (DRAs) plan a research task, synthesize across documents, and return a structured deliverable on demand. They are being deployed in enterprise workflows faster than they are being evaluated. Existing benchmarks measure factual recall, single-hop QA, or generic agentic skill, missing the multi-document, decision-grade work DRAs are deployed to produce. We introduce a benchmark targeting the structured analytical deliverables that fill a management consultant's typical week. We grade three frontier agents, namely Claude Opus 4.6 with web search, OpenAI o3-deep-research, and Google Gemini 3.1 Pro deep-research, on 42 SME-authored prompts. Each of the 126 responses is scored on two layers: deterministic ground-truth verifiers (mean 13.8 per task) and a five-criterion 0-3 SME rubric, composed into a Verifier-Rubric Score (VRS) on 0-100. Most prompts embed cognitive traps that penalize surface-pattern matching. Acceptance under our joint threshold (rubric mean >= 2.5 and verifier rate >= 80%) is uniformly low: Gemini 21.4%, o3 9.5%, Claude 9.5%. Mean VRS scores agree with published rubric-based benchmarks (our top 62.6 vs. APEX-v1 64.2, ProfBench 65.9, ResearchRubrics < 68%), validating the rubric construct. ACCEPT rates sit below APEX-Agents' MC-segment Pass@1 band (12.3-22.7%) on dedicated DR agents; our floor is three points lower despite the harness advantage, opened by stricter conjunctive grading and trap design. Each agent fails distinctively. Claude produces the deliverable most reliably (4.5x the others' rate on file-required tasks) but carries the highest fabrication signature. o3 has the cleanest reasoning average yet drops required sections and propagates arithmetic errors. Gemini is bimodal, with the highest acceptance rate alongside the most zero-scored rubric cells.
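The joint acceptance rule stated in the abstract (rubric mean >= 2.5 and verifier rate >= 80%) can be illustrated with a minimal Python sketch. The function names, argument layout, and sample scores are hypothetical and are not taken from the paper; only the two thresholds come from the abstract. The abstract does not give the formula that composes the two layers into the VRS, so the sketch does not attempt it.

```python
from statistics import mean

# Thresholds quoted in the abstract.
RUBRIC_MEAN_MIN = 2.5      # mean of the five 0-3 rubric criteria
VERIFIER_RATE_MIN = 0.80   # share of deterministic verifiers passed


def is_accepted(rubric_scores, verifiers_passed, verifiers_total):
    """Conjunctive acceptance: both thresholds must hold (hypothetical helper)."""
    if verifiers_total <= 0:
        raise ValueError("verifiers_total must be positive")
    rubric_mean = mean(rubric_scores)
    verifier_rate = verifiers_passed / verifiers_total
    return rubric_mean >= RUBRIC_MEAN_MIN and verifier_rate >= VERIFIER_RATE_MIN


def acceptance_rate(accepted_count, task_count):
    """Acceptance rate as a percentage, rounded to one decimal place."""
    return round(100 * accepted_count / task_count, 1)


if __name__ == "__main__":
    # Illustrative responses only; these scores are made up.
    print(is_accepted([3, 3, 2, 3, 2], 12, 14))  # both thresholds met
    print(is_accepted([3, 3, 3, 3, 3], 10, 14))  # verifier rate too low
    print(is_accepted([2, 2, 3, 2, 3], 14, 14))  # rubric mean too low
    print(acceptance_rate(9, 42), acceptance_rate(4, 42))
```

Because the rule is conjunctive, a response with a perfect rubric score is still rejected if it fails too many verifiers, and vice versa. The reported rates of 21.4% and 9.5% are arithmetically consistent with 9 and 4 accepted responses out of 42 tasks per agent; that count is an inference from the percentages, not a figure stated in the abstract.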

Related papers