Fugu-MT Paper Translation (Abstract): MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome

Paper overview: MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome

arxiv url: http://arxiv.org/abs/2603.28407v1
Date: Mon, 30 Mar 2026 13:16:03 GMT
Status: Translation complete
Last updated in system: 2026-03-31 23:18:45.407044
Title: MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome
Authors: Fangda Ye, Yuxin Hu, Pengxiang Zhu, Yibo Li, Ziqi Jin, Yao Xiao, Yibo Wang, Lei Wang, Zhen Zhang, Lu Wang, Yue Deng, Bin Wang, Yifan Zhang, Liangcai Su, Xinyu Wang, He Zhao, Chen Wei, Qiang Ren, Bryan Hooi, An Bo, Shuicheng Yan, Lidong Bing
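Abstract summary: MiroEval is a benchmark and evaluation framework for deep research systems. The benchmark comprises 100 tasks grounded in real user needs. The proposed evaluation suite assesses deep research systems along three complementary dimensions.
Reference score (independently computed attention metric): 109.15093810810214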
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs. Existing benchmarks predominantly assess final reports using fixed rubrics, failing to evaluate the underlying research process. Most also offer limited multimodal coverage, rely on synthetic tasks that do not reflect real-world query complexity, and cannot be refreshed as knowledge evolves. To address these gaps, we introduce MiroEval, a benchmark and evaluation framework for deep research systems. The benchmark comprises 100 tasks (70 text-only, 30 multimodal), all grounded in real user needs and constructed via a dual-path pipeline that supports periodic updates, enabling a live and evolving setting. The proposed evaluation suite assesses deep research systems along three complementary dimensions: adaptive synthesis quality evaluation with task-specific rubrics; agentic factuality verification via active retrieval and reasoning over both web sources and multimodal attachments; and process-centric evaluation that audits how the system searches, reasons, and refines throughout its investigation. Evaluation across 13 systems yields three principal findings: the three evaluation dimensions capture complementary aspects of system capability, with each revealing distinct strengths and weaknesses across systems; process quality serves as a reliable predictor of overall outcome while revealing weaknesses invisible to output-level metrics; and multimodal tasks pose substantially greater challenges, with most systems declining by 3 to 10 points. The MiroThinker series achieves the most balanced performance, with MiroThinker-H1 ranking highest overall in both settings. Human verification and robustness results confirm the reliability of the benchmark and evaluation framework. MiroEval provides a holistic diagnostic tool for the next generation of deep research agents.

Related papers