Fugu-MT Paper Translation (Abstract): Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation

Paper overview: Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation

arxiv url: http://arxiv.org/abs/2605.29861v2
Date: Wed, 03 Jun 2026 08:03:26 GMT
Status: Translation complete
System update date: 2026-06-04 17:40:41.557252
Title: Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation
Authors: Chenghao Zhang, Guanting Dong, Yufan Liu, Tong Zhao, Xiaoxi Li, Zhicheng Dou
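Abstract summary: We propose Ptah, a multi-agent harness for interleaved report generation. Ptah orchestrates the lifecycle from user query to rendered web report through planning, research, and writing stages. A verifier agent serves as the harness's acceptance function, enforcing factual grounding, citation fidelity, and cross-modal consistency throughout the workflow.
Reference score (independently computed attention measure): 74.0621258662676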
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, which synthesizes scattered evidence into long-form reports. However, verifiable multimodal deep research remains challenging due to open-ended synthesis without deterministic ground truth and the need to interleave textual arguments with visual evidence. We propose Ptah, a multi-agent harness for interleaved report generation. Ptah orchestrates the lifecycle from user query to rendered web report through planning, research, and writing stages, where specialized agents construct visual-aware plans, collect claim-grounded evidence, maintain source-aligned images in a Visual Working Memory, and compose reports through declarative multimodal tool use. A verifier agent serves as the harness's acceptance function, enforcing factual grounding, citation fidelity, and cross-modal consistency throughout the workflow. We further introduce PtahEval, an evaluation protocol that augments existing benchmarks with image-level and presentation-level assessments. Experiments on deep research benchmarks show that Ptah produces more reliable, visually informative, and usable human-facing multimodal reports than strong baselines. Our code is released at https://github.com/SnowNation101/Ptah
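The abstract describes the verifier as an acceptance function over the planning, research, and writing stages. One way to picture that gating is the minimal Python sketch below. It is an illustration under stated assumptions, not Ptah's implementation: `Draft`, `verify`, `run_stage`, and the retry limit are hypothetical names, and the toy check covers only whether citations resolve to collected sources, not factual grounding or cross-modal consistency.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Draft:
    """Hypothetical stage output: text plus the IDs of the sources it cites."""
    text: str
    citations: List[str] = field(default_factory=list)


def verify(draft: Draft, known_sources: Set[str]) -> bool:
    """Toy acceptance check (assumption): the draft has text, cites at least
    one source, and every cited ID was actually collected."""
    return (
        bool(draft.text)
        and bool(draft.citations)
        and all(c in known_sources for c in draft.citations)
    )


def run_stage(
    produce: Callable[[int], Draft],
    known_sources: Set[str],
    max_attempts: int = 3,
) -> Draft:
    """Re-run a stage until the verifier accepts its output or attempts run out."""
    for attempt in range(max_attempts):
        draft = produce(attempt)
        if verify(draft, known_sources):
            return draft
    raise RuntimeError("stage output was never accepted by the verifier")


# Usage: a stand-in writer whose first draft cites an uncollected source.
sources = {"src-1", "src-2"}


def writer(attempt: int) -> Draft:
    if attempt == 0:
        return Draft("Claim A.", ["src-9"])  # rejected: src-9 was never collected
    return Draft("Claim A.", ["src-1"])      # accepted on the retry


report = run_stage(writer, sources)
```

Here the first draft is rejected and the second is returned. Treating acceptance as a plain boolean function keeps the same gate reusable across stages, which is the design idea this sketch is meant to convey.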


Related papers