Fugu-MT Paper Translation (Summary): DocLens : A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding

Paper summary: DocLens : A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding

arxiv url: http://arxiv.org/abs/2511.11552v1
Date: Fri, 14 Nov 2025 18:42:18 GMT
Status: Translation complete
Last updated in system: 2025-11-17 22:42:18.760311
Title: DocLens : A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding
Title (reference translation): DocLens : A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding
Authors: Dawei Zhu, Rui Meng, Jiefeng Chen, Sujian Li, Tomas Pfister, Jinsung Yoon
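Abstract summary: We propose DocLens, a tool-augmented multi-agent framework that effectively "zooms in" on evidence like a lens. It first navigates from the full document to specific visual elements on relevant pages, then uses a sampling-adjudication mechanism to generate a single, reliable answer. It achieves state-of-the-art performance on MMLongBench-Doc and FinRAGBench-V, surpassing even human experts.
Reference score (independently computed attention score): 59.4112754806335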
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Comprehending long visual documents, where information is distributed across extensive pages of text and visual elements, is a critical but challenging task for modern Vision-Language Models (VLMs). Existing approaches falter on a fundamental challenge: evidence localization. They struggle to retrieve relevant pages and overlook fine-grained details within visual elements, leading to limited performance and model hallucination. To address this, we propose DocLens, a tool-augmented multi-agent framework that effectively "zooms in" on evidence like a lens. It first navigates from the full document to specific visual elements on relevant pages, then employs a sampling-adjudication mechanism to generate a single, reliable answer. Paired with Gemini-2.5-Pro, DocLens achieves state-of-the-art performance on MMLongBench-Doc and FinRAGBench-V, surpassing even human experts. The framework's superiority is particularly evident on vision-centric and unanswerable queries, demonstrating the power of its enhanced localization capabilities.
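Abstract (reference translation): Comprehending long visual documents, in which information is spread across many pages of text and visual elements, is an important but difficult task for modern Vision-Language Models (VLMs). Existing approaches stumble on a fundamental challenge, evidence localization: they struggle to retrieve the relevant pages and overlook fine-grained details within visual elements, which leads to limited performance and model hallucination. To address this, we propose DocLens, a tool-augmented multi-agent framework that effectively "zooms in" on evidence like a lens. It first navigates from the full document to specific visual elements on relevant pages, then uses a sampling-adjudication mechanism to generate a single, reliable answer. Paired with Gemini-2.5-Pro, DocLens achieves state-of-the-art performance on MMLongBench-Doc and FinRAGBench-V, surpassing even human experts. The framework's advantage is especially clear on vision-centric and unanswerable queries, which demonstrates the strength of its enhanced localization capabilities.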
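
The abstract describes a two-stage flow: narrow the document down to evidence, then sample several candidate answers and adjudicate among them. The sketch below is only an illustration of that general shape and is not the authors' implementation. Every name in it (`Page`, `select_pages`, `collect_evidence`, `adjudicate`, `answer`, `toy_sampler`) is hypothetical, keyword overlap stands in for whatever page retrieval the paper uses, a deterministic stub stands in for the VLM, and a simple majority vote is swapped in for the adjudication step, whose actual design the abstract does not specify.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Page:
    number: int
    text: str
    # Stand-ins for cropped visual elements (tables, charts) on the page.
    elements: List[str] = field(default_factory=list)


def select_pages(pages: List[Page], question: str, top_k: int = 2) -> List[Page]:
    """Assumed navigator: rank pages by keyword overlap with the question."""
    terms = set(question.lower().split())
    ranked = sorted(pages, key=lambda p: -len(terms & set(p.text.lower().split())))
    return ranked[:top_k]


def collect_evidence(pages: List[Page]) -> List[str]:
    """Assumed zoom-in step: page text plus each visual element on the page."""
    evidence: List[str] = []
    for page in pages:
        evidence.append(page.text)
        evidence.extend(page.elements)
    return evidence


def adjudicate(candidates: List[str]) -> str:
    """Stand-in adjudicator: return the most frequent candidate answer."""
    return Counter(candidates).most_common(1)[0][0]


def answer(
    pages: List[Page],
    question: str,
    sampler: Callable[[str, List[str], int], str],
    n_samples: int = 5,
) -> str:
    """Localize evidence, sample several answers, then adjudicate to one."""
    evidence = collect_evidence(select_pages(pages, question))
    candidates = [sampler(question, evidence, i) for i in range(n_samples)]
    return adjudicate(candidates)


def toy_sampler(question: str, evidence: List[str], seed: int) -> str:
    """Deterministic stub for a VLM call; every third sample misreads the table."""
    if "table: revenue 2023 = 42" in evidence:
        return "41" if seed % 3 == 2 else "42"
    return "Not answerable"


demo_pages = [
    Page(1, "company overview and history"),
    Page(2, "annual revenue by year", ["table: revenue 2023 = 42"]),
    Page(3, "appendix notes"),
]

print(answer(demo_pages, "What was the revenue in 2023?", toy_sampler))
```

In this toy setup, sampling several answers lets the occasional misread be outvoted, and a document without the needed evidence falls through to "Not answerable". A real adjudicator could weigh the reasoning behind each candidate rather than count votes.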

Related papers