Fugu-MT Paper Translation (Abstract): ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding

Paper abstract: ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding

arxiv url: http://arxiv.org/abs/2605.13228v1
Date: Wed, 13 May 2026 09:19:22 GMT
Status: translation complete
Last updated in system: 2026-05-14 23:30:27.938729
Title: ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding
Title (reference translation): ReTool-Video: Recursive Tools for Video Agents via Meta-Augmented Tools
Authors: Xiao Liu, Nayu Liu, Junnan Zhu, Ruirui Chen, Guohui Xiang, Changjian Wang, Kaiwen Wei, Rongzhen Li, Jiang Zhong
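Abstract summary: Video understanding requires active evidence seeking, which motivates tool-augmented video agents for temporal reasoning, cross-modal understanding, and complex question answering. Existing video agents have improved video reasoning with retrieval, memory, frame inspection, and verifier tools, but they still face two limitations. This paper addresses these challenges with two complementary designs.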
Reference score (independently computed attention score): 30.50286834537385
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video understanding requires active evidence seeking, motivating tool-augmented video agents for temporal reasoning, cross-modal understanding, and complex question answering. Existing video agents have improved video reasoning with retrieval, memory, frame inspection, and verifier tools, but they still face two limitations: (1) a coarse tool space that lacks fine-grained operations for compositional reasoning; and (2) a flat action space that forces high-level video intents into primitive executable tool calls. In this paper, we address these challenges with two complementary designs. First, we construct a MetaAug-Video Tool Library (MVTL), an extensible tool library with 134 registered tools, including 26 base tools for general multimodal signal processing and 108 meta tools for filtering, aggregation, reranking, formatting, and other intermediate-result operations. MVTL supports dual-level access to both structured video information and raw modal evidence, enabling diverse video reasoning scenarios. Second, we propose ReTool-Video, a recursive tool-using method that grounds high-level video intents into executable tool chains. In ReTool-Video, matched actions are executed directly, while unmatched intents are delegated to a resolver for parameter repair, tool substitution, or decomposition. This allows abstract actions such as temporal merging, cross-modal verification, or repeated-event aggregation to be progressively translated into concrete multimodal operations at runtime. Experiments on MVBench, MLVU, and Video-MME w/o sub. show that ReTool-Video consistently outperforms strong baselines. Further analysis demonstrates that recursive grounding and fine-grained meta tools improve the stability and effectiveness of complex video understanding.
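Abstract (reference translation): Video understanding requires active evidence seeking, which motivates tool-augmented video agents for temporal reasoning, cross-modal understanding, and complex question answering. Existing video agents have improved video reasoning with retrieval, memory, frame inspection, and verifier tools, but they face two limitations: (1) a coarse tool space that lacks fine-grained operations for compositional reasoning, and (2) a flat action space that forces high-level video intents into primitive tool calls. This paper addresses these challenges with two complementary designs. First, the authors construct the MetaAug-Video Tool Library (MVTL), an extensible tool library with 134 registered tools, including 26 base tools for general multimodal signal processing and 108 meta tools for filtering, aggregation, reranking, formatting, and other intermediate-result operations. MVTL supports dual-level access to both structured video information and raw modal evidence, enabling diverse video reasoning scenarios. Second, they propose ReTool-Video, a recursive tool-using method that grounds high-level video intents into executable tool chains. In ReTool-Video, matched actions are executed directly, while unmatched intents are delegated to a resolver for parameter repair, tool substitution, or decomposition. This allows abstract actions such as temporal merging, cross-modal verification, or repeated-event aggregation to be translated progressively into concrete multimodal operations at runtime. Experiments on MVBench, MLVU, and Video-MME w/o sub. show that ReTool-Video consistently outperforms strong baselines. Further analysis shows that recursive grounding and fine-grained meta tools improve the stability and effectiveness of complex video understanding.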
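The recursive grounding described in the abstract (matched actions run directly; unmatched intents go to a resolver for parameter repair, tool substitution, or decomposition) can be illustrated with a small sketch. This is not the authors' implementation: the `Intent` class, the two toy tools, the rule-based `resolve` function, and the depth limit are all hypothetical names and choices made only for illustration, and the "video" data is a fixed list of time spans.

```python
import inspect
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

Span = Tuple[int, int]

@dataclass
class Intent:
    """A requested action: a name plus keyword arguments (hypothetical format)."""
    name: str
    args: Dict[str, Any] = field(default_factory=dict)

def detect_events(label: str) -> List[Span]:
    # Stand-in for a base tool: returns fixed (start, end) spans in seconds.
    return [(0, 4), (3, 9), (20, 25)]

def merge_spans(spans: List[Span]) -> List[Span]:
    # Stand-in for a meta tool: merges overlapping spans of an intermediate result.
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Toy registry; the paper's MVTL registers 134 tools, this sketch registers two.
TOOLS: Dict[str, Callable[..., Any]] = {
    "detect_events": detect_events,
    "merge_spans": merge_spans,
}

def matches(intent: Intent, prev: Any) -> bool:
    """True if the intent names a registered tool and its arguments bind."""
    tool = TOOLS.get(intent.name)
    if tool is None:
        return False
    positional = [] if prev is None else [prev]
    try:
        inspect.signature(tool).bind(*positional, **intent.args)
    except TypeError:
        return False
    return True

def resolve(intent: Intent) -> List[Intent]:
    """Hypothetical rule-based resolver that rewrites an unmatched intent."""
    if intent.name == "detect_events" and "query" in intent.args:
        # Parameter repair: rename a wrong keyword.
        return [Intent("detect_events", {"label": intent.args["query"]})]
    if intent.name == "find_events":
        # Tool substitution: map an unknown tool name to a registered one.
        return [Intent("detect_events", intent.args)]
    if intent.name == "temporal_merge":
        # Decomposition: expand an abstract action into a two-step chain.
        return [Intent("find_events", {"query": intent.args["label"]}),
                Intent("merge_spans")]
    raise ValueError(f"cannot ground intent: {intent.name}")

def ground(intent: Intent, prev: Any = None, depth: int = 0, max_depth: int = 4) -> Any:
    """Execute a matched intent, or resolve it and recurse on the rewritten steps."""
    if depth > max_depth:
        raise RecursionError("grounding did not converge")
    if matches(intent, prev):
        tool = TOOLS[intent.name]
        return tool(**intent.args) if prev is None else tool(prev, **intent.args)
    result = prev
    for step in resolve(intent):
        # Each step receives the previous step's output as its input.
        result = ground(step, result, depth + 1, max_depth)
    return result

result = ground(Intent("temporal_merge", {"label": "door opens"}))
print(result)  # [(0, 9), (20, 25)]
```

In this toy run the abstract `temporal_merge` intent is decomposed, its first step is substituted, and the substituted call has its parameter repaired before any tool executes. In the paper the resolver is part of the agent and not a lookup table; the rules here only make the control flow visible.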


Related papers