Fugu-MT Paper Translation (Abstract): A Skill-augmented Agentic Framework and Benchmark for Multi-Video Understanding

Paper overview: A Skill-augmented Agentic Framework and Benchmark for Multi-Video Understanding

arxiv url: http://arxiv.org/abs/2603.14733v1
Date: Mon, 16 Mar 2026 02:09:48 GMT
Status: translation complete
System update date: 2026-03-17 16:19:35.996902
Title: A Skill-augmented Agentic Framework and Benchmark for Multi-Video Understanding
Title (reference translation): A Skill-Augmented Agentic Framework and Benchmark for Multi-Video Understanding
Authors: Yue Zhang, Liqiang Jing, Jia Li, Yapeng Tian, Xinya Du, Yunhui Guo, Vibhav Gogate
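Abstract summary: Multimodal large language models have achieved strong performance in single-video understanding, but their ability to reason across multiple videos remains limited. Existing approaches concatenate multiple videos into a single input and perform direct inference, which introduces a training-inference mismatch. Current multi-video benchmarks mainly emphasize event-level comparison, leaving identity-level matching, fine-grained discrimination, and structured multi-step reasoning underexplored. The paper proposes SAMA, a skill-augmented agentic framework for multi-video understanding that integrates visual tools, task-specific skills, and a conflict-aware verification mechanism.
Reference score (attention score computed independently by the site): 69.31609753061137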
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal Large Language Models have achieved strong performance in single-video understanding, yet their ability to reason across multiple videos remains limited. Existing approaches typically concatenate multiple videos into a single input and perform direct inference, which introduces training-inference mismatch, information loss from frame compression, and a lack of explicit cross-video coordination. Meanwhile, current multi-video benchmarks primarily emphasize event-level comparison, leaving identity-level matching, fine-grained discrimination, and structured multi-step reasoning underexplored. To address these gaps, we introduce MVX-Bench, a Multi-Video Cross-Dimension Benchmark that reformulates 11 classical computer vision tasks into a unified multi-video question-answering framework, comprising 1,442 questions over 4,255 videos from diverse real-world datasets. We further propose SAMA, a Skill-Augmented Agentic Framework for Multi-Video Understanding, which integrates visual tools, task-specific skills, and a conflict-aware verification mechanism to enable iterative and structured reasoning. Experimental results show that SAMA outperforms strong open-source baselines and GPT on MVX-Bench, and ablations validate the effectiveness of skill design and conflict resolution.
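Abstract (reference translation): Multimodal large language models have achieved strong performance in single-video understanding, but their ability to reason across multiple videos remains limited. Existing approaches concatenate multiple videos into a single input and run direct inference, which introduces a training-inference mismatch, information loss from frame compression, and a lack of explicit cross-video coordination. Meanwhile, current multi-video benchmarks mainly emphasize event-level comparison, leaving identity-level matching, fine-grained discrimination, and structured multi-step reasoning underexplored. MVX-Bench is a Multi-Video Cross-Dimension Benchmark that reformulates 11 classical computer vision tasks into a unified multi-video question-answering framework, with 1,442 questions over 4,255 videos drawn from diverse real-world datasets. We further propose SAMA, a Skill-Augmented Agentic Framework for Multi-Video Understanding that integrates visual tools, task-specific skills, and a conflict-aware verification mechanism to enable iterative, structured reasoning. Experiments show that SAMA outperforms strong open-source baselines and GPT on MVX-Bench, and ablations validate the effectiveness of the skill design and conflict resolution.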

Related papers