Fugu-MT Paper Translation (Summary): Agentic Video Intelligence: A Flexible Framework for Advanced Video Exploration and Understanding

Paper summary: Agentic Video Intelligence: A Flexible Framework for Advanced Video Exploration and Understanding

arxiv url: http://arxiv.org/abs/2511.14446v1
Date: Tue, 18 Nov 2025 12:43:15 GMT
Status: Translation complete
Last updated in system: 2025-11-19 16:23:53.116353
Title: Agentic Video Intelligence: A Flexible Framework for Advanced Video Exploration and Understanding
Title (reference translation): Agentic Video Intelligence: A Flexible Framework for Advanced Video Exploration and Understanding
Authors: Hong Gao, Yiming Bao, Xuezhen Tu, Yutong Xu, Yue Jin, Yiyang Mu, Bin Zhong, Linan Yue, Min-Ling Zhang
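Abstract summary: This paper proposes Agentic Video Intelligence (AVI), a flexible, training-free framework that can mirror human video comprehension through system-level design and optimization. AVI introduces three key innovations: (1) a human-inspired three-phase reasoning process (Retrieve-Perceive-Review), (2) a structured video knowledge base organized through entity graphs, and (3) an open-source model ensemble combining reasoning LLMs with lightweight CV models and a VLM.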
Reference score (independently computed attention score): 43.785571875867
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video understanding requires not only visual recognition but also complex reasoning. While Vision-Language Models (VLMs) demonstrate impressive capabilities, they typically process videos largely in a single-pass manner with limited support for evidence revisit and iterative refinement. While recently emerging agent-based methods enable long-horizon reasoning, they either depend heavily on expensive proprietary models or require extensive agentic RL training. To overcome these limitations, we propose Agentic Video Intelligence (AVI), a flexible and training-free framework that can mirror human video comprehension through system-level design and optimization. AVI introduces three key innovations: (1) a human-inspired three-phase reasoning process (Retrieve-Perceive-Review) that ensures both sufficient global exploration and focused local analysis, (2) a structured video knowledge base organized through entity graphs, along with multi-granularity integrated tools, constituting the agent's interaction environment, and (3) an open-source model ensemble combining reasoning LLMs with lightweight base CV models and VLM, eliminating dependence on proprietary APIs or RL training. Experiments on LVBench, VideoMME-Long, LongVideoBench, and Charades-STA demonstrate that AVI achieves competitive performance while offering superior interpretability.
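Abstract (reference translation): Video understanding requires complex reasoning as well as visual recognition. Vision-Language Models (VLMs) show impressive capabilities, but they typically process a video in a largely single-pass manner, with limited support for revisiting evidence and refining an answer iteratively. Recently emerging agent-based methods enable long-horizon reasoning, but they either depend heavily on expensive proprietary models or require extensive agentic RL training. To overcome these limitations, the authors propose Agentic Video Intelligence (AVI), a flexible, training-free framework that can mirror human video comprehension through system-level design and optimization. AVI introduces three key innovations: (1) a human-inspired three-phase reasoning process (Retrieve-Perceive-Review) that ensures both sufficient global exploration and focused local analysis; (2) a structured video knowledge base organized through entity graphs which, together with multi-granularity integrated tools, constitutes the agent's interaction environment; and (3) an open-source model ensemble that combines reasoning LLMs with lightweight base CV models and a VLM, eliminating dependence on proprietary APIs or RL training. Experiments on LVBench, VideoMME-Long, LongVideoBench, and Charades-STA show that AVI achieves competitive performance while offering superior interpretability.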
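
Illustrative sketch (not from the paper): the abstract names a Retrieve-Perceive-Review process over an entity-graph knowledge base but gives no implementation details. The Python below is only one possible reading of that description. Every name in it (Entity, VideoKnowledgeBase, perceive, review, run_agent) is hypothetical, the keyword matching stands in for whatever retrieval tools AVI actually uses, and perceive is a stub where a CV model or VLM call would go.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds); an assumed representation


@dataclass
class Entity:
    """Hypothetical node of an entity graph: where it appears and what it relates to."""
    name: str
    segments: List[Segment]
    related: Set[str] = field(default_factory=set)


class VideoKnowledgeBase:
    """Toy stand-in for a structured video knowledge base."""

    def __init__(self) -> None:
        self.entities: Dict[str, Entity] = {}

    def add(self, name: str, segments: List[Segment], related=()) -> None:
        self.entities[name] = Entity(name, list(segments), set(related))

    def retrieve(self, query: str) -> List[Segment]:
        # Match entities named in the query, then expand to their graph neighbours.
        words = set(query.lower().split())
        names = {n for n in self.entities if n in words}
        names |= {r for n in list(names) for r in self.entities[n].related}
        found = {s for n in names if n in self.entities for s in self.entities[n].segments}
        return sorted(found)


def perceive(segment: Segment, query: str) -> str:
    # Stub: a real system would run a vision model on this segment.
    return f"observation for {segment[0]:.1f}-{segment[1]:.1f}s"


def review(evidence: list, min_evidence: int) -> bool:
    # Stub sufficiency check: stop once enough observations are collected.
    return len(evidence) >= min_evidence


def run_agent(kb: VideoKnowledgeBase, query: str, min_evidence: int = 2, max_steps: int = 5):
    candidates = kb.retrieve(query)                            # Phase 1: Retrieve
    evidence = []
    for segment in candidates[:max_steps]:
        evidence.append((segment, perceive(segment, query)))   # Phase 2: Perceive
        if review(evidence, min_evidence):                     # Phase 3: Review
            break
    return evidence


kb = VideoKnowledgeBase()
kb.add("dog", [(0.0, 5.0), (30.0, 35.0)], related={"ball"})
kb.add("ball", [(30.0, 35.0), (60.0, 62.0)])
evidence = run_agent(kb, "where is the dog")
```

In this toy run the query matches the entity "dog", the graph expansion adds "ball", and the loop stops after two segments because the stub review step is satisfied. A real review step would presumably judge whether the gathered evidence answers the question rather than count observations.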
