Fugu-MT Paper Translation (Abstract): Flow4Agent: Long-form Video Understanding via Motion Prior from Optical Flow

Paper abstract: Flow4Agent: Long-form Video Understanding via Motion Prior from Optical Flow

arxiv url: http://arxiv.org/abs/2510.05836v1
Date: Tue, 07 Oct 2025 12:01:57 GMT
Status: Translation complete
Last updated in system: 2025-10-08 17:57:08.235682
Title: Flow4Agent: Long-form Video Understanding via Motion Prior from Optical Flow
Title (reference translation): Flow4Agent: Long-form Video Understanding via Motion Priors from Optical Flow
Authors: Ruyang Liu, Shangkun Sun, Haoran Tang, Ge Li, Wei Gao
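Abstract summary: Flow4Agent is a novel framework that incorporates motion priors from optical flow to facilitate long video understanding. Through two core modules, Flow4Agent mitigates redundancy in long videos at both the temporal and spatial levels. Extensive experiments show that Flow4Agent outperforms existing methods across a wide range of video MLLM benchmarks.
Reference score (independently computed attention score): 28.538115156420645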
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Long-form video understanding has long been a challenging problem because of the significant redundancy in both temporal and spatial content. This challenge is further exacerbated by the limited context length of Multimodal Large Language Models (MLLMs). To address this issue, many previous works have attempted to extract key video information, where the "key" is typically semantic-aware and heavily dependent on the CLIP model as a prior. In this paper, we propose Flow4Agent, a novel framework that pioneers the use of motion priors from optical flow to facilitate LLM-based long video understanding. Flow4Agent mitigates the redundancy in long videos at both temporal and spatial levels through two core modules. Temporal Granularity Optimization (TGO) adaptively refines frame-level hierarchies: it first leverages coarse flow priors to group similar visual content and then applies semantic priors to filter out highly irrelevant scene information. Motion Token Pruning (MTP) further refines the intra-frame visual representations, pruning high-redundancy video tokens using fine-grained optical flow information. Extensive experiments demonstrate that Flow4Agent outperforms existing methods across a wide range of video MLLM benchmarks, especially for hour-level video understanding tasks, achieving 64.7% on Video-MME, 71.4% on MLVU and 60.4% on LongVideoBench.
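Abstract (reference translation): Because temporal and spatial content is highly redundant, long-form video understanding has always been a difficult problem. The limited context length of Multimodal Large Language Models (MLLMs) makes it harder still. To address this, many prior works have tried to extract key video information, where the "key" is usually semantic-aware and relies heavily on the CLIP model as a prior. This paper proposes Flow4Agent, a novel framework that pioneers the incorporation of motion priors from optical flow to facilitate LLM-based long video understanding. Flow4Agent mitigates redundancy in long videos at both the temporal and spatial levels through two core modules: Temporal Granularity Optimization (TGO) adaptively refines frame-level hierarchies, and Motion Token Pruning (MTP) further refines intra-frame visual representations by pruning high-redundancy video tokens using fine-grained optical flow information. Flow4Agent outperforms existing methods on video MLLM benchmarks, particularly on hour-level video understanding tasks, achieving 64.7% on Video-MME, 71.4% on MLVU, and 60.4% on LongVideoBench.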
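The abstract describes the two modules only at a high level, so the following is a rough illustrative sketch rather than the authors' implementation. It assumes that optical flow has already been reduced to scalar motion magnitudes (one per frame transition, and one per visual token), and that low-motion tokens are the redundant ones. The function names, the threshold, and the keep ratio are all hypothetical, and the semantic filtering step of TGO is not modeled.

```python
from typing import List, Sequence


def group_frames_by_motion(frame_motion: Sequence[float],
                           threshold: float) -> List[List[int]]:
    """Group consecutive frame indices into low-motion segments.

    frame_motion[i] is an assumed scalar: the mean optical-flow magnitude
    between frame i-1 and frame i (frame_motion[0] is ignored).
    A new group starts whenever the motion exceeds `threshold`.
    """
    if not frame_motion:
        return []
    groups = [[0]]
    for i in range(1, len(frame_motion)):
        if frame_motion[i] > threshold:
            groups.append([i])      # large motion: start a new segment
        else:
            groups[-1].append(i)    # small motion: same segment as before
    return groups


def prune_tokens_by_motion(token_motion: Sequence[float],
                           keep_ratio: float) -> List[int]:
    """Return the sorted indices of the tokens to keep.

    token_motion[j] is an assumed scalar: the flow magnitude over the image
    patch behind token j. Tokens with the largest motion are kept; treating
    low-motion tokens as redundant is an assumption of this sketch.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, int(len(token_motion) * keep_ratio))
    ranked = sorted(range(len(token_motion)),
                    key=lambda j: token_motion[j], reverse=True)
    return sorted(ranked[:n_keep])


if __name__ == "__main__":
    # Toy values, not taken from the paper.
    print(group_frames_by_motion([0.0, 0.1, 0.2, 3.0, 0.1, 2.5], threshold=1.0))
    print(prune_tokens_by_motion([0.1, 2.0, 0.0, 1.5, 0.3, 0.9], keep_ratio=0.5))
```

In practice the motion magnitudes would come from an optical flow estimator; how the paper computes coarse versus fine-grained flow is not stated in the abstract.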

Related papers