Fugu-MT Paper Translation (Abstract): Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders

Paper overview: Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders

arxiv url: http://arxiv.org/abs/2510.26027v1
Date: Wed, 29 Oct 2025 23:50:57 GMT
Status: Translation complete
Last updated in system: 2025-10-31 16:05:09.608215
Title: Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders
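Title (reference translation): Improving the Temporal Understanding of Video-LLMs with Stacked Temporal Attention in Vision Encoders
Authors: Ali Rasekh, Erfan Bagheri Soula, Omid Daliran, Simon Gottschalk, Mohsen Fayyaz
Abstract summary: This paper proposes a Video-LLM architecture that introduces stacked temporal attention modules directly within the vision encoder. The design builds temporal attention into the vision encoder, enabling the model to better capture the progression of actions and the relationships between frames. The results show that the approach significantly improves temporal reasoning and outperforms existing models on video question answering tasks.
Reference score (independently calculated attention score): 9.162827706080337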
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite significant advances in Multimodal Large Language Models (MLLMs), understanding complex temporal dynamics in videos remains a major challenge. Our experiments show that current Video Large Language Model (Video-LLM) architectures have critical limitations in temporal understanding, struggling with tasks that require detailed comprehension of action sequences and temporal progression. In this work, we propose a Video-LLM architecture that introduces stacked temporal attention modules directly within the vision encoder. This design incorporates a temporal attention in vision encoder, enabling the model to better capture the progression of actions and the relationships between frames before passing visual tokens to the LLM. Our results show that this approach significantly improves temporal reasoning and outperforms existing models in video question answering tasks, specifically in action recognition. We improve on benchmarks including VITATECS, MVBench, and Video-MME by up to +5.5%. By enhancing the vision encoder with temporal structure, we address a critical gap in video understanding for Video-LLMs. Project page and code are available at: https://alirasekh.github.io/STAVEQ2/.
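Abstract (reference translation): Despite significant advances in Multimodal Large Language Models (MLLMs), understanding complex temporal dynamics in videos remains a major challenge. Experiments show that current Video Large Language Model (Video-LLM) architectures are limited in temporal understanding and struggle with tasks that require detailed comprehension of action sequences and temporal progression. This work proposes a Video-LLM architecture that introduces stacked temporal attention modules directly within the vision encoder. The design builds temporal attention into the vision encoder, so the model can better capture the progression of actions and the relationships between frames before visual tokens are passed to the LLM. The proposed method significantly improves temporal reasoning and outperforms existing models on video question answering tasks, particularly action recognition. It improves results on benchmarks including VITATECS, MVBench, and Video-MME by up to 5.5%. Enhancing the vision encoder with temporal structure addresses a critical gap in video understanding for Video-LLMs. The project page and code are available at https://alirasekh.github.io/STAVEQ2/.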
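
Illustrative sketch (not the authors' code): the abstract describes temporal attention modules stacked inside the vision encoder so that tokens from different frames can exchange information before reaching the LLM. The Python below is a minimal, dependency-free illustration of that general idea only. Everything in it is an assumption made for illustration: the function names, the single attention head, the use of raw features as queries, keys, and values (no learned projections), and the choice to let each patch position attend only to the same position in other frames. The abstract does not specify any of these details.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_attention(tokens):
    """One temporal attention layer with a residual connection.

    tokens[t][p] is the feature vector (list of floats) of patch p in frame t.
    Each patch position attends only to the same position in all frames.
    Assumption: queries, keys, and values are the raw features
    (single head, no learned projections) to keep the sketch small.
    """
    num_frames = len(tokens)
    num_patches = len(tokens[0])
    dim = len(tokens[0][0])
    scale = 1.0 / math.sqrt(dim)
    out = [[None] * num_patches for _ in range(num_frames)]
    for p in range(num_patches):
        # Gather the same patch position across all frames.
        track = [tokens[t][p] for t in range(num_frames)]
        for t, query in enumerate(track):
            # Scaled dot-product scores of this frame against every frame.
            scores = [scale * sum(q * k for q, k in zip(query, key))
                      for key in track]
            weights = softmax(scores)
            # Weighted mix of the features along the time axis.
            mixed = [sum(w * v[d] for w, v in zip(weights, track))
                     for d in range(dim)]
            # Residual connection: original feature plus temporal mix.
            out[t][p] = [x + m for x, m in zip(query, mixed)]
    return out

def stacked_temporal_attention(tokens, num_layers=2):
    """Apply several temporal attention layers in sequence ("stacked")."""
    for _ in range(num_layers):
        tokens = temporal_attention(tokens)
    return tokens

# Toy input: 3 frames, 2 patches per frame, 2-dimensional features.
video = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
encoded = stacked_temporal_attention(video, num_layers=2)
```

The output keeps the input layout (frames, patches, feature dimension), so a block like this could in principle sit between existing encoder layers. A practical version would likely add learned projections, multiple heads, and normalization; those choices are not stated in the abstract and are left out here.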

Related papers