Fugu-MT Paper Translation (Abstract): Multimodal Long Video Modeling Based on Temporal Dynamic Context

Paper summary: Multimodal Long Video Modeling Based on Temporal Dynamic Context

arxiv url: http://arxiv.org/abs/2504.10443v1
Date: Mon, 14 Apr 2025 17:34:06 GMT
Status: Translation complete
Last updated in system: 2025-04-22 21:45:55.02797
Title: Multimodal Long Video Modeling Based on Temporal Dynamic Context
Title (reference translation): Multimodal long video modeling based on temporal dynamic context
Authors: Haoran Hao, Jiaming Han, Yiyuan Zhang, Xiangyu Yue
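Abstract summary: We propose a dynamic long video encoding method, named Temporal Dynamic Context (TDC), that exploits the temporal relationships between frames. The video is segmented into semantically consistent scenes based on inter-frame similarity, and each frame is encoded into tokens using visual-audio encoders. To handle extremely long videos, we propose a training-free chain-of-thought strategy that progressively extracts answers from multiple video segments.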
Reference score (independently computed attention score): 13.979661295432964
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in Large Language Models (LLMs) have led to significant breakthroughs in video understanding. However, existing models still struggle with long video processing due to the context length constraint of LLMs and the vast amount of information within the video. Although some recent methods are designed for long video understanding, they often lose crucial information during token compression and struggle with additional modalities such as audio. In this work, we propose a dynamic long video encoding method utilizing the temporal relationship between frames, named Temporal Dynamic Context (TDC). First, we segment the video into semantically consistent scenes based on inter-frame similarities, then encode each frame into tokens using visual-audio encoders. Second, we propose a novel temporal context compressor to reduce the number of tokens within each segment. Specifically, we employ a query-based Transformer to aggregate video, audio, and instruction text tokens into a limited set of temporal context tokens. Finally, we feed the static frame tokens and the temporal context tokens into the LLM for video understanding. Furthermore, to handle extremely long videos, we propose a training-free chain-of-thought strategy that progressively extracts answers from multiple video segments. These intermediate answers serve as part of the reasoning process and contribute to the final answer. We conduct extensive experiments on general video understanding and audio-video understanding benchmarks, where our method demonstrates strong performance. The code and models are available at https://github.com/Hoar012/TDC-Video.
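Abstract (reference translation): Recent advances in Large Language Models (LLMs) have brought major breakthroughs in video understanding. However, existing models still struggle with long video processing because of the context length constraint of LLMs and the vast amount of information in a video. Although recent methods have been designed for long video understanding, they often lose important information during token compression and struggle with additional modalities such as audio. In this work, we propose a dynamic long video encoding method, called Temporal Dynamic Context (TDC), that exploits the temporal relationships between frames. First, the video is segmented into semantically consistent scenes based on inter-frame similarity, and each frame is then encoded into tokens using visual-audio encoders. Next, we propose a new temporal context compressor that reduces the number of tokens within each segment. Specifically, a query-based Transformer aggregates video, audio, and instruction text tokens into a limited set of temporal context tokens. Finally, the static frame tokens and temporal context tokens are fed into the LLM for video understanding. Furthermore, to handle extremely long videos, we propose a training-free chain-of-thought strategy that progressively extracts answers from multiple video segments. These intermediate answers serve as part of the reasoning process and contribute to the final answer. We conduct extensive experiments on general video understanding and audio-video understanding benchmarks, where the method shows strong performance. The code and models are available at https://github.com/Hoar012/TDC-Video.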
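
Illustrative sketch (not from the paper): the abstract's first step, segmenting a video into scenes based on inter-frame similarity, could look roughly like the following. The abstract does not state the similarity measure, the feature extractor, or the cut rule, so cosine similarity between consecutive frames, the `threshold` value, and the function names `cosine_similarity` and `segment_scenes` are all assumptions made for illustration. The released TDC-Video code may work differently.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def segment_scenes(
    frame_features: Sequence[Sequence[float]],
    threshold: float = 0.8,  # hypothetical cut-off, not taken from the paper
) -> List[List[int]]:
    """Group consecutive frame indices into scenes.

    A new scene starts whenever the similarity between a frame and its
    predecessor falls below `threshold`.
    """
    if not frame_features:
        return []
    scenes: List[List[int]] = [[0]]
    for i in range(1, len(frame_features)):
        sim = cosine_similarity(frame_features[i - 1], frame_features[i])
        if sim < threshold:
            scenes.append([i])    # similarity dropped: open a new scene
        else:
            scenes[-1].append(i)  # still the same scene
    return scenes


if __name__ == "__main__":
    # Toy 2-D "frame features": two similar frames, then a clear change.
    features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(segment_scenes(features))
```

In this sketch each scene is a list of frame indices; in the pipeline the abstract describes, the tokens of the frames in one scene would then go to the temporal context compressor, which is not modeled here.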

Related papers