Fugu-MT Paper Translation (Abstract): Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism

Paper summary: Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism

arxiv url: http://arxiv.org/abs/2603.29252v1
Date: Tue, 31 Mar 2026 04:23:33 GMT
Status: Translation complete
System update date: 2026-04-01 15:25:03.147891
Title: Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism
Title (reference translation): Scaling Up the Long Video Understanding of Multimodal Large Language Models with a Visual Memory Mechanism
Authors: Tao Chen, Kun Zhang, Qiong Wu, Xiao Chen, Chao Chang, Xiaoshuai Sun, Yiyi Zhou, Rongrong Ji
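Abstract summary: Long video understanding is a key challenge that hinders the advancement of Multimodal Large Language Models (MLLMs). This paper studies the problem from the perspective of a visual memory mechanism and proposes a novel, training-free approach called Flexible Memory (FlexMem). In principle, FlexMem aims to mimic human video-watching behavior, i.e., continually watching video content and recalling the most relevant memory fragments to answer a question.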
Reference score (independently computed attention score): 82.67996027633986
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Long video understanding is a key challenge that plagues the advancement of Multimodal Large Language Models (MLLMs). In this paper, we study this problem from the perspective of the visual memory mechanism and propose a novel, training-free approach termed Flexible Memory (FlexMem). In principle, FlexMem aims to mimic human video-watching behavior, i.e., continually watching video content and recalling the most relevant memory fragments to answer the question. In this way, FlexMem can help MLLMs achieve video understanding of infinite length, unlike previous methods that process all video information at once and have an upper limit on input. Concretely, FlexMem first treats the visual KV caches as the memory source and realizes effective memory transfer and writing via a dual-pathway compression design. FlexMem then explores different memory reading strategies for diverse video understanding tasks, including the popular streaming one. To validate FlexMem, we apply it to two popular video-MLLMs and conduct extensive experiments on five long-video tasks and one streaming-video task. The experimental results show that, on a single 3090 GPU, FlexMem achieves clear improvements over existing efficient video understanding methods and processes more than 1k frames. It also helps the base MLLMs achieve comparable or even better performance than SOTA MLLMs, e.g., GPT-4o and Gemini-1.5 Pro, on some benchmarks.
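Abstract (reference translation): Long video understanding is an important challenge that hampers the progress of MLLMs. This paper examines the problem from the perspective of the visual memory mechanism and proposes a new training-free method, Flexible Memory (FlexMem). In principle, FlexMem aims to imitate how humans watch video: it continually watches the video content and recalls the most relevant memory fragments to answer a question. In this way, unlike previous methods that process all video information at once and have an input upper limit, FlexMem helps MLLMs achieve video understanding of infinite length. Specifically, FlexMem first regards the visual KV caches as the memory source and realizes effective memory transfer and writing through a dual-pathway compression design. FlexMem then also examines different memory reading strategies for a variety of video understanding tasks, including the popular streaming task. To validate FlexMem, it is applied to two popular video-MLLMs, and extensive experiments are conducted on five long-video tasks and one streaming-video task. The experimental results show that, on a single 3090 GPU, FlexMem achieves clear improvements over existing efficient video understanding methods and processes more than 1k frames, and that it helps the base MLLMs achieve performance comparable to or better than SOTA MLLMs (e.g., GPT-4o and Gemini-1.5 Pro) on some benchmarks.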
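The abstract describes watching a video continually, writing compressed memory, and recalling the most relevant fragments for a question. The sketch below is only a toy illustration of that write-then-recall loop and is not the paper's method: the `FragmentMemory` class, the mean-vector compression, and the cosine-similarity ranking are all assumptions made for illustration, and plain float lists stand in for the visual KV caches.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class FragmentMemory:
    """Hypothetical memory bank holding one summary vector per watched chunk."""

    def __init__(self) -> None:
        self.fragments: List[List[float]] = []

    def write(self, chunk_vectors: Sequence[Sequence[float]]) -> int:
        """Compress a chunk to its mean vector and store it.

        Mean pooling is a stand-in chosen for brevity; the abstract's
        dual-pathway compression is not specified here.
        """
        n = len(chunk_vectors)
        dim = len(chunk_vectors[0])
        mean = [sum(v[i] for v in chunk_vectors) / n for i in range(dim)]
        self.fragments.append(mean)
        return len(self.fragments) - 1

    def recall(self, query: Sequence[float], k: int = 2) -> List[int]:
        """Return indices of the k stored fragments most similar to the query."""
        ranked = sorted(
            range(len(self.fragments)),
            key=lambda i: cosine(query, self.fragments[i]),
            reverse=True,
        )
        return ranked[:k]


memory = FragmentMemory()
memory.write([[1.0, 0.0], [0.9, 0.1]])  # chunk 0: mostly along the first axis
memory.write([[0.0, 1.0], [0.1, 0.9]])  # chunk 1: mostly along the second axis
memory.write([[0.7, 0.7], [0.6, 0.8]])  # chunk 2: in between
top = memory.recall([0.0, 1.0], k=2)
print(top)  # [1, 2]
```

Because each chunk is reduced to a fixed-size summary as it arrives, the stored memory grows with the number of chunks rather than the number of frames, which is the general intuition behind processing a video incrementally instead of all at once.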

Related papers