Fugu-MT Paper Translation (Abstract): Episodic Memory Representation for Long-form Video Understanding

Paper abstract: Episodic Memory Representation for Long-form Video Understanding

arxiv url: http://arxiv.org/abs/2508.09486v1
Date: Wed, 13 Aug 2025 04:33:07 GMT
Status: Translation complete
Last updated in system: 2025-08-14 20:42:00.759078
Title: Episodic Memory Representation for Long-form Video Understanding
Title (reference translation): An Episodic Memory Representation Method for Long-form Video Understanding
Authors: Yun Wang, Long Zhang, Jingren Liu, Jiaqi Yan, Zhanjie Zhang, Jiahao Zheng, Xun Yang, Dapeng Wu, Xiangyu Chen, Xuelong Li
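Abstract summary: Video large language models excel at general video understanding but struggle with long-form videos because of context window limits. We introduce Video-EM, a training-free framework inspired by the principles of human memory. Video-EM achieves performance gains of 4-9 percent over the respective baselines while using fewer frames.
Reference score (independently computed attention score): 52.33907540905242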
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video Large Language Models (Video-LLMs) excel at general video understanding but struggle with long-form videos due to context window limits. Consequently, recent approaches focus on keyframe retrieval, condensing lengthy videos into a small set of informative frames. Despite their practicality, these methods simplify the problem to static text-image matching, overlooking spatiotemporal relationships crucial for capturing scene transitions and contextual continuity, and may yield redundant keyframes with limited information, diluting salient cues essential for accurate video question answering. To address these limitations, we introduce Video-EM, a training-free framework inspired by the principles of human episodic memory, designed to facilitate robust and contextually grounded reasoning. Rather than treating keyframes as isolated visual entities, Video-EM explicitly models them as temporally ordered episodic events, capturing both spatial relationships and temporal dynamics necessary for accurately reconstructing the underlying narrative. Furthermore, the framework leverages chain-of-thought (CoT) thinking with LLMs to iteratively identify a minimal yet highly informative subset of episodic memories, enabling efficient and accurate question answering by Video-LLMs. Extensive evaluations on the Video-MME, EgoSchema, HourVideo, and LVBench benchmarks confirm the superiority of Video-EM, which achieves highly competitive results with performance gains of 4-9 percent over respective baselines while utilizing fewer frames.
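Abstract (reference translation): Video Large Language Models (Video-LLMs) excel at general video understanding but struggle with long videos because of context window limits. As a result, recent approaches focus on keyframe retrieval, condensing long videos into a small set of informative frames. Despite their practicality, these methods reduce the problem to static text-image matching, overlook the spatiotemporal relationships that are essential for capturing scene transitions and contextual continuity, and can produce redundant keyframes with limited information, diluting the salient cues essential for accurate video question answering. To address these limitations, we introduce Video-EM, a training-free framework inspired by the principles of human episodic memory. Rather than treating keyframes as isolated visual entities, Video-EM explicitly models them as temporally ordered episodic events, capturing both the spatial relationships and the temporal dynamics needed to accurately reconstruct the underlying narrative. Furthermore, the framework uses chain-of-thought (CoT) reasoning with LLMs to iteratively identify a minimal yet highly informative subset of episodic memories, enabling efficient and accurate question answering by Video-LLMs. Extensive evaluations on the Video-MME, EgoSchema, HourVideo, and LVBench benchmarks confirm the superiority of Video-EM.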

Related papers