Fugu-MT Paper Translation (Abstract): KFFocus: Highlighting Keyframes for Enhanced Video Understanding

Paper overview: KFFocus: Highlighting Keyframes for Enhanced Video Understanding

arxiv url: http://arxiv.org/abs/2508.08989v1
Date: Tue, 12 Aug 2025 14:57:03 GMT
Status: Translation complete
Last updated in system: 2025-08-13 21:07:34.467769
Title: KFFocus: Highlighting Keyframes for Enhanced Video Understanding
Title (reference translation): KFFocus: Highlighting Keyframes for Advanced Video Understanding
Authors: Ming Nie, Chunwei Wang, Hang Xu, Li Zhang
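Abstract summary: KFFocus is a method that efficiently compresses video tokens and emphasizes the informative context present within video frames. By assigning varying condensation ratios to frames based on their contextual relevance, KFFocus reduces token redundancy while preserving informative content details. It also introduces a spatiotemporal modeling module that encodes both the temporal relationships between video frames and the spatial structure within each frame.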
Reference score (independently computed attention score): 33.69757683688046
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Recently, with the emergence of large language models, multimodal LLMs have demonstrated exceptional capabilities in image and video modalities. Despite advancements in video comprehension, the substantial computational demands of long video sequences lead current video LLMs (Vid-LLMs) to employ compression strategies at both the inter-frame level (e.g., uniform sampling of video frames) and intra-frame level (e.g., condensing all visual tokens of each frame into a limited number). However, this approach often neglects the uneven temporal distribution of critical information across frames, risking the omission of keyframes that contain essential temporal and semantic details. To tackle these challenges, we propose KFFocus, a method designed to efficiently compress video tokens and emphasize the informative context present within video frames. We substitute uniform sampling with a refined approach inspired by classic video compression principles to identify and capture keyframes based on their temporal redundancy. By assigning varying condensation ratios to frames based on their contextual relevance, KFFocus efficiently reduces token redundancy while preserving informative content details. Additionally, we introduce a spatiotemporal modeling module that encodes both the temporal relationships between video frames and the spatial structure within each frame, thus providing Vid-LLMs with a nuanced understanding of spatial-temporal dynamics. Extensive experiments on widely recognized video understanding benchmarks, especially long video scenarios, demonstrate that KFFocus significantly outperforms existing methods, achieving substantial computational efficiency and enhanced accuracy.
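Abstract (reference translation): In recent years, with the emergence of large language models, multimodal LLMs have demonstrated exceptional capabilities in the image and video modalities. Despite advances in video understanding, the substantial computational demands of long video sequences lead current video LLMs (Vid-LLMs) to adopt compression strategies at both the inter-frame level (e.g., uniform sampling of video frames) and the intra-frame level (e.g., condensing all visual tokens of each frame into a limited number). However, this approach often neglects the uneven temporal distribution of critical information across frames, risking the omission of keyframes that contain essential temporal and semantic details. To address these challenges, we propose KFFocus, which efficiently compresses video tokens and emphasizes the informative context present within video frames. In place of uniform sampling, we use a refined approach inspired by classic video compression principles that identifies and captures keyframes based on their temporal redundancy. By assigning varying condensation ratios to frames based on their contextual relevance, KFFocus efficiently reduces token redundancy while preserving informative content details. In addition, we introduce a spatiotemporal modeling module that encodes both the temporal relationships between video frames and the spatial structure within each frame, giving Vid-LLMs a nuanced understanding of spatial-temporal dynamics. Extensive experiments on widely recognized video understanding benchmarks, especially in long video scenarios, show that KFFocus significantly outperforms existing methods, achieving both computational efficiency and improved accuracy.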
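
The two compression ideas in the abstract (picking keyframes by temporal redundancy, then giving frames different condensation ratios) can be illustrated with a toy sketch. This is not the authors' implementation: the mean-absolute-difference score, the function names `select_keyframes`, `allocate_tokens`, and `pool_tokens`, and the token budgets used below are all assumptions made for illustration. The paper describes its keyframe selection as inspired by classic video compression, which this simple frame-difference score only loosely approximates.

```python
import numpy as np

def select_keyframes(frames, num_keyframes):
    """Return sorted indices of the frames that differ most from their predecessor.

    Assumption: "temporal redundancy" is approximated by the mean absolute
    pixel difference to the previous frame (low difference = redundant).
    """
    frames = np.asarray(frames, dtype=np.float32)
    # One score per frame transition, averaged over all pixels.
    diffs = np.abs(frames[1:] - frames[:-1]).reshape(len(frames) - 1, -1).mean(axis=1)
    # The first frame has no predecessor, so it is always kept.
    scores = np.concatenate([[np.inf], diffs])
    top = np.argsort(-scores, kind="stable")[:num_keyframes]
    return np.sort(top)

def allocate_tokens(num_frames, keyframe_idx, key_tokens, other_tokens):
    """Give keyframes a larger token budget than the remaining frames."""
    budget = np.full(num_frames, other_tokens, dtype=int)
    budget[keyframe_idx] = key_tokens
    return budget

def pool_tokens(tokens, target):
    """Average-pool an (N, D) token array down to at most `target` tokens."""
    target = min(target, len(tokens))
    groups = np.array_split(tokens, target, axis=0)
    return np.stack([g.mean(axis=0) for g in groups])

# Toy video: three black frames followed by three white frames (hypothetical data).
video = np.concatenate([np.zeros((3, 4, 4)), np.ones((3, 4, 4))])
keyframes = select_keyframes(video, num_keyframes=2)
budget = allocate_tokens(len(video), keyframes, key_tokens=16, other_tokens=4)

# Condense each frame's 16 placeholder tokens (dimension 8) to its budget.
frame_tokens = [np.random.rand(16, 8) for _ in video]
condensed = [pool_tokens(t, b) for t, b in zip(frame_tokens, budget)]
```

In this sketch the first frame and the frame at the black-to-white cut keep all 16 tokens, while the redundant frames are pooled down to 4 each, so the sequence passed to the language model shrinks without discarding the frames where the content changes.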

Related papers