Fugu-MT Paper Translation (Abstract): One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

Paper overview: One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

arxiv url: http://arxiv.org/abs/2604.14149v1
Date: Wed, 15 Apr 2026 17:59:52 GMT
Status: Translation complete
Last updated in system: 2026-04-16 20:38:32.675584
Title: One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
Title (reference translation): One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
Authors: Zheyu Zhang, Ziqi Pang, Shixing Chen, Xiang Hao, Vimal Bhat, Yu-Xiong Wang
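Abstract summary: Long video understanding is inherently challenging for vision-language models (VLMs) because of the extensive number of frames. The paper explores extreme video token compression towards one token per frame at the final LLM layer. Such compression enables the VLM to digest 2x-4x more frames with improved performance.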
Reference score (independently computed attention score): 51.08792182064565
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Long video understanding is inherently challenging for vision-language models (VLMs) because of the extensive number of frames. With each video frame typically expanding into tens or hundreds of tokens, the limited context length of large language models (LLMs) forces the VLMs to perceive the frames sparsely and lose temporal information. To address this, we explore extreme video token compression towards one token per frame at the final LLM layer. Our key insight is that heuristic-based compression, widely adopted by previous methods, is prone to information loss, and this necessitates supervising LLM layers into learnable and progressive modules for token-level compression (LP-Comp). Such compression enables our VLM to digest 2x-4x more frames with improved performance. To further increase the token efficiency, we investigate frame-level compression, which selects the frames most relevant to the queries via the internal attention scores of the LLM layers, named question-conditioned compression (QC-Comp). As a notable distinction from previous studies, we mitigate the position bias of LLM attention in long contexts, i.e., the over-concentration on the beginning and end of a sequence, by splitting long videos into short segments and employing local attention. Collectively, our combined token-level and frame-level compression leads to an extreme compression model for long video understanding, named \name, achieving a significantly larger compression ratio and enabling denser frame sampling. Our \name is finetuned from VideoChat-Flash with a data-efficient supervised compression tuning stage that only requires 2.5% of the supervised fine-tuning data, yet boosts the accuracy from 42.9% to 46.2% on LVBench and enhances multiple other long video benchmarks.
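Abstract (reference translation): Long video understanding is inherently difficult for vision-language models (VLMs) because of the enormous number of frames. Because each video frame typically expands into tens to hundreds of tokens, the limited context length of large language models (LLMs) forces VLMs to perceive frames sparsely and lose temporal information. To address this, we study extreme video token compression toward one token per frame at the final LLM layer. Our key insight is that the heuristic compression widely adopted by earlier methods tends to lose information, which makes it necessary to supervise the LLM layers so that they become learnable and progressive modules for token-level compression (LP-Comp). With this compression, the VLM can digest 2x-4x more frames while improving performance. To raise token efficiency further, we study frame-level compression, which uses the internal attention scores of the LLM layers to select the frames most relevant to the query; we call this question-conditioned compression (QC-Comp). In a notable departure from prior work, we split long videos into short segments and apply local attention, which mitigates the position bias of LLM attention in long contexts, that is, over-concentration on the beginning and end of the sequence. Together, token-level and frame-level compression yield an extreme compression model for long video understanding, named \name, with a significantly larger compression ratio and denser frame sampling. \name is fine-tuned from VideoChat-Flash in a data-efficient supervised compression tuning stage that needs only 2.5% of the supervised fine-tuning data, yet it raises accuracy on LVBench from 42.9% to 46.2% and improves several other long video benchmarks.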
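
The frame-level step described in the abstract (split the video into short segments, score frames against the question, keep the most relevant ones) can be illustrated with a small sketch. This is not the authors' implementation. The function name `select_frames`, the parameters `segment_len` and `keep_per_segment`, and the rule of keeping a fixed number of frames per segment are all assumptions made for illustration. The scores here are plain numbers standing in for the attention-derived relevance scores the abstract mentions.

```python
from typing import List


def select_frames(scores: List[float], segment_len: int, keep_per_segment: int) -> List[int]:
    """Return the sorted indices of frames kept by segment-wise top-k selection.

    scores[i] is a hypothetical question-to-frame relevance score for frame i.
    In the paper these would come from LLM attention computed locally within
    each segment; in this sketch they are simply given as input.
    """
    if segment_len <= 0 or keep_per_segment <= 0:
        raise ValueError("segment_len and keep_per_segment must be positive")

    kept: List[int] = []
    for start in range(0, len(scores), segment_len):
        # Frame indices belonging to the current segment (the last one may be shorter).
        segment = range(start, min(start + segment_len, len(scores)))
        # Rank frames only against other frames in the same segment.
        ranked = sorted(segment, key=lambda i: scores[i], reverse=True)
        kept.extend(ranked[:keep_per_segment])
    # Restore temporal order so the kept frames can be fed to the model in sequence.
    return sorted(kept)


# Nine frames, three segments of three frames, one frame kept per segment.
scores = [0.1, 0.9, 0.2, 0.3, 0.8, 0.1, 0.05, 0.6, 0.7]
print(select_frames(scores, segment_len=3, keep_per_segment=1))  # [1, 4, 8]
```

Ranking within each segment, rather than over the whole video at once, means every part of the video contributes frames. That is one simple way to keep the selection from collapsing onto the start and end of a long sequence; the actual quota rule used by the paper is not stated in the abstract.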

Related papers