Fugu-MT Paper Translation (Abstract): One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

Paper overview: One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

arxiv url: http://arxiv.org/abs/2604.14149v2
Date: Thu, 16 Apr 2026 15:48:38 GMT
Status: Translation complete
Last updated in system: 2026-04-17 16:09:14.213367
Title: One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
Title (reference translation): One Token for Each Highly Selective Frame: Aiming at Extreme Compression for Long Video Understanding
Authors: Zheyu Zhang, Ziqi Pang, Shixing Chen, Xiang Hao, Vimal Bhat, Yu-Xiong Wang
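Abstract summary: Long video understanding is inherently difficult for vision-language models (VLMs) because of the very large number of frames. Because each video frame typically expands into tens to hundreds of tokens, the limited context length of large language models (LLMs) forces VLMs to perceive frames sparsely and lose temporal information. This paper proposes XComp, an extreme compression model for long video understanding.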
Reference score (independently computed attention score): 51.08792182064565
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Long video understanding is inherently challenging for vision-language models (VLMs) because of the extensive number of frames. With each video frame typically expanding into tens or hundreds of tokens, the limited context length of large language models (LLMs) forces the VLMs to perceive the frames sparsely and lose temporal information. To address this, we explore extreme video token compression towards one token per frame at the final LLM layer. Our key insight is that heuristic-based compression, widely adopted by previous methods, is prone to information loss, and this necessitates supervising LLM layers into learnable and progressive modules for token-level compression (LP-Comp). Such compression enables our VLM to digest 2x-4x more frames with improved performance. To further increase the token efficiency, we investigate frame-level compression, which selects the frames most relevant to the queries via the internal attention scores of the LLM layers, named question-conditioned compression (QC-Comp). As a notable distinction from previous studies, we mitigate the position bias of LLM attention in long contexts, i.e., the over-concentration on the beginning and end of a sequence, by splitting long videos into short segments and employing local attention. Collectively, our combined token-level and frame-level compression leads to an extreme compression model for long video understanding, named XComp, achieving a significantly larger compression ratio and enabling denser frame sampling. Our XComp is finetuned from VideoChat-Flash with a data-efficient supervised compression tuning stage that only requires 2.5% of the supervised fine-tuning data, yet boosts the accuracy from 42.9% to 46.2% on LVBench and enhances multiple other long video benchmarks.
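Abstract (reference translation): Long video understanding is inherently difficult for vision-language models (VLMs) because of the very large number of frames. Because each video frame typically expands into tens to hundreds of tokens, the limited context length of large language models (LLMs) forces VLMs to perceive frames sparsely and lose temporal information. To address this, we study extreme video token compression toward one token per frame at the final LLM layer. Our key insight is that the heuristic compression widely adopted by earlier methods is prone to information loss, so the LLM layers need to be supervised into learnable, progressive modules for token-level compression (LP-Comp). This compression lets the VLM digest 2x-4x more frames while improving performance. To raise token efficiency further, we study frame-level compression, which uses the internal attention scores of the LLM layers to select the frames most relevant to the query; we call this question-conditioned compression (QC-Comp). In a notable departure from earlier work, we mitigate the position bias of LLM attention in long contexts, that is, its over-concentration on the beginning and end of a sequence, by splitting long videos into short segments and using local attention. Together, the token-level and frame-level compression yield XComp, an extreme compression model for long video understanding that achieves a significantly larger compression ratio and allows denser frame sampling. XComp is fine-tuned from VideoChat-Flash with a data-efficient supervised compression tuning stage that needs only 2.5% of the supervised fine-tuning data, yet it raises accuracy on LVBench from 42.9% to 46.2% and improves several other long video benchmarks.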
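
The frame-level idea in the abstract (keeping the frames most relevant to the query, while avoiding over-concentration on the start and end of a long sequence) can be illustrated with a toy sketch. This is not the authors' implementation: the function name `select_frames_by_segment`, its parameters, and the example scores are all hypothetical, and per-segment ranking is used here only as a simple stand-in for the segment splitting and local attention that the abstract mentions.

```python
from typing import List


def select_frames_by_segment(
    frame_scores: List[float], segment_len: int, keep_per_segment: int
) -> List[int]:
    """Return sorted indices of frames kept by segment-wise top-k selection.

    frame_scores: one relevance score per frame (a hypothetical stand-in for
        query-to-frame attention aggregated inside the LLM layers).
    segment_len: number of consecutive frames per short segment (assumed).
    keep_per_segment: number of frames retained from each segment (assumed).
    """
    if segment_len <= 0 or keep_per_segment <= 0:
        raise ValueError("segment_len and keep_per_segment must be positive")
    kept: List[int] = []
    for start in range(0, len(frame_scores), segment_len):
        segment = range(start, min(start + segment_len, len(frame_scores)))
        # Rank each frame only against frames in the same segment, so high
        # scores at the start or end of the video cannot crowd out the middle.
        ranked = sorted(segment, key=lambda i: frame_scores[i], reverse=True)
        kept.extend(ranked[:keep_per_segment])
    return sorted(kept)


# Made-up scores that are inflated at both ends of the sequence.
scores = [0.9, 0.8, 0.1, 0.2, 0.3, 0.6, 0.1, 0.2, 0.7, 0.85]
selected = select_frames_by_segment(scores, segment_len=4, keep_per_segment=1)
print(selected)  # [0, 5, 9]
```

With these made-up scores, a global top-3 would keep frames 0, 1, and 9, all at the edges of the video, whereas the per-segment ranking also retains frame 5 from the middle.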

Related papers