Fugu-MT Paper Translation (Abstract): Unified Spatiotemporal Token Compression for Video-LLMs at Ultra-Low Retention

Paper overview: Unified Spatiotemporal Token Compression for Video-LLMs at Ultra-Low Retention

arxiv url: http://arxiv.org/abs/2603.21957v1
Date: Mon, 23 Mar 2026 13:15:22 GMT
Status: Translation complete
System update date: 2026-03-24 19:11:39.685873
Title: Unified Spatiotemporal Token Compression for Video-LLMs at Ultra-Low Retention
Authors: Junhao Du, Jialong Xue, Anqi Li, Jincheng Dai, Guo Lu
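Abstract summary: Video large language models (Video-LLMs) face high computational costs due to large volumes of visual tokens. We propose a unified selection mechanism that combines attention weights with semantic similarity to select tokens globally. Unselected tokens are merged via clustering and refilled, preserving information integrity. Our unified spatiotemporal token compression strategy establishes the state of the art in video understanding under ultra-low token retention.
Reference score (site-computed interest level): 23.015486635502437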
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video large language models (Video-LLMs) face high computational costs due to large volumes of visual tokens. Existing token compression methods typically adopt a two-stage spatiotemporal compression strategy, relying on stage-specific metrics and an implicit assumption of spatiotemporal separability. Under extremely low retention ratios, however, such approaches often result in unbalanced allocation and loss of visual evidence essential for question answering. We reformulate token compression as a spatiotemporal allocation task within a global token retention pool. We propose a unified selection mechanism that integrates attention weights and semantic similarity to globally select tokens with high contribution and low redundancy. Unselected tokens are merged via clustering and refilled, preserving information integrity. Inside the LLM, we further introduce text-aware merging to perform secondary compression based on query relevance. Without requiring retraining, our method serves as a plug-and-play module compatible with existing Video-LLMs. Experiments show that retaining only about 2% of visual tokens preserves 90.1% of baseline performance across multiple benchmarks, while reducing FLOPs to roughly 2.6%. These benefits generalize across diverse backbones, decreasing end-to-end inference latency and memory consumption. Our unified spatiotemporal token compression strategy establishes the state-of-the-art in video understanding under ultra-low token retention.
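The selection-then-merge idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`cosine`, `compress_tokens`), the `redundancy_weight` parameter, the greedy scoring rule (attention minus a penalty for similarity to already-kept tokens), and the choice to merge each unselected token into its most similar kept token by averaging are all assumptions made here for illustration. The abstract does not specify how attention and similarity are combined, nor how clustering and refilling are carried out.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def compress_tokens(features, attention, keep, redundancy_weight=0.5):
    """Hypothetical sketch: pick `keep` tokens globally, then merge the rest.

    features  : list of token feature vectors, pooled over all frames
    attention : one importance score per token (assumed to be given)
    Returns (indices of kept tokens, one merged feature vector per kept token).
    """
    n = len(features)
    keep = min(keep, n)
    if keep < 1:
        raise ValueError("need at least one token to keep")

    selected = []
    # Highest similarity of each token to any token kept so far.
    max_sim = [0.0] * n
    for _ in range(keep):
        # Assumed scoring rule: high attention, low redundancy with kept tokens.
        best = max(
            (i for i in range(n) if i not in selected),
            key=lambda i: attention[i] - redundancy_weight * max_sim[i],
        )
        selected.append(best)
        for i in range(n):
            max_sim[i] = max(max_sim[i], cosine(features[i], features[best]))

    # Assumed merge rule: each unselected token joins its most similar kept token.
    clusters = {s: [features[s]] for s in selected}
    for i in range(n):
        if i in clusters:
            continue
        owner = max(selected, key=lambda s: cosine(features[i], features[s]))
        clusters[owner].append(features[i])

    # Average every cluster so the kept slot also carries the merged information.
    merged = [
        [sum(col) / len(clusters[s]) for col in zip(*clusters[s])]
        for s in selected
    ]
    return selected, merged
```

Because the pool is global rather than per frame, the budget is not split evenly across frames: a frame with many high-scoring, mutually dissimilar tokens can receive more slots than a redundant one. At the roughly 2% retention reported in the abstract, `keep` would be set to about 0.02 times the total number of visual tokens.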
