Fugu-MT Paper Translation (Abstract): OmniSparse: Training-Aware Fine-Grained Sparse Attention for Long-Video MLLMs

Paper abstract: OmniSparse: Training-Aware Fine-Grained Sparse Attention for Long-Video MLLMs

arxiv url: http://arxiv.org/abs/2511.12201v2
Date: Tue, 18 Nov 2025 23:07:41 GMT
Status: Translation complete
System update date: 2025-11-20 13:41:21.099012
Title: OmniSparse: Training-Aware Fine-Grained Sparse Attention for Long-Video MLLMs
Title (reference translation): OmniSparse: Training-Aware Fine-Grained Sparse Attention for Long-Video MLLMs
Authors: Feng Chen, Yefei He, Shaoxuan He, Yuanyu He, Jing Liu, Lequan Lin, Akide Liu, Zhaoyang Li, Jiyuan Zhang, Zhenbang Sun, Bohan Zhuang, Qi Wu
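Abstract summary: OmniSparse is a training-aware fine-grained sparse attention framework for long-video MLLMs. Experimental results show that OmniSparse matches the performance of full attention while achieving up to 2.7x speedup during prefill and 2.4x memory reduction during decoding.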
Reference score (independently calculated level of interest): 43.78743496579736
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing sparse attention methods primarily target inference-time acceleration by selecting critical tokens under predefined sparsity patterns. However, they often fail to bridge the training-inference gap and lack the capacity for fine-grained token selection across multiple dimensions such as queries, key-values (KV), and heads, leading to suboptimal performance and limited acceleration gains. In this paper, we introduce OmniSparse, a training-aware fine-grained sparse attention framework for long-video MLLMs, which operates in both training and inference with dynamic token budget allocation. Specifically, OmniSparse contains three adaptive and complementary mechanisms: (1) query selection via lazy-active classification, retaining active queries that capture broad semantic similarity while discarding most lazy ones that focus on limited local context and exhibit high functional redundancy; (2) KV selection with head-level dynamic budget allocation, where a shared budget is determined based on the flattest head and applied uniformly across all heads to ensure attention recall; and (3) KV cache slimming to reduce head-level redundancy by selectively fetching visual KV cache according to the head-level decoding query pattern. Experimental results show that OmniSparse matches the performance of full attention while achieving up to 2.7x speedup during prefill and 2.4x memory reduction during decoding.
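Abstract (reference translation): Existing sparse attention methods mainly target inference-time acceleration by selecting critical tokens under predefined sparsity patterns. However, they often fail to close the gap between training and inference, and they cannot select tokens at a fine granularity across multiple dimensions such as queries, key-values (KV), and heads, which results in suboptimal performance and limited acceleration. This paper introduces OmniSparse, a training-aware fine-grained sparse attention framework for long-video MLLMs that operates in both training and inference with dynamic token budget allocation. Specifically, OmniSparse comprises three adaptive and complementary mechanisms: (1) query selection via lazy-active classification, which retains active queries that capture broad semantic similarity and discards most lazy queries, which focus on limited local context and are highly functionally redundant; (2) KV selection with head-level dynamic budget allocation, in which a shared budget is determined from the flattest head and applied uniformly to all heads to ensure attention recall; and (3) KV cache slimming, which reduces head-level redundancy by selectively fetching the visual KV cache according to the head-level decoding query pattern. Experiments show that OmniSparse matches the performance of full attention while achieving up to 2.7x speedup during prefill and 2.4x memory reduction during decoding.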
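
The abstract describes mechanism (2) only at a high level, so the following is a minimal NumPy sketch of one possible reading, not the authors' implementation. It assumes that a head's required budget is the smallest number of top-scoring keys whose attention mass reaches a recall target, and that the "flattest" head is the one needing the most keys. The function names `shared_kv_budget` and `select_kv`, the `recall` parameter, and the toy attention matrix are all hypothetical.

```python
import numpy as np

def shared_kv_budget(attn, recall=0.9):
    """Return (shared budget, per-head budgets) for one query position.

    attn: array of shape (heads, keys); each row is assumed to be a
    softmax-normalized attention distribution (an assumption of this sketch).
    """
    # Sort each head's probabilities in descending order.
    sorted_probs = np.sort(attn, axis=-1)[:, ::-1]
    cum = np.cumsum(sorted_probs, axis=-1)
    # Smallest k per head whose top-k attention mass reaches the recall target.
    per_head_k = (cum < recall).sum(axis=-1) + 1
    per_head_k = np.minimum(per_head_k, attn.shape[-1])
    # The flattest head needs the most keys; its k is shared by every head.
    return int(per_head_k.max()), per_head_k

def select_kv(attn, budget):
    """Indices of the top-`budget` keys per head (unordered within the top set)."""
    return np.argpartition(-attn, budget - 1, axis=-1)[:, :budget]

# Toy example: head 0 is sharply peaked, head 1 is much flatter.
attn = np.array([
    [0.85, 0.05, 0.04, 0.03, 0.02, 0.01],
    [0.30, 0.30, 0.20, 0.10, 0.05, 0.05],
])
budget, per_head = shared_kv_budget(attn, recall=0.75)
kept = select_kv(attn, budget)
```

Under this reading, sharing the flattest head's budget trades some sparsity on peaked heads for a guarantee that every head reaches the recall target, while each head still chooses its own keys.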
