Fugu-MT Paper Translation (Abstract): MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention

Paper overview: MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention

arxiv url: http://arxiv.org/abs/2504.16083v1
Date: Tue, 22 Apr 2025 17:59:51 GMT
Status: Translation complete
Last updated in system: 2025-04-30 16:55:44.049953
Title: MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention
Title (reference translation): MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention
Authors: Yucheng Li, Huiqiang Jiang, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, Lili Qiu
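Abstract summary: MMInference is a dynamic sparse attention method that accelerates the pre-filling stage for long-context multi-modal inputs. It speeds up the pre-filling stage by up to 8.3x at 1M tokens while maintaining accuracy.
Reference score (independently computed attention score): 61.025422435235456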
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The integration of long-context capabilities with visual understanding unlocks unprecedented potential for Vision Language Models (VLMs). However, the quadratic attention complexity during the pre-filling phase remains a significant obstacle to real-world deployment. To overcome this limitation, we introduce MMInference (Multimodality Million tokens Inference), a dynamic sparse attention method that accelerates the pre-filling stage for long-context multi-modal inputs. First, our analysis reveals that the temporal and spatial locality of video input leads to a unique sparse pattern, the Grid pattern. Simultaneously, VLMs exhibit markedly different sparse distributions across different modalities. We introduce a permutation-based method to leverage the unique Grid pattern and handle modality boundary issues. By searching offline for the optimal sparse pattern of each head, MMInference constructs the sparse distribution dynamically based on the input. We also provide optimized GPU kernels for efficient sparse computations. Notably, MMInference integrates seamlessly into existing VLM pipelines without any model modifications or fine-tuning. Experiments on multi-modal benchmarks (Video QA, Captioning, VisionNIAH, and Mixed-Modality NIAH) with state-of-the-art long-context VLMs (LongVila, LlavaVideo, VideoChat-Flash, Qwen2.5-VL) show that MMInference accelerates the pre-filling stage by up to 8.3x at 1M tokens while maintaining accuracy. Our code is available at https://aka.ms/MMInference.
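Abstract (reference translation): Integrating long-context capabilities with visual understanding unlocks unprecedented potential for Vision Language Models (VLMs). However, the quadratic attention complexity of the pre-filling phase remains a major obstacle to real-world deployment. To overcome this limitation, we introduce MMInference (Multimodality Million tokens Inference), a dynamic sparse attention method that accelerates the pre-filling stage for long-context multi-modal inputs. First, we show that the temporal and spatial locality of video input gives rise to a unique sparse pattern, the Grid pattern. At the same time, VLMs exhibit markedly different sparse distributions across modalities. We propose a permutation-based method that exploits the Grid pattern and handles modality boundary issues. By searching offline for the optimal sparse pattern of each head, MMInference constructs the sparse distribution dynamically based on the input. We also provide GPU kernels optimized for efficient sparse computation. Notably, MMInference integrates seamlessly into existing VLM pipelines without model modifications or fine-tuning. In experiments on multi-modal benchmarks (Video QA, Captioning, VisionNIAH, and Mixed-Modality NIAH) with state-of-the-art long-context VLMs (LongVila, LlavaVideo, VideoChat-Flash, Qwen2.5-VL), MMInference accelerates the pre-filling stage by up to 8.3x at 1M tokens while maintaining accuracy. Our code is available at https://aka.ms/MMInference.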
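
The abstract says a permutation-based method is used to exploit the Grid pattern but gives no details. The toy sketch below only illustrates the general idea of why a permutation can help. It is not the paper's algorithm or kernel. It assumes a simplified grid mask in which a query attends to a key when both sit at the same offset within a frame of `stride` tokens, and it ignores causal masking and text tokens. All function names and sizes are hypothetical.

```python
def grid_mask(n, stride):
    """Toy grid mask (assumed simplification): query i attends to key j
    when both have the same offset inside a frame of `stride` tokens."""
    return [[(i - j) % stride == 0 for j in range(n)] for i in range(n)]


def grid_permutation(n, stride):
    """Order token indices by offset within a frame, then by position."""
    return sorted(range(n), key=lambda i: (i % stride, i))


def permute(mask, perm):
    """Apply the same permutation to rows (queries) and columns (keys)."""
    return [[mask[r][c] for c in perm] for r in perm]


def is_block_diagonal(mask, block):
    """True if the mask is exactly dense diagonal blocks of size `block`."""
    n = len(mask)
    return all(mask[i][j] == (i // block == j // block)
               for i in range(n) for j in range(n))


n, stride = 12, 4                      # hypothetical: 3 frames of 4 tokens
mask = grid_mask(n, stride)            # strided, scattered nonzeros
perm = grid_permutation(n, stride)
permuted = permute(mask, perm)         # same nonzeros, now contiguous blocks

for row in permuted:
    print("".join("#" if x else "." for x in row))
```

Under these assumptions the permuted mask consists of dense diagonal blocks of size `n // stride`. Contiguous blocks are the kind of layout that block-sparse GPU kernels generally handle well, while the nonzeros of the unpermuted mask are scattered at stride intervals.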
