Fugu-MT Paper Translation (Summary): SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference

Paper summary: SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference

arxiv url: http://arxiv.org/abs/2510.17777v1
Date: Mon, 20 Oct 2025 17:35:47 GMT
Status: Translation complete
Last updated in system: 2025-10-25 00:56:39.54606
Title: SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference
Title (reference translation): SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference
Authors: Samir Khaki, Junxian Guo, Jiaming Tang, Shang Yang, Yukang Chen, Konstantinos N. Plataniotis, Yao Lu, Song Han, Zhijian Liu
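Abstract summary: SparseVILA is a new paradigm for efficient VLM inference that decouples visual sparsity across the prefilling and decoding stages. Built on an AWQ-optimized inference pipeline, SparseVILA achieves up to 4.0 times faster prefilling, 2.5 times faster decoding, and an overall 2.6 times end-to-end speedup on long-context video tasks.
Reference score (independently computed attention metric): 49.84148668264725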
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision Language Models (VLMs) have rapidly advanced in integrating visual and textual reasoning, powering applications across high-resolution image understanding, long-video analysis, and multi-turn conversation. However, their scalability remains limited by the growing number of visual tokens that dominate inference latency. We present SparseVILA, a new paradigm for efficient VLM inference that decouples visual sparsity across the prefilling and decoding stages. SparseVILA distributes sparsity across stages by pruning redundant visual tokens during prefill and retrieving only query-relevant tokens during decoding. This decoupled design matches leading prefill pruning methods while preserving multi-turn fidelity by retaining most of the visual cache so that query-aware tokens can be retrieved at each conversation round. Built on an AWQ-optimized inference pipeline, SparseVILA achieves up to 4.0 times faster prefilling, 2.5 times faster decoding, and an overall 2.6 times end-to-end speedup on long-context video tasks -- while improving accuracy on document-understanding and reasoning tasks. By decoupling query-agnostic pruning and query-aware retrieval, SparseVILA establishes a new direction for efficient multimodal inference, offering a training-free, architecture-agnostic framework for accelerating large VLMs without sacrificing capability.
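Abstract (reference translation): Vision language models (VLMs) have advanced rapidly in integrating visual and textual reasoning, powering applications in high-resolution image understanding, long-video analysis, and multi-turn conversation. However, their scalability is limited by the growing number of visual tokens, which dominate inference latency. SparseVILA is a new paradigm for efficient VLM inference that decouples visual sparsity across the prefilling and decoding stages. SparseVILA distributes sparsity across stages by pruning redundant visual tokens during prefill and retrieving only query-relevant tokens during decoding. This decoupled design matches leading prefill pruning methods while preserving multi-turn fidelity: it retains most of the visual cache so that query-aware tokens can be retrieved at each conversation round. Built on an AWQ-optimized inference pipeline, SparseVILA achieves up to 4.0 times faster prefilling, 2.5 times faster decoding, and an overall 2.6 times end-to-end speedup on long-context video tasks, while improving accuracy on document-understanding and reasoning tasks. By decoupling query-agnostic pruning from query-aware retrieval, SparseVILA establishes a new direction for efficient multimodal inference.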
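The two-stage idea in the abstract (query-agnostic pruning at prefill, then query-aware retrieval at each decoding round) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names `prefill_prune` and `decode_retrieve`, the L2-norm salience score, the dot-product relevance score, and the `keep_ratio` and `budget` parameters are all assumptions made for illustration; the abstract does not specify how redundancy or relevance is measured.

```python
import math
import random


def top_k_indices(scores, k):
    """Return the indices of the k largest scores, in ascending index order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def prefill_prune(visual_tokens, keep_ratio):
    """Query-agnostic pruning: keep the tokens with the highest salience.

    Salience is the L2 norm of each token vector. This is a placeholder
    assumption, not the scoring rule used in the paper.
    """
    k = max(1, math.ceil(len(visual_tokens) * keep_ratio))
    salience = [math.sqrt(dot(t, t)) for t in visual_tokens]
    kept = top_k_indices(salience, k)
    return kept, [visual_tokens[i] for i in kept]


def decode_retrieve(cache, query, budget):
    """Query-aware retrieval: pick the cached tokens most similar to the query.

    Dot-product similarity is again an illustrative assumption.
    """
    scores = [dot(t, query) for t in cache]
    return top_k_indices(scores, min(budget, len(cache)))


if __name__ == "__main__":
    rng = random.Random(0)
    dim, n_tokens = 8, 64
    tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]

    # Prefill: prune once, independent of any question.
    kept_ids, cache = prefill_prune(tokens, keep_ratio=0.75)

    # Decoding: each conversation round retrieves its own subset of the cache.
    for turn in range(2):
        query = [rng.gauss(0, 1) for _ in range(dim)]
        picked = decode_retrieve(cache, query, budget=8)
        print(turn, len(cache), [kept_ids[i] for i in picked])
```

Because the cache survives across rounds, a later question can retrieve tokens that an earlier question ignored, which is the multi-turn property the abstract attributes to retaining most of the visual cache.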

Related papers