Fugu-MT Paper Translation (Abstract): VecInfer: Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization

Paper overview: VecInfer: Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization

arxiv url: http://arxiv.org/abs/2510.06175v1
Date: Tue, 07 Oct 2025 17:35:28 GMT
Status: Translation complete
System update date: 2025-10-08 17:57:08.391531
Title: VecInfer: Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization
Title (reference translation): VecInfer: Efficient LLM Inference Using a Low-Bit KV Cache via Outlier-Suppressed Vector Quantization
Authors: Dingyu Yao, Chenxu Yang, Zhengyang Tong, Zheng Lin, Wei Liu, Jian Luan, Weiping Wang
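Abstract summary: The key-value (KV) cache introduces memory overhead during large language model (LLM) inference. This paper proposes VecInfer, a novel VQ method for aggressive KV cache compression that still enables efficient inference. VecInfer consistently outperforms existing quantization baselines on both long-context understanding and mathematical reasoning tasks.
Reference score (independently computed attention score): 23.781285860723248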
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The Key-Value (KV) cache introduces substantial memory overhead during large language model (LLM) inference. Although existing vector quantization (VQ) methods reduce KV cache usage and provide flexible representational capacity across bit-widths, they suffer severe performance degradation at ultra-low bit-widths due to key cache outliers that hinder effective codebook utilization. To address this challenge, we propose VecInfer, a novel VQ method for aggressive KV cache compression while enabling efficient inference. By applying smooth and Hadamard transformations, VecInfer suppresses outliers in the key cache, enabling the codebook to comprehensively cover the original data distribution and thereby reducing quantization difficulty. To facilitate efficient deployment, we design an optimized CUDA kernel that fuses computation with dequantization to minimize memory access overhead. Extensive evaluations demonstrate that VecInfer consistently outperforms existing quantization baselines across both long-context understanding and mathematical reasoning tasks. With only 2-bit quantization, VecInfer achieves performance comparable to full precision, while delivering up to $\mathbf{2.7\times}$ speedup in large-batch self-attention computation and $\mathbf{8.3\times}$ reduction in single-batch end-to-end latency on Llama-3.1-8B with a 196k sequence length.
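To give a sense of scale for the memory overhead the abstract mentions, the following back-of-the-envelope sketch estimates the KV cache size at FP16 and at 2 bits per value. The layer count, KV head count, and head dimension are assumptions taken from the publicly released Llama-3.1-8B configuration, not from this page; "196k" is read as 196,000 tokens, and codebook and metadata storage are ignored.

```python
# Assumed Llama-3.1-8B attention configuration (public model config,
# not stated on this page): 32 layers, 8 KV heads (GQA), head dim 128.
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128


def kv_cache_bytes(seq_len, bits_per_value, batch_size=1):
    """Bytes needed for keys and values across all layers.

    Ignores codebook storage and any per-group metadata, so the
    low-bit figure is an optimistic lower bound.
    """
    # Factor 2 accounts for storing both the key and the value tensor.
    num_values = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * seq_len * batch_size
    return num_values * bits_per_value / 8


SEQ_LEN = 196_000  # assumption: "196k" interpreted as 196,000 tokens

fp16_bytes = kv_cache_bytes(SEQ_LEN, 16)
two_bit_bytes = kv_cache_bytes(SEQ_LEN, 2)

print(f"FP16  KV cache: {fp16_bytes / 2**30:.2f} GiB")
print(f"2-bit KV cache: {two_bit_bytes / 2**30:.2f} GiB")
print(f"Compression ratio: {fp16_bytes / two_bit_bytes:.0f}x")
```

Under these assumptions each token costs 128 KiB of KV cache at FP16, so the cache alone exceeds the weights of an 8B model at this sequence length, which is why bit-width matters so much here.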
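Abstract (reference translation): The key-value (KV) cache introduces substantial memory overhead during large language model (LLM) inference. Existing vector quantization (VQ) methods reduce KV cache usage and offer flexible representational capacity across bit-widths, but they suffer severe performance degradation at ultra-low bit-widths because outliers in the key cache hinder effective codebook utilization. To address this challenge, the paper proposes VecInfer, a novel VQ method for aggressive KV cache compression that still enables efficient inference. By applying smooth and Hadamard transformations, VecInfer suppresses outliers in the key cache, allowing the codebook to comprehensively cover the original data distribution and reducing quantization difficulty. To facilitate efficient deployment, the authors design an optimized CUDA kernel that fuses computation with dequantization to minimize memory access overhead. Extensive evaluations show that VecInfer consistently outperforms existing quantization baselines on both long-context understanding and mathematical reasoning tasks. With only 2-bit quantization, VecInfer achieves performance comparable to full precision while delivering up to $\mathbf{2.7\times}$ speedup in large-batch self-attention computation and $\mathbf{8.3\times}$ reduction in single-batch end-to-end latency on Llama-3.1-8B with a 196k sequence length.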
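The abstract does not spell out how the smooth and Hadamard transformations are defined, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a SmoothQuant-style per-channel scale (keys divided by the scale, queries multiplied by it) followed by an orthonormal Walsh-Hadamard rotation applied to both. The outlier channel index, its magnitude, and the choice of scale are hypothetical. The sketch checks two properties: the query-key dot product is unchanged, and the peak-to-RMS ratio of the keys drops.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform orthonormal
    return [x * scale for x in out]


def peak_to_rms(vec):
    """Largest magnitude divided by RMS; large values indicate outliers."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec))
    return max(abs(x) for x in vec) / rms


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


random.seed(0)
DIM = 128             # head dimension (assumption)
OUTLIER_CHANNEL = 5   # hypothetical channel carrying a large magnitude

# Synthetic keys: unit Gaussian noise plus one fixed-magnitude outlier channel.
keys = []
for _ in range(64):
    k = [random.gauss(0.0, 1.0) for _ in range(DIM)]
    k[OUTLIER_CHANNEL] = random.choice([-50.0, 50.0])
    keys.append(k)
query = [random.gauss(0.0, 1.0) for _ in range(DIM)]

# Assumed smoothing: per-channel max magnitude over the sample of keys.
scales = [max(abs(k[c]) for k in keys) for c in range(DIM)]

# Keys are divided by the scale, the query is multiplied by it,
# then both are rotated by the same orthonormal Hadamard matrix.
keys_t = [fwht([k[c] / scales[c] for c in range(DIM)]) for k in keys]
query_t = fwht([query[c] * scales[c] for c in range(DIM)])

raw_ratio = sum(peak_to_rms(k) for k in keys) / len(keys)
new_ratio = sum(peak_to_rms(k) for k in keys_t) / len(keys_t)
print(f"mean peak-to-RMS before: {raw_ratio:.2f}")
print(f"mean peak-to-RMS after:  {new_ratio:.2f}")

# Attention logits are preserved because the scale cancels and H is orthonormal.
max_err = max(abs(dot(query, k) - dot(query_t, kt)) for k, kt in zip(keys, keys_t))
print(f"max dot-product error: {max_err:.2e}")
```

A flatter distribution after the transforms is what lets a small codebook cover the data; the inverse scale and rotation are folded into the query side so that attention scores need no correction afterward.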

Related papers