Fugu-MT Paper Translation (Abstract): OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

Paper overview: OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

arxiv url: http://arxiv.org/abs/2605.19660v1
Date: Tue, 19 May 2026 10:53:03 GMT
Status: Translation complete
Last updated in system: 2026-05-20 15:03:09.28602
Title: OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond
Title (reference translation): OScaR: Occam's Razor for Extreme KV Cache Quantization in LLMs
Authors: Zunhai Su, Rui Yang, Chao Zhang, Yaxiu Liu, Yifan Zhang, Wei Wu, Jing Xiong, Dayou Du, Xialie Zhuang, Yulei Qian, Yuchen Xie, Yik-Chung Wu, Hongxia Yang, Ngai Wong
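Abstract summary: Multi-modal intelligence has made the Key-Value cache a dominant memory bottleneck for efficient deployment. In this work, we revisit the inherent limitations of the per-channel quantization paradigm. We propose OScaR, an accurate and lightweight KV cache compression framework for X-LLMs.
Reference score (attention score computed independently by this site): 50.440302567029654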
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: The rapid advancement toward long-context reasoning and multi-modal intelligence has made the memory footprint of the Key-Value (KV) cache a dominant memory bottleneck for efficient deployment. While the established per-channel quantization effectively accommodates intrinsic channel-wise outliers in Key tensors, its efficacy diminishes under extreme compression. In this work, we revisit the inherent limitations of the per-channel quantization paradigm from both empirical and theoretical perspectives. Our analysis identifies Token Norm Imbalance (TNI) as the primary bottleneck to quantization fidelity. We demonstrate that TNI systematically amplifies errors when shared quantization parameters are required to span token groups exhibiting substantial norm disparities. Instead of relying on intricate quantization pipelines (e.g., TurboQuant), we propose OScaR (Omni-Scaled Canalized Rotation), an accurate and lightweight KV cache compression framework for X-LLMs (i.e., text-only, multi-modal, and omni-modal LLMs). Advancing the per-channel paradigm, OScaR employs Canalized Rotation followed by Omni-Token Scaling to mitigate TNI-induced sequence-dimensional variance both effectively and efficiently, further supported by our optimized system design and CUDA kernels. Extensive evaluations across X-LLMs show that OScaR consistently outperforms existing methods and achieves near-lossless performance under INT2 quantization, establishing it as a robust, low-complexity, and universal framework that defines a new Pareto front. Compared with the BF16 FlashDecoding-v2 baseline, our OScaR implementation achieves up to a 3.0x decoding speedup, reduces memory footprint by 5.3x, and increases throughput by 4.1x. The code for OScaR is publicly available at https://github.com/ZunhaiSu/OScaR-KV-Quant.
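Abstract (reference translation): Rapid progress toward long-context reasoning and multi-modal intelligence has made the memory footprint of the Key-Value (KV) cache a dominant memory bottleneck for efficient deployment. The established per-channel quantization effectively accommodates the intrinsic channel-wise outliers in Key tensors, but its effectiveness declines under extreme compression. In this work, we revisit the inherent limitations of the per-channel quantization paradigm from both empirical and theoretical perspectives. Our analysis identifies Token Norm Imbalance (TNI) as the primary bottleneck to quantization fidelity. We show that TNI systematically amplifies errors when shared quantization parameters must span token groups with substantial norm disparities. Instead of relying on intricate quantization pipelines (e.g., TurboQuant), we propose OScaR (Omni-Scaled Canalized Rotation), an accurate and lightweight KV cache compression framework for X-LLMs. Improving on the per-channel paradigm, OScaR uses Canalized Rotation and Omni-Token Scaling to mitigate TNI-induced variance along the sequence dimension effectively and efficiently, further supported by an optimized system design and CUDA kernels. Extensive evaluations across X-LLMs show that OScaR consistently outperforms existing methods and achieves near-lossless performance under INT2 quantization, establishing it as a robust, low-complexity, and universal framework that defines a new Pareto front. Compared with the BF16 FlashDecoding-v2 baseline, the OScaR implementation achieves up to a 3.0x speedup in decoding, reduces memory footprint by 5.3x, and increases throughput by 4.1x. The code for OScaR is publicly available at https://github.com/ZunhaiSu/OScaR-KV-Quant.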
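
The Token Norm Imbalance effect described in the abstract can be illustrated with a toy experiment. The sketch below is not the OScaR method: it omits Canalized Rotation entirely, and dividing each token by its L2 norm is only an assumed stand-in for the paper's Omni-Token Scaling. All function names, the synthetic Gaussian keys, and the choice of treating the whole sequence as one quantization group are hypothetical choices made for illustration.

```python
import math
import random


def quantize_per_channel(rows, bits=2):
    """Asymmetric uniform quantization with one (min, scale) pair per channel.

    The parameters of each channel (column) are shared by every token (row),
    which is the situation in which token norm disparities hurt.
    Returns the dequantized values.
    """
    levels = (1 << bits) - 1
    n_tokens, n_channels = len(rows), len(rows[0])
    out = [[0.0] * n_channels for _ in range(n_tokens)]
    for c in range(n_channels):
        col = [rows[t][c] for t in range(n_tokens)]
        lo, hi = min(col), max(col)
        scale = (hi - lo) / levels or 1.0  # avoid a zero step for constant channels
        for t in range(n_tokens):
            q = round((col[t] - lo) / scale)  # integer code in [0, levels]
            out[t][c] = lo + q * scale
    return out


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def quantize_with_token_scaling(rows, bits=2):
    """Assumed stand-in for token scaling: normalize each token to unit norm,
    quantize per channel, then multiply the stored norm back."""
    norms = [l2_norm(r) or 1.0 for r in rows]
    unit = [[x / n for x in r] for r, n in zip(rows, norms)]
    deq = quantize_per_channel(unit, bits)
    return [[x * n for x in r] for r, n in zip(deq, norms)]


def mean_relative_error(original, restored):
    """Average over tokens of ||x - x_hat|| / ||x||."""
    total = 0.0
    for x, y in zip(original, restored):
        diff = [a - b for a, b in zip(x, y)]
        total += l2_norm(diff) / (l2_norm(x) or 1.0)
    return total / len(original)


def demo(n_tokens=64, n_channels=32, seed=0):
    """Synthetic keys in which every 16th token has a much larger norm."""
    rng = random.Random(seed)
    keys = []
    for t in range(n_tokens):
        gain = 20.0 if t % 16 == 0 else 1.0
        keys.append([gain * rng.gauss(0.0, 1.0) for _ in range(n_channels)])
    plain = quantize_per_channel(keys)
    scaled = quantize_with_token_scaling(keys)
    return mean_relative_error(keys, plain), mean_relative_error(keys, scaled)


if __name__ == "__main__":
    err_plain, err_scaled = demo()
    print(f"per-channel only:        {err_plain:.3f}")
    print(f"with token norm scaling: {err_scaled:.3f}")
```

In this toy setup the few high-norm tokens stretch every channel's quantization range, so the low-norm tokens are reconstructed poorly at 2 bits; equalizing token norms first lets the shared per-channel parameters fit all tokens. The price is one extra stored norm per token, which a real implementation would have to account for in its memory budget.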


Related papers