Fugu-MT Paper Translation (Summary): Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost

Paper summary: Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost

arxiv url: http://arxiv.org/abs/2511.18643v1
Date: Sun, 23 Nov 2025 22:54:48 GMT
Status: Translation complete
Last updated in system: 2025-11-25 18:34:24.946246
Title: Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost
Title (reference translation): Kitty: Accurate and Efficient Quantization of the 2-bit KV Cache via Dynamic Channel-wise Precision Boost
Authors: Haojun Xia, Xiaoxia Wu, Jisen Li, Robert Wu, Junxiong Wang, Jue Wang, Chenxi Li, Aman Singhal, Alay Dilipbhai Shah, Alpay Ariyak, Donglin Zhuang, Zhongzhu Zhou, Ben Athiwaratkun, Zhen Zheng, Shuaiwen Leon Song
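Abstract summary: Kitty is an algorithm-system co-design for mixed-precision KV caching. It cuts KV memory by nearly 8x with negligible accuracy loss, enabling up to 8x larger batches and 2.1x-4.1x higher throughput.
Reference score (independently computed attention score): 24.865752290192372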
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The KV cache is a dominant memory bottleneck for LLM inference. While 4-bit KV quantization preserves accuracy, 2-bit often degrades it, especially on long-context reasoning. We close this gap via an algorithm-system co-design for mixed-precision KV caching: Kitty. On the algorithm side, extensive experiments show that Dynamic Channel-wise Precision Boost -- which ranks Key-cache channels by sensitivity and keeps only a small fraction at higher precision -- maintains near-zero accuracy loss while approaching 2-bit memory. The main challenge is handling dynamic 4-bit channel boosts while keeping the page layout coalesced and the dequantization uniform, with no scattered reads or hard-coded masks. Kitty addresses these issues by decomposing each mixed-precision Key page into two tensors with unified 2-bit precision. Based on this, Kitty provides a page-centric KV layout, Triton-compatible page dequantization kernels, and a lightweight runtime pipeline that preserves coalescing and avoids divergence. Across seven tasks and two model families (Qwen3, LLaMA3), Kitty cuts KV memory by nearly 8x with negligible accuracy loss, enabling up to 8x larger batches and 2.1x-4.1x higher throughput under the same memory budget. We release the full implementation of Kitty at https://github.com/Summer-Summer/Kitty.
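Abstract (reference translation): The KV cache is a dominant memory bottleneck in LLM inference. 4-bit KV quantization preserves accuracy, but 2-bit often degrades it. We close this gap with an algorithm-system co-design for mixed-precision KV caching: Kitty. On the algorithm side, we show that Dynamic Channel-wise Precision Boost, which ranks Key-cache channels by sensitivity and keeps only a small fraction at higher precision, maintains near-zero accuracy loss while approaching 2-bit memory. The main challenge is handling dynamic 4-bit channel boosts while keeping the page layout coalesced and the dequantization uniform, with no scattered reads or hard-coded masks. Kitty addresses these issues by decomposing each mixed-precision Key page into two tensors with unified 2-bit precision. Building on this, Kitty provides a page-centric KV layout, Triton-compatible page dequantization kernels, and a lightweight runtime pipeline that preserves coalescing and avoids divergence. Across seven tasks and two model families (Qwen3, LLaMA3), Kitty cuts KV memory by nearly 8x with negligible accuracy loss, enabling up to 8x larger batches and 2.1x-4.1x higher throughput. The full implementation of Kitty is available at https://github.com/Summer-Summer/Kitty.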
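
The abstract's idea of boosting a few Key-cache channels to 4 bits and then storing the page as two uniform 2-bit tensors can be illustrated with a small NumPy sketch. This is not the authors' implementation (that is in the linked repository). Everything here is an assumption made for illustration: the function names, the `boost_fraction` value, the use of per-channel value range as the sensitivity proxy, and the choice to split each 4-bit code into its high and low 2-bit halves.

```python
import numpy as np

def quantize_key_page(page, boost_fraction=0.125):
    """Quantize a (tokens, channels) Key page; hypothetical helper, not Kitty's API."""
    _, channels = page.shape
    lo = page.min(axis=0)
    span = np.maximum(page.max(axis=0) - lo, 1e-8)

    # Assumed sensitivity proxy: channels with the widest value range.
    n_boost = max(1, int(round(channels * boost_fraction)))
    boosted = np.sort(np.argsort(span)[-n_boost:])

    # 2-bit codes by default, 4-bit codes for the boosted channels.
    bits = np.full(channels, 2)
    bits[boosted] = 4
    levels = (1 << bits) - 1            # 3 or 15 per channel
    scale = span / levels
    codes = np.clip(np.rint((page - lo) / scale), 0, levels).astype(np.uint8)

    # Assumed decomposition: every stored value fits in 2 bits.
    # "base" holds all channels (high 2 bits for boosted ones),
    # "extra" holds the low 2 bits of the boosted channels only.
    base = codes.copy()
    base[:, boosted] = codes[:, boosted] >> 2
    extra = codes[:, boosted] & 0b11
    return {"base": base, "extra": extra, "boosted": boosted,
            "lo": lo, "scale": scale}

def dequantize_key_page(q):
    """Rebuild approximate float values from the two 2-bit tensors."""
    codes = q["base"].astype(np.float32)
    codes[:, q["boosted"]] = codes[:, q["boosted"]] * 4 + q["extra"]
    return codes * q["scale"] + q["lo"]

def effective_bits(channels, n_boost):
    """Average bits per Key element, ignoring scale and offset metadata."""
    return (2 * channels + 2 * n_boost) / channels
```

Under these assumptions, a page with 128 channels and 16 boosted channels averages 2.25 bits per Key element before metadata, which is why boosting only a small fraction of channels stays close to 2-bit memory. Because both tensors hold plain 2-bit codes, a single dequantization path can serve them without per-channel masks.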


Related papers