Fugu-MT Paper Translation (Summary): Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs

Paper summary: Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs

arxiv url: http://arxiv.org/abs/2604.04722v1
Date: Mon, 06 Apr 2026 14:45:49 GMT
Status: Translation complete
Last updated in system: 2026-04-07 15:49:19.23039
Title: Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs
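Title (reference translation): Don't waste bits! Adaptive KV-cache quantization for lightweight on-device LLMs
Authors: Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Patrick Woods, Gabriel Hillesheim, Abolfazl Razi
Abstract summary: Large language models (LLMs) have made remarkable progress on reasoning, generation, and decision-making tasks. On-device LLM inference is heavily constrained by the memory and bandwidth overhead of the key-value (KV) cache. This paper proposes adaptive KV-cache quantization, a learned policy that assigns bit-width in proportion to token importance.
Reference score (independently computed notability): 8.332279450103151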
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have achieved remarkable progress across reasoning, generation, and decision-making tasks, yet deploying them on mobile, embedded, and edge devices remains particularly challenging. On-device LLM inference is heavily constrained by the memory and bandwidth overhead of the key-value (KV) cache, which grows linearly with context length and often dominates decoding cost. Existing KV-cache quantization schemes typically rely on fixed precision or hand-crafted heuristics, thereby wasting bits on low-impact tokens while over-compressing informative ones, leading to avoidable accuracy degradation. Inspired by Huffman coding's principle of variable-length allocation, we propose adaptive KV-cache quantization, a learned policy that assigns bit-width proportional to token importance, minimizing expected memory and latency without sacrificing competitive accuracy. Our framework extracts lightweight token-level features, including token frequency, quality score, attention variance, and entropy-based uncertainty, and feeds them into a compact data-driven controller that dynamically selects KV precision from {2-bit, 4-bit, 8-bit, FP16} during decoding. This adaptive precision policy reduces KV memory footprint and latency while improving accuracy compared to static KV quantization and rule-based baselines, and maintaining competitive accuracy close to FP16 inference across standard LLM benchmarks. Extensive experiments across multiple commonsense reasoning benchmarks using SmolLM-135M, SmolLM-360M, and SmolLM-1.7B demonstrate that our controller consistently improves the accuracy-latency trade-off. For instance, with SmolLM-360M on HellaSwag, our method reduces decoding latency (ms/token) by 17.75% relative to static KV quantization, improves accuracy by 7.60 points, and remains within only 0.30 points of FP16 inference.
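The abstract's point that the KV cache grows linearly with context length, and that per-token bit-widths change its footprint, can be checked with simple arithmetic. The sketch below is only an illustration: the model shape (`LAYERS`, `KV_HEADS`, `HEAD_DIM`) and the split of tokens across bit-widths are hypothetical values, not figures from the paper, and the calculation ignores the per-group scale and zero-point metadata that real quantized caches also store.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value):
    """Bytes needed to hold the keys and values of one sequence."""
    # Each token stores one key and one value vector per layer and KV head.
    values_per_token = 2 * num_layers * num_kv_heads * head_dim
    return values_per_token * seq_len * bits_per_value / 8


def mixed_precision_bytes(num_layers, num_kv_heads, head_dim, tokens_per_bits):
    """tokens_per_bits maps a bit-width to the number of tokens stored at it."""
    return sum(
        kv_cache_bytes(num_layers, num_kv_heads, head_dim, count, bits)
        for bits, count in tokens_per_bits.items()
    )


# Hypothetical model shape and token split, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 30, 3, 64
SPLIT = {2: 1024, 4: 512, 8: 384, 16: 128}  # 2048 tokens in total

fp16_bytes = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, 2048, 16)
mixed_bytes = mixed_precision_bytes(LAYERS, KV_HEADS, HEAD_DIM, SPLIT)
average_bits = sum(bits * n for bits, n in SPLIT.items()) / sum(SPLIT.values())
```

With this split the average width is 4.5 bits per value, so the mixed-precision cache takes 4.5/16 of the FP16 footprint under the stated assumptions.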
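Abstract (reference translation): Large language models (LLMs) have made remarkable progress across reasoning, generation, and decision-making tasks, but deploying them on mobile, embedded, and edge devices remains particularly difficult. On-device LLM inference is heavily constrained by the memory and bandwidth overhead of the key-value (KV) cache. Existing KV-cache quantization schemes typically rely on fixed precision or hand-crafted heuristics, wasting bits on low-impact tokens while over-compressing informative ones, which causes avoidable accuracy degradation. Inspired by the variable-length allocation principle of Huffman coding, adaptive KV-cache quantization is a learned policy that assigns bit-width in proportion to token importance, minimizing expected memory and latency without sacrificing competitive accuracy. The framework extracts lightweight token-level features, including token frequency, quality score, attention variance, and entropy-based uncertainty, and feeds them into a compact data-driven controller that dynamically selects the KV precision from {2-bit, 4-bit, 8-bit, FP16} during decoding. This adaptive precision policy reduces KV memory footprint and latency while improving accuracy compared with static KV quantization and rule-based baselines, and it maintains competitive accuracy close to FP16 inference across standard LLM benchmarks. Extensive experiments on multiple commonsense reasoning benchmarks using SmolLM-135M, SmolLM-360M, and SmolLM-1.7B show that the controller consistently improves the accuracy-latency trade-off. For example, with SmolLM-360M on HellaSwag, the method reduces decoding latency (ms/token) by 17.75% relative to static KV quantization, improves accuracy by 7.60 points, and stays within 0.30 points of FP16 inference.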
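The per-token precision selection described in the abstract can be sketched roughly as follows. This is not the paper's learned controller: a hand-set linear score with fixed cut points stands in for it, and the feature ordering, the assumption that features are normalized to [0, 1], the `weights`, and the `thresholds` are all hypothetical. The quantizer is a plain uniform min-max round trip, used only to show how a chosen bit-width affects reconstruction error.

```python
BIT_CHOICES = (2, 4, 8, 16)  # 16 stands for FP16, kept unquantized here


def select_bits(features, weights, thresholds=(0.25, 0.5, 0.75)):
    """Map token features to a bit-width (stand-in for a learned controller).

    features: hypothetical tuple (frequency, quality, attention variance,
    entropy), each assumed to be normalized to [0, 1].
    """
    score = sum(w * f for w, f in zip(weights, features))
    level = sum(score >= t for t in thresholds)  # number of cut points passed, 0..3
    return BIT_CHOICES[level]


def quantize_dequantize(values, bits):
    """Uniform min-max quantization round trip; 16 bits means pass-through."""
    if bits >= 16:
        return list(values)
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    scale = (hi - lo) / ((1 << bits) - 1)  # step between adjacent levels
    return [lo + round((v - lo) / scale) * scale for v in values]


def max_abs_error(values, bits):
    """Largest reconstruction error after the quantization round trip."""
    restored = quantize_dequantize(values, bits)
    return max(abs(a - b) for a, b in zip(values, restored))


# Hypothetical equal weighting of the four features.
WEIGHTS = (0.25, 0.25, 0.25, 0.25)
low_importance = select_bits((0.0, 0.0, 0.0, 0.0), WEIGHTS)
mid_importance = select_bits((0.5, 0.5, 0.5, 0.5), WEIGHTS)
high_importance = select_bits((1.0, 1.0, 1.0, 1.0), WEIGHTS)
```

Under these assumptions a token whose features are all zero is stored at 2 bits, while one whose features are all one stays in FP16; a trained controller would replace `select_bits` with a model fitted to data.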

Related papers