Fugu-MT Paper Translation (Abstract): UltraQuant: 4-bit KV Caching for Context-Heavy Agents

Paper abstract: UltraQuant: 4-bit KV Caching for Context-Heavy Agents

arxiv url: http://arxiv.org/abs/2606.20474v2
Date: Fri, 19 Jun 2026 08:02:54 GMT
Status: Translation complete
Last updated in system: 2026-06-23 13:41:31.04053
Title: UltraQuant: 4-bit KV Caching for Context-Heavy Agents
Authors: Inesh Chakrabarti, David Limpus, Aditi Ghai Rana, Bowen Bao, Spandan Tiwari, Thiago Crepaldi, Ashish Sirasao
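Abstract summary: Context-heavy agents place unusual pressure on the key-value (KV) cache. For this setting, the paper studies 4-bit KV-cache compression using TurboQuant-style rotation and codebook quantization. It presents serving optimizations on AMD GPUs, including optimized decode-attention kernels and UltraQuant.
Reference score (attention score computed independently by this site): 2.0497179932020444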
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Context-heavy agents place unusual pressure on the key-value (KV) cache: long prefixes are reused across many short turns, while concurrency determines whether the serving system can keep GPUs utilized. We study 4-bit KV-cache compression for this setting, using TurboQuant-style rotation and codebook quantization as a quality anchor and vLLM FP8 KV caching as the deployment anchor. We report three contributions. First, we frame 4-bit KV caching around multi-round agent workloads where task quality, cache residency, and serving throughput must be measured jointly. Second, we describe the practical design choices needed to make the 4-bit path robust, including asymmetric K/V treatment, Walsh-Hadamard rotation, QJL removal, and block-scale variants. Third, we present serving optimizations on AMD GPUs, including optimized decode-attention kernels and UltraQuant, an FP4 approximation path that uses FP8 queries, FP4 KV tensors, UE8M0 group scales, and native scaled-MFMA support on CDNA4. On a long-context, multi-turn agentic workload, UltraQuant cuts P50 time-to-first-token by 3.47x in the cache-pressured late rounds (2.3x across all rounds) and raises output throughput by 1.63x over the FP8 KV baseline.
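The abstract names Walsh-Hadamard rotation and low-bit KV tensors with group scales but gives no implementation detail. The following is a minimal sketch of the general idea only, not the authors' method: it rotates a vector with an orthonormal Walsh-Hadamard transform so that an outlier channel is spread across all coordinates, then quantizes to 4 bits with one power-of-two scale per group. The function names, the group size of 32, and the use of a uniform signed 4-bit integer grid (in place of the FP4 format and the codebook quantization the abstract mentions) are all assumptions made for illustration.

```python
import math


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (its own inverse).

    The length of vec must be a power of two.
    """
    n = len(vec)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    out = list(vec)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [x / norm for x in out]


def quantize_int4_groups(vec, group_size=32):
    """Symmetric 4-bit quantization with one power-of-two scale per group.

    Assumption: a uniform integer grid in [-7, 7] stands in for FP4, and the
    scale is stored as an exponent only (in the spirit of a UE8M0 scale).
    """
    codes, exponents = [], []
    for g in range(0, len(vec), group_size):
        group = vec[g:g + group_size]
        amax = max(abs(x) for x in group)
        # Smallest power-of-two scale such that amax / scale <= 7.
        exp = math.ceil(math.log2(amax / 7)) if amax > 0 else 0
        scale = 2.0 ** exp
        exponents.append(exp)
        codes.extend(max(-7, min(7, round(x / scale))) for x in group)
    return codes, exponents


def dequantize(codes, exponents, group_size=32):
    """Map 4-bit codes back to floats using each group's exponent."""
    return [c * 2.0 ** exponents[i // group_size] for i, c in enumerate(codes)]


# Hypothetical key vector with a single outlier channel.
key = [0.0] * 64
key[3] = 8.0
rotated = fwht(key)                        # outlier energy spread over 64 coordinates
codes, exps = quantize_int4_groups(rotated)
recovered = fwht(dequantize(codes, exps))  # rotate back after dequantization
```

Because the normalized transform is orthogonal, dot products between rotated queries and rotated keys match those of the originals, which is why a rotation of this kind can be applied before quantization without changing attention scores in exact arithmetic.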
