Fugu-MT Paper Translation (Abstract): Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization

Paper abstract: Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization

arxiv url: http://arxiv.org/abs/2503.18599v1
Date: Mon, 24 Mar 2025 11:56:50 GMT
Status: Translation complete
System update date: 2025-03-25 16:32:17.252219
Title: Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization
Title (reference translation): Oaken: Fast and Efficient LLM Serving via Online-Offline Hybrid KV Cache Quantization
Authors: Minsu Kim, Seongmin Hong, RyeoWook Ko, Soongyu Choi, Hunjong Lee, Junsoo Kim, Joo-Young Kim, Jongse Park
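Abstract summary: We propose Oaken, an acceleration solution that achieves high accuracy and high performance simultaneously. Oaken employs an online-offline hybrid approach, setting outlier thresholds offline, which are then used to determine the quantization scale online. Our experiments show that for a batch size of 256, Oaken achieves up to 1.58x throughput improvement over an A100 GPU, with a minimal accuracy loss of only 0.54% on average.
Reference score (independently calculated attention score): 17.202495171443932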
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Modern Large Language Model (LLM) serving systems batch multiple requests to achieve high throughput, but batching attention operations is challenging, which makes memory bandwidth a critical bottleneck. The community relies on high-end GPUs with multiple high-bandwidth memory (HBM) channels. Unfortunately, HBM's high bandwidth often comes at the expense of limited memory capacity, which reduces core utilization and increases costs. Recent advancements enabling longer contexts for LLMs have substantially increased the key-value (KV) cache size, further intensifying the pressure on memory capacity. The literature has explored KV cache quantization techniques, which commonly use a low bitwidth for most values and selectively use a higher bitwidth for outlier values. While this approach helps achieve high accuracy and low bitwidth simultaneously, it has the limitation that the cost of online outlier detection is excessively high, negating the advantages. We propose Oaken, an acceleration solution that achieves high accuracy and high performance simultaneously through algorithm-hardware co-design. To effectively find a sweet spot in the accuracy-performance trade-off space of KV cache quantization, Oaken employs an online-offline hybrid approach, setting outlier thresholds offline, which are then used to determine the quantization scale online. To translate the proposed algorithmic technique into tangible performance gains, Oaken also comes with custom quantization engines and memory management units that can be integrated with any LLM accelerator. We built an Oaken accelerator on top of an LLM accelerator, LPU, and conducted a comprehensive evaluation. Our experiments show that for a batch size of 256, Oaken achieves up to 1.58x throughput improvement over an NVIDIA A100 GPU, incurring a minimal accuracy loss of only 0.54% on average, compared to state-of-the-art KV cache quantization techniques.
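Abstract (reference translation): Modern Large Language Model serving systems batch multiple requests to achieve high throughput, but batching attention operations is difficult, so memory bandwidth becomes a critical bottleneck. The community relies on high-end GPUs with multiple high-bandwidth memory channels. Unfortunately, HBM's high bandwidth comes at the cost of limited memory capacity, which lowers core utilization and raises costs. Recent advances that enable longer contexts for LLMs have greatly increased the key-value cache size, further intensifying the pressure on memory capacity. The literature has explored KV cache quantization techniques that use a low bitwidth for most values and selectively use a higher bitwidth for outlier values. This approach helps achieve high accuracy and low bitwidth at the same time, but the cost of online outlier detection is excessively high and negates the advantages. We propose Oaken, an acceleration solution that achieves high accuracy and high performance simultaneously by co-designing the algorithm and the hardware. To effectively find a sweet spot in the accuracy-performance trade-off space of KV cache quantization, Oaken adopts an online-offline hybrid approach: outlier thresholds are set offline and then used to determine the quantization scale online. To turn the proposed algorithmic technique into concrete performance gains, Oaken includes custom quantization engines and memory management units that can be integrated with any LLM accelerator. We built an Oaken accelerator on top of an LLM accelerator, LPU, and carried out a comprehensive evaluation. Our experiments show that for a batch size of 256, Oaken achieves up to 1.58x throughput improvement over an NVIDIA A100 GPU, with an accuracy loss of only 0.54% on average compared to state-of-the-art KV cache quantization techniques.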
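
The hybrid idea described in the abstract (outlier thresholds fixed offline, quantization scale computed online) can be illustrated with a minimal Python sketch. This is not the paper's algorithm or hardware design. The function names, the single magnitude threshold, the 5% outlier fraction, the 4-bit/8-bit split, and the min-max scaling are all assumptions chosen for illustration. The point it tries to show is that the online path only compares each value against a precomputed threshold, so no sorting or top-k search runs at serving time.

```python
import random


def calibrate_threshold(samples, outlier_fraction=0.05):
    """Offline step (hypothetical): pick a magnitude threshold from calibration data."""
    magnitudes = sorted(abs(v) for v in samples)
    # Index of the smallest magnitude that is treated as an outlier.
    cut = int(len(magnitudes) * (1.0 - outlier_fraction))
    cut = min(max(cut, 0), len(magnitudes) - 1)
    return magnitudes[cut]


def quantize_group(values, bits):
    """Uniform min-max quantization; the scale is derived online from the values."""
    if not values:
        return [], 0.0, 0.0
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate real values."""
    return [c * scale + lo for c in codes]


def quantize_vector(vector, threshold, inlier_bits=4, outlier_bits=8):
    """Online step: one comparison per value against the offline threshold."""
    inlier_idx = [i for i, v in enumerate(vector) if abs(v) < threshold]
    outlier_idx = [i for i, v in enumerate(vector) if abs(v) >= threshold]
    return {
        "n": len(vector),
        "inlier_idx": inlier_idx,
        "outlier_idx": outlier_idx,
        # Assumed bitwidths: low for inliers, higher for outliers.
        "inliers": quantize_group([vector[i] for i in inlier_idx], inlier_bits),
        "outliers": quantize_group([vector[i] for i in outlier_idx], outlier_bits),
    }


def dequantize_vector(packed):
    """Rebuild the vector by scattering both groups back to their positions."""
    restored = [0.0] * packed["n"]
    for idx_key, group_key in (("inlier_idx", "inliers"), ("outlier_idx", "outliers")):
        values = dequantize_group(*packed[group_key])
        for i, v in zip(packed[idx_key], values):
            restored[i] = v
    return restored


# Usage: calibrate once offline, then quantize a vector with two injected outliers.
rng = random.Random(0)
calibration = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
threshold = calibrate_threshold(calibration)

vector = [rng.gauss(0.0, 1.0) for _ in range(62)] + [25.0, -30.0]
packed = quantize_vector(vector, threshold)
restored = dequantize_vector(packed)
max_error = max(abs(a - b) for a, b in zip(vector, restored))
```

Because the two large values fall into the outlier group, they do not stretch the inlier range, so the low-bitwidth inlier scale stays small. A real design would also need a compact encoding of which positions are outliers; the explicit index lists here are only for readability.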
