Fugu-MT Paper Translation (Abstract): COMET: Towards Partical W4A4KV4 LLMs Serving

Paper overview: COMET: Towards Partical W4A4KV4 LLMs Serving

arxiv url: http://arxiv.org/abs/2410.12168v1
Date: Wed, 16 Oct 2024 02:16:53 GMT
Status: Translation complete
System update date: 2024-11-28 17:07:36.102933
Title: COMET: Towards Partical W4A4KV4 LLMs Serving
Title (reference translation): COMET: Toward Practical W4A4KV4 LLM Serving
Authors: Lian Liu, Haimeng Ren, Long Cheng, Zhaohui Xu, Yudong Pan, Mengdi Wang, Xiaowei Li, Yinhe Han, Ying Wang
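Abstract summary: Quantization is a compression technique for reducing the overhead of serving large language models (LLMs) on terminal devices and in cloud data centers. This paper proposes a novel mixed-precision quantization algorithm (FMPQ) that compresses most activations to 4 bits with negligible accuracy loss. The authors integrate the optimized W4Ax kernel into the inference framework COMET and provide efficient management to support popular LLMs.
Reference score (independently computed attention level): 37.30529940231099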
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Quantization is a widely used compression technology to reduce the overhead of serving large language models (LLMs) on terminal devices and in cloud data centers. However, prevalent quantization methods, such as 8-bit weight-activation or 4-bit weight-only quantization, achieve limited performance improvements due to poor support for low-precision (e.g., 4-bit) activation. This work, for the first time, realizes practical W4A4KV4 serving for LLMs, fully utilizing the INT4 tensor cores on modern GPUs and reducing the memory bottleneck caused by the KV cache. Specifically, we propose a novel fine-grained mixed-precision quantization algorithm (FMPQ) that compresses most activations into 4-bit with negligible accuracy loss. To support mixed-precision matrix multiplication for W4A4 and W4A8, we develop a highly optimized W4Ax kernel. Our approach introduces a novel mixed-precision data layout to facilitate access and fast dequantization for activation and weight tensors, utilizing the GPU's software pipeline to hide the overhead of data loading and conversion. Additionally, we propose fine-grained streaming multiprocessor (SM) scheduling to achieve load balance across different SMs. We integrate the optimized W4Ax kernel into our inference framework, COMET, and provide efficient management to support popular LLMs such as LLaMA-3-70B. Extensive evaluations demonstrate that, when running LLaMA family models on a single A100-80G-SXM4, COMET achieves a kernel-level speedup of $2.88\times$ over cuBLAS and a $2.02\times$ throughput improvement compared to TensorRT-LLM from an end-to-end framework perspective.
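Abstract (reference translation): Quantization is a widely used compression technique for reducing the overhead of serving large language models (LLMs) on terminal devices and in cloud data centers. However, common quantization methods, such as 8-bit weight-activation quantization and 4-bit weight-only quantization, deliver only limited performance gains because low-precision (for example, 4-bit) activations are poorly supported. This work realizes practical W4A4KV4 serving for LLMs, fully exploiting the INT4 tensor cores on current GPUs and reducing the memory bottleneck caused by the KV cache. Specifically, it proposes a novel fine-grained mixed-precision quantization algorithm (FMPQ) that compresses most activations to 4 bits with negligible accuracy loss. To support mixed-precision matrix multiplication for W4A4 and W4A8, the authors develop a highly optimized W4Ax kernel. The approach introduces a mixed-precision data layout that eases access to, and fast dequantization of, activation and weight tensors, and it uses the GPU's software pipeline to hide the overhead of data loading and conversion. In addition, it proposes fine-grained streaming multiprocessor (SM) scheduling to balance load across SMs. The optimized W4Ax kernel is integrated into the inference framework COMET, which provides efficient management to support popular LLMs such as LLaMA-3-70B. Extensive evaluations show that, when running LLaMA family models on a single A100-80G-SXM4, COMET achieves a kernel-level speedup of $2.88\times$ over cuBLAS and a $2.02\times$ throughput improvement over TensorRT-LLM from an end-to-end framework perspective.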
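Illustrative sketch (not from the paper): the abstract describes compressing most activations to 4 bits while keeping some at 8 bits. The Python below is a minimal, hypothetical illustration of that general idea using symmetric per-block quantization. It is not the FMPQ algorithm itself; the max-abs outlier criterion, the block size, the threshold value, and all function names are assumptions chosen for the example.

```python
from typing import List, Tuple

# One quantized block: (bit width, scale, integer codes)
Block = Tuple[int, float, List[int]]


def quantize_block(block: List[float], bits: int) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4, 127 for INT8
    max_abs = max(abs(x) for x in block)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(x / scale))) for x in block]
    return codes, scale


def mixed_precision_quantize(
    acts: List[float], block_size: int = 4, outlier_threshold: float = 8.0
) -> List[Block]:
    """Quantize activations block by block.

    Assumption for this sketch: a block whose largest magnitude exceeds
    `outlier_threshold` is treated as an outlier block and kept at 8 bits;
    every other block is compressed to 4 bits.
    """
    blocks: List[Block] = []
    for start in range(0, len(acts), block_size):
        block = acts[start:start + block_size]
        bits = 8 if max(abs(x) for x in block) > outlier_threshold else 4
        codes, scale = quantize_block(block, bits)
        blocks.append((bits, scale, codes))
    return blocks


def dequantize(blocks: List[Block]) -> List[float]:
    """Map integer codes back to floating-point values."""
    return [code * scale for _, scale, codes in blocks for code in codes]


if __name__ == "__main__":
    activations = [0.5, -1.0, 0.25, 0.75, 0.1, 60.0, -0.3, 0.2]
    quantized = mixed_precision_quantize(activations)
    for bits, scale, codes in quantized:
        print(f"INT{bits} scale={scale:.4f} codes={codes}")
    print(dequantize(quantized))
```

In this toy input the second block contains one large value, so it stays at 8 bits while the first block is stored at 4 bits. Choosing precision per block, rather than per tensor, is what lets a single outlier avoid forcing the whole tensor to the wider format.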
