Fugu-MT Paper Translation (Summary): SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models

Paper summary: SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models

arxiv url: http://arxiv.org/abs/2405.14917v1
Date: Thu, 23 May 2024 16:21:48 GMT
Status: Translation complete
System update date: 2024-05-27 19:48:22.499105
Title: SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models
Title (reference translation): SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models
Authors: Wei Huang, Haotong Qin, Yangdong Liu, Yawei Li, Xianglong Liu, Luca Benini, Michele Magno, Xiaojuan Qi
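Abstract summary: Post-training quantization (PTQ) is a powerful compression technique studied for large language models (LLMs). Existing PTQ methods are not ideal in terms of accuracy and efficiency, especially at bit-widths below 4. This paper proposes a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM.
Reference score (independently calculated attention level): 67.67135738642547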
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) achieve remarkable performance in natural language understanding but require substantial computation and memory resources. Post-training quantization (PTQ) is a powerful compression technique that has been extensively investigated for LLMs. However, existing PTQ methods are still not ideal in terms of accuracy and efficiency, especially at bit-widths below 4. Standard PTQ methods that use group-wise quantization struggle to quantize LLMs accurately at such low bit-widths, while advanced methods that retain high-precision weights element-wise struggle to realize their theoretical hardware efficiency. This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM. The scheme exploits the salience distribution of weights to determine the optimal bit-width and quantizers for accurate LLM quantization, while aligning the bit-width partition to groups for compact memory usage and fast integer inference. Specifically, SliM-LLM relies mainly on two novel techniques: (1) Salience-Determined Bit Allocation uses the clustering characteristics of the salience distribution to allocate the bit-width of each group, increasing the accuracy of quantized LLMs while maintaining inference efficiency; (2) Salience-Weighted Quantizer Calibration optimizes the quantizer parameters by considering the element-wise salience within each group, balancing the preservation of salient information against the minimization of errors. Comprehensive experiments show that SliM-LLM significantly improves the accuracy of LLMs at ultra-low bit-widths; e.g., 2-bit LLaMA-7B achieves a 5.5-times memory saving over the original model on NVIDIA A800 GPUs and a 48% decrease in perplexity compared to the state-of-the-art gradient-free PTQ method. Moreover, SliM-LLM+, which extends SliM-LLM with gradient-based quantizers, further reduces perplexity by 35.1%.
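Abstract (reference translation): Large language models (LLMs) achieve remarkable performance in natural language understanding but require considerable computation and memory resources. Post-training quantization (PTQ) is a powerful compression technique widely studied for LLMs. However, existing PTQ methods are still not ideal in terms of accuracy and efficiency, especially at bit-widths below 4. Standard PTQ methods based on group-wise quantization struggle to quantize LLMs accurately at such low bit-widths, while advanced methods that keep high-precision weights element-wise have difficulty realizing their theoretical hardware efficiency. This paper proposes a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM. The scheme uses the salience distribution of the weights to determine the optimal bit-width and quantizers for accurate LLM quantization, and aligns the bit-width partition to groups for compact memory usage and fast integer inference. Specifically, SliM-LLM relies mainly on two new techniques: (1) it uses the clustering characteristics of the salience distribution to allocate the bit-width of each group, raising the accuracy of quantized LLMs while maintaining inference efficiency; (2) it optimizes the quantizer parameters by taking the element-wise salience within each group into account, balancing the preservation of salient information against the minimization of errors. Comprehensive experiments show that SliM-LLM significantly improves the accuracy of LLMs at ultra-low bit-widths; for example, 2-bit LLaMA-7B achieves a 5.5-times memory saving over the original model on NVIDIA A800 GPUs and a 48% reduction in perplexity compared to the state-of-the-art gradient-free PTQ method. In addition, SliM-LLM+, an extension of SliM-LLM that integrates gradient-based quantizers, further reduces perplexity by 35.1%.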
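
The two techniques named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the salience scores passed in as plain lists, the rule that moves an equal number of groups up and down by one bit, and the clipping-ratio grid search are all assumptions made for illustration only.

```python
# Illustrative sketch only; every name and rule here is hypothetical.

def allocate_bits(group_salience, base_bits, num_shifted):
    """Assumed rule: the num_shifted most salient groups get base_bits + 1,
    the num_shifted least salient get base_bits - 1, so the mean stays base_bits."""
    if base_bits < 2 or 2 * num_shifted > len(group_salience):
        raise ValueError("invalid base_bits or num_shifted")
    order = sorted(range(len(group_salience)), key=lambda g: group_salience[g])
    bits = [base_bits] * len(group_salience)
    for g in order[:num_shifted]:
        bits[g] = base_bits - 1
    for g in order[len(order) - num_shifted:]:
        bits[g] = base_bits + 1
    return bits


def quant_params(weights, bits, ratio):
    """Scale and zero point of an asymmetric uniform quantizer whose
    range is the min/max of the group shrunk by `ratio`."""
    levels = 2 ** bits - 1
    lo, hi = min(weights) * ratio, max(weights) * ratio
    scale = (hi - lo) / levels
    if scale == 0:
        scale = 1.0  # degenerate group: all weights equal
    zero = round(-lo / scale)
    return scale, zero


def dequantize(weights, bits, scale, zero):
    """Quantize to integers in [0, 2**bits - 1], then map back to floats."""
    levels = 2 ** bits - 1
    out = []
    for w in weights:
        q = min(max(round(w / scale) + zero, 0), levels)
        out.append((q - zero) * scale)
    return out


def weighted_error(weights, salience, deq):
    """Squared reconstruction error, weighted per element by salience."""
    return sum(s * (w - d) ** 2 for w, s, d in zip(weights, salience, deq))


def calibrate(weights, salience, bits, ratios=(1.0, 0.9, 0.8, 0.7, 0.6, 0.5)):
    """Pick the clipping ratio that minimizes the salience-weighted error."""
    best = None
    for r in ratios:
        scale, zero = quant_params(weights, bits, r)
        err = weighted_error(weights, salience, dequantize(weights, bits, scale, zero))
        if best is None or err < best[2]:
            best = (scale, zero, err)
    return best
```

Because bit-widths are assigned per group rather than per element, every group can be stored and decoded with a single integer format, which is the property the abstract ties to compact memory usage and fast integer inference.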

Related papers