Fugu-MT Paper Translation (Abstract): Compressing Large Language Models using Low Rank and Low Precision Decomposition

Paper summary: Compressing Large Language Models using Low Rank and Low Precision Decomposition

arxiv url: http://arxiv.org/abs/2405.18886v1
Date: Wed, 29 May 2024 08:42:30 GMT
Status: Translation complete
Last updated in system: 2024-05-30 17:59:30.310258
Title: Compressing Large Language Models using Low Rank and Low Precision Decomposition
Title (reference translation): Large Language Model Compression Using Low-Rank, Low-Precision Decomposition
Authors: Rajarshi Saha, Naomi Sagan, Varun Srivastava, Andrea J. Goldsmith, Mert Pilanci
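Abstract summary: This work introduces $\rm CALDERA$, a new post-training LLM compression algorithm. It exploits the inherent low-rank structure of a weight matrix $\mathbf{W}$ and approximates it with a low-rank, low-precision decomposition. The results show that LlaMa-$2$ $7$B/$70$B and LlaMa-$3$ $8$B models compressed this way outperform existing post-training compression techniques.
Reference score (site-computed interest score): 46.30918750022739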
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The prohibitive sizes of Large Language Models (LLMs) today make it difficult to deploy them on memory-constrained edge devices. This work introduces $\rm CALDERA$, a new post-training LLM compression algorithm that harnesses the inherent low-rank structure of a weight matrix $\mathbf{W}$ by approximating it via a low-rank, low-precision decomposition as $\mathbf{W} \approx \mathbf{Q} + \mathbf{L}\mathbf{R}$. Here, $\mathbf{L}$ and $\mathbf{R}$ are low-rank factors, and the entries of $\mathbf{Q}$, $\mathbf{L}$ and $\mathbf{R}$ are quantized. The model is compressed by substituting each layer with its $\mathbf{Q} + \mathbf{L}\mathbf{R}$ decomposition, and the zero-shot performance of the compressed model is evaluated. Additionally, $\mathbf{L}$ and $\mathbf{R}$ are readily amenable to low-rank adaptation, which further enhances the zero-shot performance. $\rm CALDERA$ obtains this decomposition by formulating it as an optimization problem $\min_{\mathbf{Q},\mathbf{L},\mathbf{R}}\lVert(\mathbf{Q} + \mathbf{L}\mathbf{R} - \mathbf{W})\mathbf{X}^\top\rVert_{\rm F}^2$, where $\mathbf{X}$ is the calibration data, and $\mathbf{Q}, \mathbf{L}, \mathbf{R}$ are constrained to be representable using low-precision formats. Theoretical upper bounds on the approximation error of $\rm CALDERA$ are established using a rank-constrained regression framework, and the tradeoff between compression ratio and model performance is studied by analyzing the impact of target rank and quantization bit budget. Results illustrate that LlaMa-$2$ $7$B/$70$B and LlaMa-$3$ $8$B models compressed using $\rm CALDERA$ outperform those produced by existing post-training LLM compression techniques in the regime of less than $2.5$ bits per parameter. The implementation is available at: https://github.com/pilancilab/caldera.
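To make the decomposition $\mathbf{W} \approx \mathbf{Q} + \mathbf{L}\mathbf{R}$ concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a plain uniform quantizer and an alternating scheme that refits a truncated SVD to the residual. The function names (`quantize_uniform`, `low_rank_low_precision`, `calibration_error`), the bit widths, and the random test matrix are all hypothetical choices for illustration. Unlike the data-aware problem in the abstract, the fitting step here ignores the calibration data $\mathbf{X}$; the calibration objective is only evaluated at the end.

```python
import numpy as np


def quantize_uniform(a, bits):
    """Round `a` onto a uniform grid of 2**bits levels spanning [min(a), max(a)]."""
    lo, hi = float(a.min()), float(a.max())
    if hi == lo:
        return a.copy()
    step = (hi - lo) / (2 ** bits - 1)
    return lo + step * np.round((a - lo) / step)


def low_rank_low_precision(W, rank, bits_q=2, bits_lr=4, iters=10):
    """Alternate between a quantized backbone Q and quantized rank-`rank` factors L, R.

    Illustrative only: a simple unweighted scheme, assumed for this sketch.
    """
    L = np.zeros((W.shape[0], rank))
    R = np.zeros((rank, W.shape[1]))
    for _ in range(iters):
        # Quantize whatever the current low-rank term does not explain.
        Q = quantize_uniform(W - L @ R, bits_q)
        # Best rank-`rank` fit to the remaining residual, then quantize the factors.
        U, s, Vt = np.linalg.svd(W - Q, full_matrices=False)
        root = np.sqrt(s[:rank])
        L = quantize_uniform(U[:, :rank] * root, bits_lr)
        R = quantize_uniform(root[:, None] * Vt[:rank], bits_lr)
    return Q, L, R


def calibration_error(W, Q, L, R, X):
    """Objective from the abstract: squared Frobenius norm of (Q + L R - W) X^T."""
    return np.linalg.norm((Q + L @ R - W) @ X.T, "fro") ** 2


# Synthetic, approximately low-rank weight matrix and random calibration rows.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 48))
W += 0.05 * rng.standard_normal(W.shape)
X = rng.standard_normal((32, 48))

Q, L, R = low_rank_low_precision(W, rank=8)
print("calibration error:", calibration_error(W, Q, L, R, X))
```

Each row of `X` plays the role of one calibration input, so `X.T` has one column per sample, matching the $\mathbf{X}^\top$ in the objective.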
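Abstract (reference translation): The prohibitive sizes of today's Large Language Models (LLMs) make them difficult to deploy on memory-constrained edge devices. This work introduces $\rm CALDERA$, a new post-training LLM compression algorithm that exploits the inherent low-rank structure of a weight matrix $\mathbf{W}$ by approximating it with a low-rank, low-precision decomposition $\mathbf{W} \approx \mathbf{Q} + \mathbf{L}\mathbf{R}$. Here, $\mathbf{L}$ and $\mathbf{R}$ are low-rank factors, and the entries of $\mathbf{Q}$, $\mathbf{L}$ and $\mathbf{R}$ are quantized. The model is compressed by replacing each layer with its $\mathbf{Q} + \mathbf{L}\mathbf{R}$ decomposition, and the zero-shot performance of the compressed model is evaluated. In addition, $\mathbf{L}$ and $\mathbf{R}$ lend themselves readily to low-rank adaptation, which improves zero-shot performance. $\rm CALDERA$ formulates this decomposition as the optimization problem $\min_{\mathbf{Q},\mathbf{L},\mathbf{R}}\lVert(\mathbf{Q} + \mathbf{L}\mathbf{R} - \mathbf{W})\mathbf{X}^\top\rVert_{\rm F}^2$, where $\mathbf{X}$ is the calibration data. Theoretical upper bounds on the approximation error of $\rm CALDERA$ are established using a rank-constrained regression framework, and the tradeoff between compression ratio and model performance is studied by analyzing the impact of the target rank and the quantization bit budget. The results show that LlaMa-$2$ $7$B/$70$B and LlaMa-$3$ $8$B models compressed with $\rm CALDERA$ outperform existing post-training LLM compression techniques in the regime of less than $2.5$ bits per parameter. The implementation is available at: https://github.com/pilancilab/caldera.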
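The "bits per parameter" figure depends on both the target rank and the bit budget of each factor. The short Python sketch below works through that arithmetic for a single layer. It assumes that the storage cost is just the quantized entries of $\mathbf{Q}$, $\mathbf{L}$ and $\mathbf{R}$, ignoring any overhead such as quantization scales or codebooks. The helper name `average_bits_per_parameter` and the layer dimensions are hypothetical, chosen only for illustration.

```python
def average_bits_per_parameter(n, d, rank, bits_q, bits_l, bits_r):
    """Bits stored for Q (n x d), L (n x rank), R (rank x d) per original weight.

    Assumption: only the quantized entries are counted, with no metadata overhead.
    """
    total_bits = n * d * bits_q + n * rank * bits_l + rank * d * bits_r
    return total_bits / (n * d)


# Hypothetical 4096 x 4096 layer, 2-bit Q, 4-bit factors.
print(average_bits_per_parameter(4096, 4096, rank=256, bits_q=2, bits_l=4, bits_r=4))  # 2.5
print(average_bits_per_parameter(4096, 4096, rank=128, bits_q=2, bits_l=4, bits_r=4))  # 2.25
```

Under these assumptions the low-rank term adds $k(n B_L + d B_R)/(nd)$ bits per weight on top of the bit width of $\mathbf{Q}$, where $k$ is the rank and $B_L$, $B_R$ are the factor bit widths. Halving the rank therefore halves that overhead.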


Related papers