Fugu-MT Paper Translation (Abstract): Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs

Paper overview: Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs

arxiv url: http://arxiv.org/abs/2509.11177v2
Date: Tue, 16 Sep 2025 03:17:50 GMT
Status: Translation complete
Last updated in system: 2025-09-17 11:35:27.003273
Title: Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs
Title (reference translation): Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs
Authors: Hang Guo, Yawei Li, Luca Benini
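Abstract summary: The paper explores an alternative solution that combines quantization and sparsity. This joint approach, though promising, introduces new difficulties because of inherently conflicting requirements on weight distributions. The authors propose Optimal Brain Restoration (OBR), a general, training-free framework that aligns pruning and quantization through error compensation between the two.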
Reference score (independently computed attention metric): 26.03750191778535
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Recent advances in Large Language Model (LLM) compression, such as quantization and pruning, have achieved notable success. However, as these techniques gradually approach their respective limits, relying on a single method for further compression has become increasingly challenging. In this work, we explore an alternative solution by combining quantization and sparsity. This joint approach, though promising, introduces new difficulties due to the inherently conflicting requirements on weight distributions: quantization favors compact ranges, while pruning benefits from high variance. To attack this problem, we propose Optimal Brain Restoration (OBR), a general and training-free framework that aligns pruning and quantization by error compensation between both. OBR minimizes performance degradation on downstream tasks by building on a second-order Hessian objective, which is then reformulated into a tractable problem through surrogate approximation and ultimately reaches a closed-form solution via group error compensation. Experiments show that OBR enables aggressive W4A4KV4 quantization with 50% sparsity on existing LLMs, and delivers up to 4.72x speedup and 6.4x memory reduction compared to the FP16-dense baseline.
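Abstract (reference translation): Recent advances in Large Language Model (LLM) compression, such as quantization and pruning, have achieved notable success. However, as these techniques gradually approach their respective limits, relying on a single method for further compression has become increasingly difficult. In this work, we explore an alternative solution that combines quantization and sparsity. This joint approach, though promising, introduces new difficulties because the two techniques place inherently conflicting requirements on weight distributions: quantization favors compact ranges, while pruning benefits from high variance. To address this problem, we propose Optimal Brain Restoration (OBR), a general, training-free framework that aligns pruning and quantization through error compensation between the two. OBR minimizes performance degradation on downstream tasks by building on a second-order Hessian objective, which is reformulated into a tractable problem through surrogate approximation and ultimately reaches a closed-form solution via group error compensation. Experiments show that OBR enables aggressive W4A4KV4 quantization with 50% sparsity on existing LLMs and delivers up to 4.72x speedup and 6.4x memory reduction compared with the FP16-dense baseline.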
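
The abstract mentions a second-order Hessian objective and a closed-form solution via group error compensation, but gives no formulas. As a rough illustration only, the sketch below shows the classical Optimal Brain Surgeon style compensation on a single toy weight row: after a group of weights is forced to zero, the remaining weights are shifted in closed form to minimize a quadratic loss. This is not the OBR algorithm itself. The helper names (`compensate`, `quantize_rtn`, `quad_loss`), the Hessian proxy built from random calibration activations, the damping term, and magnitude-based pruning are all assumptions made for the example.

```python
import numpy as np

def compensate(w, H, fixed_idx, fixed_delta):
    """Minimize 0.5 * d^T H d subject to d[fixed_idx] = fixed_delta (hypothetical helper)."""
    n = w.shape[0]
    free = np.setdiff1d(np.arange(n), fixed_idx)
    d = np.zeros(n)
    d[fixed_idx] = fixed_delta
    # Stationarity on the free block: H_ff d_f + H_fx d_x = 0
    rhs = H[np.ix_(free, fixed_idx)] @ fixed_delta
    d[free] = -np.linalg.solve(H[np.ix_(free, free)], rhs)
    return w + d

def quantize_rtn(w, bits=4):
    """Symmetric round-to-nearest quantization; an exact zero stays zero."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def quad_loss(w_ref, w_new, H):
    """Second-order proxy for the output error caused by changing the weights."""
    d = w_new - w_ref
    return 0.5 * d @ H @ d

rng = np.random.default_rng(0)
n = 16
X = rng.normal(size=(n, 256))                  # toy calibration activations (assumed)
H = X @ X.T / X.shape[1] + 1e-3 * np.eye(n)    # damped Hessian proxy (assumed)
w = rng.normal(size=n)

pruned = np.argsort(np.abs(w))[: n // 2]       # 50% sparsity by magnitude (assumed criterion)
w_naive = w.copy()
w_naive[pruned] = 0.0                          # prune without compensation
w_comp = compensate(w, H, pruned, -w[pruned])  # prune, then shift the kept weights
w_joint = quantize_rtn(w_comp, bits=4)         # 4-bit quantization of the sparse row

loss_naive = quad_loss(w, w_naive, H)
loss_comp = quad_loss(w, w_comp, H)
print(f"loss without compensation: {loss_naive:.4f}")
print(f"loss with compensation:    {loss_comp:.4f}")
```

Because leaving the kept weights unchanged is one feasible choice in the constrained problem, the compensated loss can never exceed the uncompensated one under this quadratic model. The quantization step here is plain round-to-nearest and its error is left uncompensated, which is exactly the kind of gap a joint method would need to close.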

Related papers