Fugu-MT Paper Translation (Abstract): Statistically-Lossless Quantization of Large Language Models

Paper abstract: Statistically-Lossless Quantization of Large Language Models

arxiv url: http://arxiv.org/abs/2605.02404v1
Date: Mon, 04 May 2026 09:46:47 GMT
Status: Translation complete
Last updated in system: 2026-05-05 20:33:50.226751
Title: Statistically-Lossless Quantization of Large Language Models
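Title (reference translation): Statistically Lossless Quantization of Large Language Models
Authors: Michael Helcig, Eldar Kurtic, Dan Alistarh
Abstract summary: This paper explores the middle ground of statistically-lossless compression through three complementary notions of losslessness for quantized LLMs. First, task-lossless compression preserves zero-shot benchmark accuracy within natural sampling variance and is achievable at aggressive bitwidths. Second, the paper formalizes the stricter notion of distribution-lossless compression, which requires the quantized model's next-token distribution to be practically indistinguishable from the original. Third, it proves a gamma-squared variance law showing that symmetric quantization inflates noise variance by gamma squared relative to asymmetric quantization.
Reference score (independently computed attention score): 41.38595517076645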
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Model quantization has become essential for efficient large language model deployment, yet existing approaches involve clear trade-offs: methods such as GPTQ and AWQ achieve practical compression but are lossy, while lossless techniques preserve fidelity but typically do not accelerate inference. This paper explores the middle ground of statistically-lossless compression through three complementary notions of losslessness for quantized LLMs. First, task-lossless compression preserves zero-shot benchmark accuracy within natural sampling variance and remains achievable at aggressive bitwidths. Second, we formalize the stricter notion of distribution-lossless compression, requiring the quantized model's next-token distribution to be practically indistinguishable from the original, and propose the Expected Acceptance Rate (EAR), the maximum token-agreement probability under optimal coupling, as a directly interpretable fidelity metric (for example, EAR >= 0.99 indicates 99% agreement). Third, we prove a gamma-squared variance law showing that symmetric quantization inflates noise variance by gamma squared relative to asymmetric quantization, making asymmetry necessary for distribution-lossless fidelity but not for task-level preservation. Using SLQ, a layer-wise non-uniform method with asymmetric quantization and wide bitwidth search, we achieve task-lossless compression at well below 4 bits per parameter (as low as 3.3 bits depending on the model), distribution-lossless compression at 5 to 6 bits per parameter on average, and inference speedups of 1.7 to 3.6x relative to FP16 with optimized kernels. Source code is available at https://github.com/IST-DASLab/SLQ.
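The abstract describes EAR as the maximum token-agreement probability under optimal coupling. For two discrete distributions p and q, the best achievable agreement probability under any coupling is a standard result: it equals the sum over tokens of min(p_i, q_i), which is one minus the total variation distance. The sketch below computes that quantity with NumPy. Averaging over context positions is an assumption about how the per-position values are aggregated, and the function names are hypothetical rather than taken from the SLQ repository.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last (vocabulary) axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def acceptance_rate(p, q):
    """Agreement probability under a maximal coupling of p and q.

    Equals sum_i min(p_i, q_i) = 1 - TV(p, q) for discrete distributions.
    """
    return np.minimum(p, q).sum(axis=-1)

def expected_acceptance_rate(logits_ref, logits_quant):
    """Mean acceptance rate over positions (averaging is an assumption).

    Both inputs are hypothetical arrays of shape (positions, vocab_size)
    holding next-token logits from the original and quantized models.
    """
    p = softmax(logits_ref)
    q = softmax(logits_quant)
    return float(acceptance_rate(p, q).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.normal(size=(8, 100))
    # Simulate a mild perturbation of the logits, standing in for quantization noise.
    quant = ref + 0.01 * rng.normal(size=ref.shape)
    print("EAR estimate:", expected_acceptance_rate(ref, quant))
```

Under this reading, a value of 0.99 means that a coupled sampler could make the two models emit the same next token 99% of the time, which matches the interpretation given in the abstract.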
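Abstract (reference translation): Quantization has become essential for deploying large language models efficiently, but existing approaches trade one property for another: methods such as GPTQ and AWQ achieve practical compression but are lossy, while lossless techniques preserve fidelity but typically do not accelerate inference. This paper explores the middle ground of statistically-lossless compression through three complementary notions of losslessness for quantized LLMs. First, task-lossless compression preserves zero-shot benchmark accuracy within natural sampling variance and is achievable at aggressive bitwidths. Second, the paper formalizes the stricter notion of distribution-lossless compression, which requires the quantized model's next-token distribution to be practically indistinguishable from the original, and proposes the Expected Acceptance Rate (EAR), the maximum token-agreement probability under optimal coupling, as a directly interpretable fidelity metric (for example, EAR >= 0.99 indicates 99% agreement). Third, it proves a gamma-squared variance law showing that symmetric quantization inflates noise variance by gamma squared relative to asymmetric quantization, so asymmetry is necessary for distribution-lossless fidelity but not for task-level preservation. Using SLQ, a layer-wise non-uniform method with asymmetric quantization and a wide bitwidth search, the authors achieve task-lossless compression at well below 4 bits per parameter (as low as 3.3 bits depending on the model), distribution-lossless compression at 5 to 6 bits per parameter on average, and inference speedups of 1.7 to 3.6x relative to FP16 with optimized kernels. Source code is available at https://github.com/IST-DASLab/SLQ.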
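The contrast between symmetric and asymmetric quantization in the abstract can be illustrated with a small numerical experiment. The sketch below does not reproduce the paper's gamma-squared law and does not define gamma. It only uses plain round-to-nearest uniform quantizers (an assumption, not the SLQ method) on a synthetic weight group that is not centered on zero, where a symmetric grid must cover a wider range and therefore uses a coarser step. All names are hypothetical.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Round-to-nearest on a zero-centered grid: levels in [-qmax, qmax]."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax          # assumes w is not all zeros
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

def quantize_asymmetric(w, bits):
    """Round-to-nearest on a grid spanning [min(w), max(w)] with an offset."""
    levels = 2 ** bits - 1
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / levels              # assumes w is not constant
    q = np.clip(np.round((w - lo) / scale), 0, levels)
    return q * scale + lo

def noise_variance(w, w_hat):
    """Variance of the quantization error w - w_hat."""
    return float(np.var(w - w_hat))

# Synthetic, off-center weight group (illustrative data, not from the paper).
rng = np.random.default_rng(0)
w = rng.normal(loc=0.5, scale=0.1, size=4096)

for bits in (3, 4, 5, 6):
    v_sym = noise_variance(w, quantize_symmetric(w, bits))
    v_asym = noise_variance(w, quantize_asymmetric(w, bits))
    print(f"{bits} bits: symmetric={v_sym:.2e} asymmetric={v_asym:.2e} "
          f"ratio={v_sym / v_asym:.2f}")
```

For weight groups that are already centered on zero the two grids nearly coincide, so the gap shown here depends on how skewed the group is.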

Related papers