Fugu-MT Paper Translation (Abstract): GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling

Paper overview: GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling

arxiv url: http://arxiv.org/abs/2604.18556v1
Date: Mon, 20 Apr 2026 17:45:47 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:53.027599
Title: GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling
Title (reference translation): GSQ: High-Accuracy Scalar Quantization of LLMs via Gumbel-Softmax Sampling
Authors: Alireza Dadgarnia, Soroush Tabesh, Mahdi Nikdan, Michael Helcig, Eldar Kurtic, Dan Alistarh
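Abstract summary: GSQ (Gumbel-Softmax Quantization) is a post-training scalar quantization method that jointly learns per-coordinate grid assignments and per-group scales. On the standard Llama-3.1-8B/70B-Instruct models, GSQ closes most of the gap between scalar quantization and the QTIP frontier at 2 and 3 bits. GSQ is also shown to scale to trillion-scale Mixture-of-Experts models such as Kimi-K2.5, where vector-quantized methods are difficult to apply.
Reference score (attention score computed by this site): 36.47926569464477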
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Weight quantization has become a standard tool for efficient LLM deployment, especially for local inference, where models are now routinely served at 2-3 bits per parameter. The state of the art is currently split into two sets of methods: simple scalar quantization techniques, such as GPTQ or AWQ, which are widely deployed but plateau in accuracy at 3-4 bits per parameter (bpp), and "second-generation" vector- or trellis-quantized methods, such as QTIP, GPTVQ and AQLM, which push the accuracy frontier at low bit-widths but are notoriously hard to implement and to scale, and have gained relatively less traction. In this paper, we ask whether this gap is fundamental, or whether a carefully optimized scalar quantizer can recover most of it. We answer in the affirmative, by introducing GSQ (Gumbel-Softmax Quantization), a post-training scalar quantization method which jointly learns the per-coordinate grid assignments and the per-group scales using a Gumbel-Softmax relaxation of the discrete grid. GSQ matches the cardinality of the relaxation to the small number of levels available in the target bit-width regime (e.g., 3-8 levels for ternary and 3 bpp, respectively), making the relaxation tight and the optimization tractable. Practically, on the standard Llama-3.1-8B/70B-Instruct models, GSQ closes most of the gap between scalar quantization and the QTIP frontier at 2 and 3 bits, while using a symmetric scalar grid with group-wise quantization, and thus fully compatible with existing scalar inference kernels. We further show that GSQ scales to trillion-scale Mixture-of-Experts models such as Kimi-K2.5, where vector-quantized methods are difficult to apply.
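The abstract describes a Gumbel-Softmax relaxation over a small set of grid levels, with a scale shared per group. The following is a minimal sketch of that general idea for a single weight, not the authors' implementation: the function names, the logits, the temperature `tau`, and the ternary grid values are all illustrative assumptions, and no training loop is shown.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Draw one relaxed one-hot sample over the grid levels."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1);
        # U is clamped away from 0 and 1 to keep the logs finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def relaxed_weight(logits, grid, scale, tau, rng):
    """Soft (differentiable in principle) weight: scale * E_probs[grid]."""
    probs = gumbel_softmax(logits, tau, rng)
    return scale * sum(p * g for p, g in zip(probs, grid))


def hard_weight(logits, grid, scale):
    """Discrete weight used after optimization: the argmax grid level."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return scale * grid[best]


rng = random.Random(0)
grid = [-1.0, 0.0, 1.0]    # assumed ternary grid (3 levels)
logits = [0.1, 0.3, 2.0]   # hypothetical learnable assignment logits
scale = 0.05               # hypothetical per-group scale

probs = gumbel_softmax(logits, tau=1.0, rng=rng)
soft = relaxed_weight(logits, grid, scale, tau=1.0, rng=rng)
hard = hard_weight(logits, grid, scale)
```

Because the relaxation has only as many categories as there are grid levels (3 here, 8 at 3 bpp), each weight carries a very small logit vector, which is the property the abstract credits for keeping the optimization tractable.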
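Abstract (reference translation): Weight quantization has become a standard tool for efficient LLM deployment, especially for local inference, where models are now routinely served at 2-3 bits per parameter. Simple scalar quantization methods such as GPTQ and AWQ are widely deployed but plateau in accuracy at 3-4 bits per parameter (bpp), whereas "second-generation" vector- or trellis-quantized methods such as QTIP, GPTVQ, and AQLM push the accuracy frontier at low bit-widths but are hard to implement and scale and have seen comparatively little adoption. This paper asks whether the gap is fundamental or whether a carefully optimized scalar quantizer can recover most of it. It introduces GSQ (Gumbel-Softmax Quantization), a post-training scalar quantization method that jointly learns per-coordinate grid assignments and per-group scales using a Gumbel-Softmax relaxation of the discrete grid. GSQ matches the cardinality of the relaxation to the small number of levels available in the target bit-width regime (e.g., 3 levels for ternary and 8 levels for 3 bpp), which makes the relaxation tight and the optimization tractable. On the standard Llama-3.1-8B/70B-Instruct models, GSQ closes most of the gap between scalar quantization and the QTIP frontier at 2 and 3 bits. The paper further shows that GSQ scales to trillion-scale Mixture-of-Experts models such as Kimi-K2.5, where vector-quantized methods are difficult to apply.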
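For context on what "per-group scales" means in a scalar quantizer, here is a hedged sketch of plain round-to-nearest group-wise quantization onto a symmetric grid. This is a simple baseline for illustration, not GSQ itself; the max-abs scale rule, the helper names, and the sample weights are assumptions that do not come from the paper.

```python
def quantize_groupwise(weights, grid, group_size):
    """Round-to-nearest onto scale * grid, with one scale per group."""
    indices, scales = [], []
    grid_max = max(abs(g) for g in grid)
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Assumed scale rule: map the largest magnitude onto the grid edge.
        abs_max = max(abs(w) for w in group)
        scale = abs_max / grid_max if abs_max > 0 else 1.0
        scales.append(scale)
        for w in group:
            # Pick the grid level whose scaled value is closest to w.
            nearest = min(range(len(grid)),
                          key=lambda k: abs(w - scale * grid[k]))
            indices.append(nearest)
    return indices, scales


def dequantize_groupwise(indices, scales, grid, group_size):
    """Rebuild approximate weights from level indices and group scales."""
    return [scales[i // group_size] * grid[k] for i, k in enumerate(indices)]


weights = [0.9, -0.1, 0.4, -1.0, 0.015, 0.03, -0.04, 0.01]
grid = [-1.0, 0.0, 1.0]  # assumed ternary grid; 3 bpp would use 2**3 = 8 levels
idx, scales = quantize_groupwise(weights, grid, group_size=4)
deq = dequantize_groupwise(idx, scales, grid, group_size=4)
```

The second group has much smaller magnitudes than the first, so giving it its own scale lets the same three levels cover it without collapsing every weight to zero. A method like the one described above would learn the indices and scales jointly instead of fixing them by rounding.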


Related papers