Fugu-MT Paper Translation (Summary): AMS-QUANT: Adaptive Mantissa Sharing for Floating-point Quantization

Paper summary: AMS-QUANT: Adaptive Mantissa Sharing for Floating-point Quantization

arxiv url: http://arxiv.org/abs/2510.16045v1
Date: Thu, 16 Oct 2025 15:37:23 GMT
Status: Translation complete
Last updated in system: 2025-10-25 00:56:38.799929
Title: AMS-QUANT: Adaptive Mantissa Sharing for Floating-point Quantization
Authors: Mengtao Lv, Ruiqi Zhu, Xinyu Wang, Yun Li
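Abstract summary: Quantization, particularly floating-point quantization, is known to speed up large language model (LLM) inference. We propose AMS-Quant, which advances floating-point quantization from integer bit-widths to non-integer bit-widths. We show that AMS-Quant can quantize models to FP-5.33-e2m3 and FP4.25-e2m2 and significantly speed up decoding over FP16 inference.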
Reference score (independently computed attention score): 7.413057271242686
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, but their billions or even trillions of parameters create storage and efficiency bottlenecks for inference. Quantization, particularly floating-point quantization, is known to speed up LLM inference by reducing memory footprint and data movement during inference. For the first time, we advance floating-point quantization from integer bit-widths to non-integer bit-widths with a method named AMS-Quant, to further approach the quantization sweet spot. AMS-Quant incorporates two novel techniques: (1) Mantissa-bit Sharing, which groups k quantized weights and lets them share the least significant mantissa bit, allowing us to further approach the minimum quantization bit-width without accuracy loss; and (2) Adaptive Searching, which employs an offline optimization strategy to minimize the accuracy degradation introduced by sharing. Moreover, AMS-Quant is prototyped as efficient CUDA Linear kernels, which translate memory savings into wall-clock latency reduction by reducing memory access. Extensive experiments on large-scale datasets and models show that AMS-Quant can quantize models to FP-5.33-e2m3 and FP4.25-e2m2 and significantly speed up LLM decoding over FP16 inference (2.8x and 3.2x), with negligible accuracy loss.
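
The non-integer bit-widths in the abstract can be reproduced with simple arithmetic. The following is a minimal sketch, not the authors' implementation: it assumes a format with 1 sign bit plus the stated exponent and mantissa bits, and it assumes group sizes k = 3 for e2m3 and k = 4 for e2m2, which are inferred from the numbers 5.33 and 4.25 rather than stated in the abstract. The helper `share_lsb` is a hypothetical toy that picks one shared least significant bit for a group of integer codes by trying both values; the paper's Adaptive Searching is an offline optimization whose details are not given here.

```python
from fractions import Fraction


def effective_bits(exp_bits, man_bits, k):
    """Average bits per weight when k weights share one least significant mantissa bit.

    Assumes 1 sign bit; k is an assumed group size, not taken from the paper.
    """
    full = 1 + exp_bits + man_bits          # sign + exponent + mantissa
    return Fraction((full - 1) * k + 1, k)  # k private parts plus one shared bit


def share_lsb(codes):
    """Toy sharing (hypothetical): force one common LSB on a group of integer codes.

    Tries both candidate bits and keeps the one with the smaller squared
    change in code units; ties keep bit 0.
    """
    best = None
    for bit in (0, 1):
        shared = [(c & ~1) | bit for c in codes]
        err = sum((s - c) ** 2 for s, c in zip(shared, codes))
        if best is None or err < best[0]:
            best = (err, bit, shared)
    return best[1], best[2]


print(effective_bits(2, 3, 3))         # e2m3 with k = 3 -> 16/3, about 5.33
print(float(effective_bits(2, 2, 4)))  # e2m2 with k = 4 -> 4.25
print(share_lsb([5, 7, 2]))            # -> (1, [5, 7, 3])
```

Under these assumptions, a 6-bit e2m3 value keeps 5 private bits and contributes one third of a shared bit, giving 16/3 bits per weight; a 5-bit e2m2 value keeps 4 private bits and one quarter of a shared bit, giving 4.25.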


Related papers