Fugu-MT Paper Translation (Summary): Minimizing the Hidden Cost of Scales: Graph-Guided Ultra-Low-Bit Quantization for Large Language Models

Paper summary: Minimizing the Hidden Cost of Scales: Graph-Guided Ultra-Low-Bit Quantization for Large Language Models

arxiv url: http://arxiv.org/abs/2606.05429v1
Date: Wed, 03 Jun 2026 20:51:52 GMT
Status: Translation complete
Last updated in system: 2026-06-05 22:39:44.398552
Title: Minimizing the Hidden Cost of Scales: Graph-Guided Ultra-Low-Bit Quantization for Large Language Models
Authors: Rayyan Abdalla, Amir Hussein, Min Wu, Dinesh Manocha
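Abstract summary: Post-training quantization (PTQ) is critical for the efficient deployment of large language models (LLMs). The paper proposes SAGE-PTQ, an ultra-low-bit quantization framework for LLMs that minimizes hidden scaling cost. On LLaMA-3-8B, SAGE-PTQ achieves a WikiText2 perplexity of 6.74, compared to 55.8 for BiLLM.
Reference score (independently computed attention score): 50.16014098038291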
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Post-training quantization (PTQ) is critical for the efficient deployment of large language models (LLMs). Recent ultra-low-bit PTQ methods rely on rigid weight-saliency assumptions or position heuristics, introducing substantial hidden scaling overhead. We propose SAGE-PTQ (Saliency-Aware Graph-guided Efficient PTQ), a novel ultra-low-bit quantization framework for LLMs that minimizes hidden scaling cost. SAGE-PTQ separates salient and unsalient weights using distributional statistics, then models subsampled unsalient weights as a sparse graph to estimate the optimal number of groups per layer. SAGE-PTQ applies dual-mode quantization, assigning multi-bit precision to salient weights and binarizing unsalient weights. To reduce scaling overhead, SAGE-PTQ uses one per-channel scale for salient weights and one scalar per unsalient group. Finally, SAGE-PTQ implements adaptive saliency thresholding to select the optimal saliency ratio per matrix. SAGE-PTQ achieves 1.03 weight bits and only 0.004 scaling bits per matrix on average, outperforming state-of-the-art methods such as BiLLM and PB-LLM. On LLaMA-3-8B, SAGE-PTQ achieves 6.74 WikiText2 perplexity, compared to 55.8 for BiLLM, while using less than 50% of BiLLM's GPU memory. On LLaMA-2-70B, SAGE-PTQ provides 1.5x faster decoding on one NVIDIA L40 GPU, demonstrating practical inference efficiency.
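The dual-mode idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the saliency rule (mean plus k standard deviations of the weight magnitudes), the function names, and the 16-bit scale width are all assumptions made for illustration, and the graph-based group estimation is not shown. The only standard result used is that, for sign-based binarization, the scale minimizing squared error is the mean absolute value of the group.

```python
from statistics import mean, pstdev


def split_by_saliency(weights, k=2.0):
    """Hypothetical saliency rule: |w| > mean(|w|) + k * std(|w|).

    Returns (salient_indices, unsalient_indices). The paper's actual
    distributional criterion is not specified in the abstract.
    """
    mags = [abs(w) for w in weights]
    threshold = mean(mags) + k * pstdev(mags)
    salient = [i for i, m in enumerate(mags) if m > threshold]
    unsalient = [i for i, m in enumerate(mags) if m <= threshold]
    return salient, unsalient


def binarize_group(values):
    """Binarize one group to {-alpha, +alpha} with a single scalar scale.

    alpha = mean(|w|) minimizes the squared error when codes are sign(w).
    Zeros are mapped to +1 by convention in this sketch.
    """
    alpha = mean(abs(v) for v in values)
    signs = [1 if v >= 0 else -1 for v in values]
    return alpha, signs


def scale_bits_per_weight(num_scales, num_weights, bits_per_scale=16):
    """Average storage overhead of the scales, in bits per weight.

    bits_per_scale=16 is an assumed storage format, not taken from the paper.
    """
    return num_scales * bits_per_scale / num_weights
```

As a purely hypothetical configuration, a 4096 x 4096 matrix with 4096 per-channel scales and 8 group scalars stored in 16 bits gives `scale_bits_per_weight(4096 + 8, 4096 * 4096)`, which is about 0.0039 bits per weight. This shows why storing one scalar per group, rather than one per small block, keeps the scaling overhead small.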
