Fugu-MT Paper Translation (Summary): RUQuant: Towards Refining Uniform Quantization for Large Language Models

Paper summary: RUQuant: Towards Refining Uniform Quantization for Large Language Models

arxiv url: http://arxiv.org/abs/2604.04013v1
Date: Sun, 05 Apr 2026 08:04:39 GMT
Status: Translation complete
System update date: 2026-04-07 15:49:18.873682
Title: RUQuant: Towards Refining Uniform Quantization for Large Language Models
Authors: Han Liu, Haotian Gao, Changya Li, Feng Zhang, Xiaotong Zhang, Wei Wang, Hong Yu
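Abstract summary: Post-training quantization (PTQ) has emerged as a practical solution by compressing models without requiring retraining. Existing methods often suffer from substantial accuracy degradation due to the non-uniform nature of activation distributions. In this work, we revisit the activation quantization problem from a theoretical perspective grounded in the Lloyd-Max optimality conditions.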
Reference score (independently computed attention score): 17.258420059228808
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The increasing size and complexity of large language models (LLMs) have raised significant challenges in deployment efficiency, particularly under resource constraints. Post-training quantization (PTQ) has emerged as a practical solution by compressing models without requiring retraining. While existing methods focus on uniform quantization schemes for both weights and activations, they often suffer from substantial accuracy degradation due to the non-uniform nature of activation distributions. In this work, we revisit the activation quantization problem from a theoretical perspective grounded in the Lloyd-Max optimality conditions. We identify the core issue as the non-uniform distribution of activations within the quantization interval, which causes the optimal quantization point under the Lloyd-Max criterion to shift away from the midpoint of the interval. To address this issue, we propose a two-stage orthogonal transformation method, RUQuant. In the first stage, activations are divided into blocks. Each block is mapped to uniformly sampled target vectors using composite orthogonal matrices, which are constructed from Householder reflections and Givens rotations. In the second stage, a global Householder reflection is fine-tuned to further minimize quantization error using Transformer output discrepancies. Empirical results show that our method achieves near-optimal quantization performance without requiring model fine-tuning: RUQuant achieves 99.8% of full-precision accuracy with W6A6 and 97% with W4A4 quantization for a 13B LLM, within approximately one minute. A fine-tuned variant yields even higher accuracy, demonstrating the effectiveness and scalability of our approach.
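The two ideas the abstract leans on can be illustrated with a small sketch. It is not the authors' implementation: the toy exponential "activations", the all-ones target vector, and the helper names `centroid_vs_midpoint` and `householder_map` are assumptions made for illustration, and only a single Householder reflection is used rather than the composite Householder/Givens matrices the abstract describes. The first part shows that, for a skewed distribution inside one quantization cell, the Lloyd-Max optimal level (the conditional mean) moves away from the cell midpoint. The second part shows that a Householder reflection can map a block with one dominant outlier onto a flat target vector of the same norm.

```python
import math
import random


def cell_mse(samples, level):
    """Mean squared error when every sample in a cell is rebuilt as `level`."""
    return sum((s - level) ** 2 for s in samples) / len(samples)


def centroid_vs_midpoint(samples, lo, hi):
    """Compare the Lloyd-Max centroid of the cell [lo, hi) with its midpoint."""
    cell = [s for s in samples if lo <= s < hi]
    centroid = sum(cell) / len(cell)  # Lloyd-Max: optimal level is the conditional mean
    midpoint = 0.5 * (lo + hi)        # level used by a uniform quantizer
    return centroid, midpoint, cell_mse(cell, centroid), cell_mse(cell, midpoint)


def householder_map(x, target):
    """Reflect x onto the direction of `target`, preserving the norm of x."""
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_t = math.sqrt(sum(b * b for b in target))
    t = [b * norm_x / norm_t for b in target]  # rescale so that ||t|| == ||x||
    v = [a - b for a, b in zip(x, t)]
    vv = sum(c * c for c in v)
    if vv == 0.0:  # x already equals the rescaled target
        return list(x)
    scale = 2.0 * sum(c * a for c, a in zip(v, x)) / vv
    # Apply H = I - 2 v v^T / (v^T v) to x without forming the matrix.
    return [a - scale * c for a, c in zip(x, v)]


rng = random.Random(0)
# Hypothetical skewed activations: exponential samples pile up near 0.
acts = [rng.expovariate(2.0) for _ in range(10_000)]
centroid, midpoint, mse_c, mse_m = centroid_vs_midpoint(acts, 0.0, 1.0)
print(f"centroid={centroid:.3f} midpoint={midpoint:.3f}")
print(f"mse(centroid)={mse_c:.4f} mse(midpoint)={mse_m:.4f}")

block = [4.0, 0.1, -0.2, 0.05]  # one outlier dominates the block
flat = householder_map(block, [1.0, 1.0, 1.0, 1.0])
print([round(a, 3) for a in flat])
```

Because the reflection is orthogonal, it changes how the block's energy is spread across coordinates without changing its norm, which is why such transforms can in principle be undone exactly on the weight side.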


Related papers