Fugu-MT Paper Translation (Abstract): Mixed-Precision Quantization for Language Models: Techniques and Prospects

Paper abstract: Mixed-Precision Quantization for Language Models: Techniques and Prospects

arxiv url: http://arxiv.org/abs/2510.16805v1
Date: Sun, 19 Oct 2025 12:16:40 GMT
Status: Translation complete
Last updated in system: 2025-10-25 00:56:39.155808
Title: Mixed-Precision Quantization for Language Models: Techniques and Prospects
Authors: Mariam Rakka, Marios Fournarakis, Olga Krestinskaya, Jinane Bazzi, Khaled N. Salama, Fadi Kurdahi, Ahmed M. Eltawil, Mohammed E. Fouda
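Abstract summary: Quantization has emerged as an essential compression technique to reduce model size, alleviate memory bottlenecks, and accelerate inference. Mixed-precision quantization offers a promising alternative by selectively allocating precision across layers or within tensors to balance efficiency and accuracy.
Reference score (independently computed attention metric): 10.345914140081925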
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The rapid scaling of language models (LMs) has resulted in unprecedented computational, memory, and energy requirements, making their training and deployment increasingly unsustainable. Quantization has emerged as an essential compression technique to reduce model size, alleviate memory bottlenecks, and accelerate inference. However, while uniform low-bit quantization (e.g., INT8, INT4) provides significant efficiency gains, it can degrade accuracy in sensitive components of transformer-based LMs. Mixed-precision quantization offers a promising alternative by selectively allocating precision across layers or within tensors to balance efficiency and accuracy. This survey provides a comprehensive overview of Mixed-Precision quantization frameworks for LMs (MXPLMs). We first review quantization fundamentals, including uniform and non-uniform quantizers, quantization granularity, and methods widely used in post-training quantization. We then categorize and compare recent MXPLM frameworks according to their bit allocation strategies and precision configurations across weights, activations, and key-value caches. A comparative analysis highlights differences in perplexity, zero-shot task performance, and deployment trade-offs. Furthermore, we contrast MXPLMs with earlier mixed-precision quantization methods for deep neural networks, identifying strategies that transfer and those that face challenges in the LM setting. Finally, we summarize open issues and future directions, including hardware-aware design, activation quantization, and scalable optimization methods for billion-parameter models. By consolidating recent advances, this work serves as a reference for understanding the current landscape and research prospects of mixed-precision quantization for large-scale language models.
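The core idea in the abstract, allocating different bit widths to different layers, can be illustrated with a minimal sketch. It is not taken from the survey: the layer names, the bit plan, and the random Gaussian "weights" are hypothetical, and the quantizer is a plain per-tensor symmetric uniform quantizer rather than any specific MXPLM framework.

```python
import random

def quantize_symmetric(values, bits):
    """Per-tensor symmetric uniform quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # guard against an all-zero tensor
    # Round to the integer grid, clamp to [-qmax, qmax], map back to real values.
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
# Hypothetical layers with synthetic weights (assumption: standard normal values).
layers = {
    name: [random.gauss(0.0, 1.0) for _ in range(4096)]
    for name in ("attn_proj", "mlp_up")
}
# Hypothetical mixed-precision plan: more bits for the layer assumed to be sensitive.
bit_plan = {"attn_proj": 8, "mlp_up": 4}

errors = {
    name: mse(w, quantize_symmetric(w, bit_plan[name]))
    for name, w in layers.items()
}
total_params = sum(len(w) for w in layers.values())
avg_bits = sum(bit_plan[n] * len(w) for n, w in layers.items()) / total_params

print("per-layer MSE:", errors)
print("average bits per weight:", avg_bits)
```

In this toy setup the 8-bit layer shows a much smaller reconstruction error than the 4-bit layer, while the average cost is 6 bits per weight. That is the efficiency and accuracy trade-off mixed-precision methods try to tune, although real frameworks choose the allocation from sensitivity measurements rather than by hand.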


Related papers