Fugu-MT Paper Translation (Summary): MUXQ: Mixed-to-Uniform Precision MatriX Quantization via Low-Rank Outlier Decomposition

Paper summary: MUXQ: Mixed-to-Uniform Precision MatriX Quantization via Low-Rank Outlier Decomposition

arxiv url: http://arxiv.org/abs/2604.04701v1
Date: Mon, 06 Apr 2026 14:13:47 GMT
Status: Translation complete
System update date: 2026-04-07 15:49:19.220901
Title: MUXQ: Mixed-to-Uniform Precision MatriX Quantization via Low-Rank Outlier Decomposition
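Title (reference translation): MUXQ: Mixed-to-Uniform Precision Matrix Quantization via Low-Rank Outlier Decomposition
Authors: Seoungsub Lee, In Seo Kim, Seon Wook Kim
Abstract summary: Large language models (LLMs) have achieved outstanding performance across a wide range of natural language processing tasks. Existing methods such as ZeroQuant, LLM.int8(), and SmoothQuant do not fully address input-activation outliers and the associated hardware inefficiencies. We propose MUXQ (Mixed-to-Uniform Quantization).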
Reference score (independently computed attention score): 0.196629787330046
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) have achieved outstanding performance across a wide range of natural language processing tasks, but their enormous parameter counts impose substantial memory and computational overheads. This challenge is particularly critical in NPU-based on-device environments, where FP16/FP32 computation is inefficient and integer (INT) quantization is therefore essential. However, existing methods, including ZeroQuant, LLM.int8(), and SmoothQuant, do not fully address input-activation outliers and the associated hardware inefficiencies. To overcome these limitations, we propose MUXQ (Mixed-to-Uniform Quantization). MUXQ detects outlier channels in input activations and introduces a small auxiliary matrix that redistributes outlier magnitudes across channels, thereby alleviating the outlier problem. This enables even activation outliers to be quantized at low-precision INT levels while preserving a hardware-friendly computation structure. Experiments on GPT-2 models at three scales (0.1B, 0.3B, and 0.7B parameters) using the WikiText-2 dataset show that MUXQ consistently achieves lower perplexity than naive quantization. In particular, under per-tensor quantization, MUXQ quantizes both activations and weights to INT8 while maintaining accuracy close to that of FP16. With only modest computational overhead, MUXQ enables stable low-precision inference and can be readily combined with other quantization techniques. These results suggest that MUXQ provides a promising direction for efficient and accurate LLM inference on edge devices.
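Abstract (reference translation): Large language models (LLMs) achieve outstanding performance across a wide range of natural language processing tasks, but their enormous parameter counts impose memory and computational overheads. This challenge is especially important in NPU-based on-device environments, where FP16/FP32 computation is inefficient and integer (INT) quantization is essential. However, existing methods such as ZeroQuant, LLM.int8(), and SmoothQuant do not fully address input-activation outliers and the associated hardware inefficiencies. To overcome these limitations, we propose MUXQ (Mixed-to-Uniform Quantization). MUXQ detects outlier channels in the input activations and introduces a small auxiliary matrix that redistributes outlier magnitudes across channels, alleviating the outlier problem. This makes it possible to quantize activation outliers at low-precision INT levels while keeping a hardware-friendly computation structure. Experiments on GPT-2 models at three scales (0.1B, 0.3B, and 0.7B parameters) using the WikiText-2 dataset show that MUXQ consistently achieves lower perplexity than naive quantization. In particular, under per-tensor quantization, MUXQ quantizes both activations and weights to INT8 while maintaining accuracy close to that of FP16. With only modest computational overhead, MUXQ enables stable low-precision inference and can be easily combined with other quantization techniques. These results suggest that MUXQ is a promising direction for efficient and accurate LLM inference on edge devices.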
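
Note: The outlier problem mentioned in the abstract can be illustrated with a small sketch. The code below is not the MUXQ algorithm (the abstract does not give its details); it only shows, under assumed settings, how naive symmetric per-tensor INT8 quantization loses accuracy on ordinary channels once a single channel has large magnitudes. The tensor shape, the outlier channel index, the factor of 50, and all function names are hypothetical choices made for illustration.

```python
import numpy as np


def quantize_per_tensor_int8(x):
    """Symmetric per-tensor INT8 quantization: one scale for the whole tensor."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Map INT8 codes back to floating point."""
    return q.astype(np.float32) * scale


def mean_abs_error(x, channels):
    """Quantize the full tensor, then report the error on the given channels only."""
    q, scale = quantize_per_tensor_int8(x)
    x_hat = dequantize(q, scale)
    return float(np.mean(np.abs(x[:, channels] - x_hat[:, channels])))


# Synthetic activations: 64 tokens x 128 channels (assumed sizes).
rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, size=(64, 128)).astype(np.float32)

# Same activations, but with one hypothetical outlier channel scaled up.
outlier_channel = 5
acts_outlier = acts.copy()
acts_outlier[:, outlier_channel] *= 50.0

# Compare the error on the ordinary (non-outlier) channels in both cases.
normal_channels = [c for c in range(acts.shape[1]) if c != outlier_channel]
err_clean = mean_abs_error(acts, normal_channels)
err_with_outlier = mean_abs_error(acts_outlier, normal_channels)

print("error without outlier channel:", err_clean)
print("error with outlier channel:   ", err_with_outlier)
```

Because the single scale is set by the largest magnitude in the tensor, the outlier channel stretches the quantization step for every other channel. This is the effect that outlier-aware methods such as the one described in the abstract aim to reduce.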

Related papers