Fugu-MT Paper Translation (Abstract): SERQ: Saliency-Aware Low-Rank Error Reconstruction for LLM Quantization

Paper summary: SERQ: Saliency-Aware Low-Rank Error Reconstruction for LLM Quantization

arxiv url: http://arxiv.org/abs/2603.08185v1
Date: Mon, 09 Mar 2026 10:04:12 GMT
Status: Translation complete
System update date: 2026-03-10 15:13:15.782936
Title: SERQ: Saliency-Aware Low-Rank Error Reconstruction for LLM Quantization
Authors: Yeonsik Park, Hyeonseong Kim, Seungkyu Choi
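Abstract summary: Post-training quantization (PTQ) has emerged as a prevailing technique for deploying large language models efficiently. SERQ is a saliency-aware error reconstruction method for low-bit LLM inference that employs a single low-rank compensation matrix.
Reference score (independently computed attention score): 7.372706701787234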
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Post-training quantization (PTQ) has emerged as a prevailing technique for deploying large language models (LLMs) efficiently in terms of both memory and computation, across edge devices and server platforms. Existing PTQ methods primarily aim to reduce precision in weights and activations by mitigating quantization errors caused by channel-wise outlier activations (e.g., pre-quantization scaling, online transformations, or low-rank error reconstruction). Among these approaches, error reconstruction with low-rank adaptation (LoRA) has proven particularly effective, as it introduces a lightweight auxiliary computation path without requiring heavy optimization or additional online layers. However, prior studies reveal severe accuracy degradation under W4A4 settings, and conventional low-rank adaptations rely on two sequential factors, necessitating intermediate quantization during inference and thereby limiting low-precision efficiency. In this work, we propose SERQ, a saliency-aware error reconstruction method for low-bit LLM inference that employs a single low-rank compensation matrix. SERQ preserves efficient 4-bit matrix multiplication in linear layers by jointly mitigating quantization errors arising from both activation and weight saliency through three stages: (1) static activation flattening, (2) saliency-aware error reconstruction, and (3) offline weight permutation. The method incurs additional computation only for low-rank error reconstruction via a single decomposition, while all other operations are performed offline, thereby keeping latency overhead minimal. Empirically, SERQ outperforms prior error reconstruction methods under both W4A8 and W4A4 settings, and achieves higher accuracy than state-of-the-art rotation-based W4A4 approaches, while substantially reducing calibration complexity.
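To make the idea of low-rank error reconstruction concrete, the following is a minimal NumPy sketch of the conventional two-factor baseline that the abstract contrasts SERQ against. It is not SERQ itself: the abstract does not specify how the single compensation matrix, the activation flattening, or the weight permutation are constructed. The function names, the per-row symmetric quantizer, the rank of 8, and the synthetic "salient" columns are all assumptions chosen for illustration, and everything runs in floating point rather than with real 4-bit kernels.

```python
import numpy as np

def quantize_symmetric(w, bits=4):
    """Per-row symmetric uniform quantization, returned in dequantized form.

    Hypothetical helper for illustration; not the quantizer used in the paper.
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def low_rank_error_factors(w, w_q, rank):
    """Truncated SVD of the quantization error E = W - Q(W).

    Returns factors A (out x rank) and B (rank x in) with A @ B ~= E.
    """
    u, s, vt = np.linalg.svd(w - w_q, full_matrices=False)
    a = u[:, :rank] * s[:rank]
    b = vt[:rank, :]
    return a, b

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 128))
w[:, :4] *= 20.0  # a few large-magnitude input channels (assumed stand-in for salient columns)
x = rng.normal(size=(128, 8))

w_q = quantize_symmetric(w)
a, b = low_rank_error_factors(w, w_q, rank=8)

y_ref = w @ x
# Output error with the quantized weights alone.
err_plain = np.linalg.norm(y_ref - w_q @ x)
# Output error with the auxiliary low-rank path added: Q(W) x + A (B x).
err_comp = np.linalg.norm(y_ref - (w_q @ x + a @ (b @ x)))

print(f"output error without compensation: {err_plain:.3f}")
print(f"output error with rank-8 compensation: {err_comp:.3f}")
```

In this baseline the auxiliary path computes `b @ x` first and then multiplies by `a`, so a low-precision implementation would have to quantize the intermediate result between the two factors. That is the inefficiency the abstract attributes to conventional low-rank adaptation and that SERQ is reported to avoid by using a single compensation matrix.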


Related papers