Fugu-MT Paper Translation (Abstract): EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation

Paper abstract: EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation

arxiv url: http://arxiv.org/abs/2605.04062v1
Date: Fri, 10 Apr 2026 15:49:44 GMT
Status: Translation complete
Last updated in system: 2026-05-11 06:56:26.552521
Title: EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation
Title (reference translation): EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation
Authors: Shu-Hao Zhang, Le-Tong Huang, Xiang-Sheng Deng, Xin-Yi Zou, Chen Wu, Nan Li, Shao-Qun Zhang
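Abstract summary: Quantization has emerged as a promising lightweight technique that converts full-precision model weights and activations into lower-bit formats. We propose EdgeRazor, a lightweight framework for LLMs with mixed-precision and extremely low-bit weight quantization.
Reference score (independently computed attention score): 11.001227228468572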
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent years have witnessed increasing interest in deploying LLMs on resource-constrained devices, for which quantization has emerged as a promising lightweight technique that converts full-precision model weights and activations into lower-bit formats. Existing weight quantization approaches can be roughly divided into three categories: Post-Training Quantization (PTQ), which calibrates quantized parameters on a small dataset without retraining but suffers from severe performance degradation below 4-bit; Quantization-Aware Training (QAT), which searches for low-bit parameters using surrogate gradients but demands substantial computational resources; and Quantization-Aware Distillation, which integrates QAT with knowledge transfer from a full-precision teacher but manually selects features to distill and relies heavily on teacher-specific data. In this paper, we propose EdgeRazor, a lightweight framework for LLMs with mixed-precision and extremely low-bit weight quantization. The EdgeRazor framework contains three modules: Mixed-Precision Quantization-Aware Distillation for fine-grained control of precision, Adaptive Feature Distillation that derives an $n$-bit student from its 16-bit teacher, and Entropy-Aware KL Divergence on both human-annotated and distilled datasets, whose forward-reverse balance is determined solely by the teacher's output distribution. Empirical investigations of EdgeRazor are conducted on base, instruction-tuned, and multimodal LLMs. Notably, EdgeRazor at 1.88 bits surpasses all contenders at 3-bit precision and, in particular, outperforms the leading 2-bit PTQ methods by 11.3 points, with a 4-10$\times$ lower training budget than the leading QAT approach. EdgeRazor delivers higher compression ratios at all bit widths; the 1.58-bit Qwen3-0.6B reduces storage from 1.41 GB to 0.28 GB while accelerating decoding by 15.1$\times$ relative to the 16-bit baseline.
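The abstract states only that the forward-reverse balance of the Entropy-Aware KL Divergence is determined by the teacher's output distribution; it does not give the formula. The following is a minimal sketch of one way such a balance could be set, assuming a weight equal to the teacher's normalized entropy. The function name `blended_kl`, the weighting rule, and the choice of which direction gets the larger weight are all illustrative assumptions, not the paper's definition.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def entropy(p, eps=1e-12):
    """Shannon entropy of p in nats."""
    return -sum(pi * math.log(pi + eps) for pi in p)


def blended_kl(teacher_logits, student_logits):
    """Blend forward and reverse KL using only the teacher's distribution.

    ASSUMPTION (hypothetical rule, not taken from the paper):
    lam = H(teacher) / log(vocab_size), so a diffuse teacher leans on
    forward KL (mass-covering) and a peaked teacher leans on reverse KL
    (mode-seeking).
    """
    p = softmax(teacher_logits)  # teacher distribution
    q = softmax(student_logits)  # student distribution
    lam = entropy(p) / math.log(len(p))
    lam = min(1.0, max(0.0, lam))  # guard against rounding drift
    forward = kl(p, q)  # KL(teacher || student)
    reverse = kl(q, p)  # KL(student || teacher)
    return lam * forward + (1.0 - lam) * reverse, lam


if __name__ == "__main__":
    loss, lam = blended_kl([10.0, 0.0, 0.0], [0.0, 0.0, 0.0])
    print(f"lambda={lam:.4f}, loss={loss:.4f}")
```

Because the weight depends on the teacher alone, it can be computed once per token and treated as a constant during backpropagation through the student; whether EdgeRazor does this is not stated in the abstract.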
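Abstract (reference translation): In recent years, interest in deploying LLMs on resource-constrained devices has grown, and quantization has emerged as a promising lightweight technique that converts full-precision model weights and activations into lower-bit formats. Existing weight quantization approaches fall into three categories: Post-Training Quantization (PTQ), which calibrates quantized parameters on a small dataset without retraining but suffers severe performance degradation below 4 bits; Quantization-Aware Training (QAT), which searches for low-bit parameters using surrogate gradients but requires substantial computational resources; and Quantization-Aware Distillation, which combines QAT with knowledge transfer from a full-precision teacher but selects the features to distill manually and depends heavily on teacher-specific data. This paper proposes EdgeRazor, a lightweight framework for LLMs with mixed-precision and extremely low-bit weight quantization. The EdgeRazor framework contains three modules: Mixed-Precision Quantization-Aware Distillation for fine-grained control of precision, Adaptive Feature Distillation that derives an $n$-bit student from its 16-bit teacher, and Entropy-Aware KL Divergence on both human-annotated and distilled datasets, whose forward-reverse balance is determined solely by the teacher's output distribution. EdgeRazor is evaluated empirically on base, instruction-tuned, and multimodal LLMs. Notably, EdgeRazor at 1.88 bits surpasses all competitors at 3-bit precision and, in particular, outperforms the leading 2-bit PTQ methods by 11.3 points, with a 4-10$\times$ lower training budget than the leading QAT approach. The 1.58-bit Qwen3-0.6B reduces storage from 1.41 GB to 0.28 GB while accelerating decoding by 15.1$\times$ relative to the 16-bit baseline.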
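The fractional bit-widths in the abstract can be made concrete with a little arithmetic. The sketch below rests on three assumptions that the abstract does not confirm: that "1.58-bit" refers to ternary weights (three levels carry log2(3) ≈ 1.585 bits), that ternary codes are obtained with absmean scaling (a technique known from other ternary LLM work, used here only for illustration), and that a mixed-precision bit-width is a weighted average over weight groups. The helper names and the 50/50 split are hypothetical.

```python
import math

# Information content of one ternary weight in {-1, 0, +1}.
TERNARY_BITS = math.log2(3)  # about 1.585


def ternary_quantize(weights):
    """Map floats to codes in {-1, 0, +1} plus one per-tensor scale.

    ASSUMPTION: absmean scaling, chosen for illustration only; the
    abstract does not describe EdgeRazor's quantizer.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        return [0] * len(weights), 0.0
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def average_bits(groups):
    """Weighted average bit-width for [(fraction_of_weights, bits), ...]."""
    total = sum(frac for frac, _ in groups)
    if total <= 0:
        raise ValueError("fractions must sum to a positive value")
    return sum(frac * bits for frac, bits in groups) / total


if __name__ == "__main__":
    codes, scale = ternary_quantize([0.9, -0.05, -1.2, 0.4])
    print(codes)  # [1, 0, -1, 1]

    # Hypothetical layout: half the weights ternary, half at 2 bits.
    print(average_bits([(0.5, TERNARY_BITS), (0.5, 2.0)]))

    # Storage figures quoted in the abstract for Qwen3-0.6B.
    print(1.41 / 0.28)          # reported compression ratio, about 5.0x
    print(16.0 / TERNARY_BITS)  # ideal ratio if every weight were ternary
```

The reported ratio of about 5.0x is below the ideal of roughly 10.1x for a fully ternary model, which suggests that some tensors or metadata are stored at higher precision; the abstract does not say which.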


Related papers