Fugu-MT paper translation (abstract): LFQ: Logit-aware Final-block Quantization for Boosting the Generation Quality of Low-Bit Quantized LLMs

Paper overview: LFQ: Logit-aware Final-block Quantization for Boosting the Generation Quality of Low-Bit Quantized LLMs

arxiv url: http://arxiv.org/abs/2605.29756v1
Date: Thu, 28 May 2026 11:02:23 GMT
Status: Translation complete
Last updated in system: 2026-05-30 02:45:56.191666
Title: LFQ: Logit-aware Final-block Quantization for Boosting the Generation Quality of Low-Bit Quantized LLMs
Title (reference translation): LFQ: Logit-aware final-block quantization for improving the generation quality of low-bit quantized LLMs
Authors: Jung Hyun Lee, June Yong Yang, Jungwook Choi, Eunho Yang
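Abstract summary: We introduce Logit-aware Final-block Quantization (LFQ), a simple yet effective enhancement to block-wise PTQ. LFQ consistently improves the accuracy of complex generation tasks over state-of-the-art block-wise PTQ.
Reference score (independently computed attention score): 52.1276403258812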
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As large language models continue to scale, low-bit weight-only post-training quantization (PTQ) offers a practical solution to their memory-efficient deployment. Although block-wise PTQ is capable of matching the full-precision (FP) baseline on basic language modeling and understanding, its quality is degraded for generative tasks -- especially at longer responses and extended chains of thought, which is critical in boosting task accuracy. We attribute this shortfall to two factors: (i) the omission of the unembedding layer (the LM head) in block-wise optimization and (ii) the reliance on the mean squared error (MSE) objective. Both factors cause the token probability distribution of the quantized model to misalign with that of the FP model, yielding notable accuracy drops on text generation benchmarks. To rectify the discrepancy, we introduce Logit-aware Final-block Quantization (LFQ), a simple yet effective enhancement to block-wise PTQ that quantizes the final Transformer block by minimizing the cross-entropy between the logits of the FP model and those of its quantized counterpart. By aligning token probabilities at the logit level in the final block, LFQ consistently improves the accuracy of complex generation tasks over state-of-the-art block-wise PTQ across diverse model families, while maintaining parity with FP baselines on language modeling and understanding.
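Abstract (reference translation): As large language models continue to scale, low-bit weight-only post-training quantization (PTQ) offers a practical solution for memory-efficient deployment. Block-wise PTQ can match the full-precision (FP) baseline on basic language modeling and understanding, but its quality degrades on generative tasks, especially for longer responses and extended chains of thought, which are critical for boosting task accuracy. This shortfall is attributed to two factors: (i) the omission of the unembedding layer (the LM head) in block-wise optimization, and (ii) the reliance on the mean squared error (MSE) objective. Both factors cause the token probability distribution of the quantized model to misalign with that of the FP model, yielding notable accuracy drops on text generation benchmarks. To correct this discrepancy, the paper introduces Logit-aware Final-block Quantization (LFQ), a simple yet effective enhancement to block-wise PTQ that quantizes the final Transformer block by minimizing the cross-entropy between the logits of the FP model and those of its quantized counterpart. By aligning token probabilities at the logit level in the final block, LFQ consistently improves accuracy on complex generation tasks over state-of-the-art block-wise PTQ across diverse model families, while maintaining parity with FP baselines on language modeling and understanding.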
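
The abstract's second factor, the reliance on an MSE objective, can be illustrated with a toy calculation. The sketch below is not the authors' implementation: the logit values and the helper names (`softmax`, `log_softmax`, `mse`, `cross_entropy`) are hypothetical and chosen only for illustration. It compares two perturbed logit vectors that have the same MSE against the FP logits but differ in their cross-entropy against the FP token distribution.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cross_entropy(fp_logits, q_logits):
    """Cross-entropy with the FP token distribution as the target."""
    p = softmax(fp_logits)
    log_q = log_softmax(q_logits)
    return -sum(pi * lqi for pi, lqi in zip(p, log_q))


# Hypothetical FP logits over a 4-token vocabulary.
fp_logits = [2.0, 1.5, -1.0, -3.0]

# Two hypothetical "quantized" logit vectors with the same squared error:
# q_a perturbs the two most likely tokens (and swaps their order),
# q_b perturbs the two least likely tokens.
q_a = [1.5, 2.0, -1.0, -3.0]
q_b = [2.0, 1.5, -1.5, -2.5]

mse_a, mse_b = mse(fp_logits, q_a), mse(fp_logits, q_b)
ce_a, ce_b = cross_entropy(fp_logits, q_a), cross_entropy(fp_logits, q_b)

print(f"MSE: q_a={mse_a:.4f}  q_b={mse_b:.4f}")
print(f"CE : q_a={ce_a:.4f}  q_b={ce_b:.4f}")
```

In this toy setting MSE cannot tell the two perturbations apart, while cross-entropy penalizes `q_a` more because it shifts probability mass among the likely tokens and changes the argmax. How LFQ applies this objective to the final Transformer block and the LM head is described in the paper itself, not in this sketch.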

Related papers