Fugu-MT Paper Translation (Abstract): SPEAR: A System for Post-Quantization Error-Adaptive Recovery Enabling Efficient Low-Bit LLM Serving

Paper overview: SPEAR: A System for Post-Quantization Error-Adaptive Recovery Enabling Efficient Low-Bit LLM Serving

arxiv url: http://arxiv.org/abs/2606.11244v1
Date: Thu, 04 Jun 2026 22:38:53 GMT
Status: Translation complete
Last updated in system: 2026-06-11 16:42:38.077296
Title: SPEAR: A System for Post-Quantization Error-Adaptive Recovery Enabling Efficient Low-Bit LLM Serving
Title (reference translation): SPEAR: A Post-Quantization Error-Adaptive Recovery System Enabling Efficient Low-Bit LLM Serving
Authors: Hongyuan Liu, Yawei Li, Zhiqiang Que, Qinli Yang, Junming Shao, Guosheng Hu
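Abstract summary: This paper proposes SPEAR, a post-quantization error-adaptive recovery system that improves low-bit LLM serving. SPEAR introduces lightweight Error Compensators (ECs) modulated by per-token gates and places them only at the most error-sensitive layers, identified by a CKA-guided entropy-aware diagnostic. The authors show that SPEAR recovers 56-75% of the perplexity gap between W4 and FP16 while adding less than 1% model memory overhead and keeping latency comparable to a widely used 4-bit serving deployment.
Reference score (independently computed attention score): 26.96887030437247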
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Efficient large language model (LLM) serving is increasingly constrained by deployment cost. Quantization is a key technique for reducing serving cost, yet even state-of-the-art 4-bit quantizers exhibit a noticeable quality gap from FP16, particularly for smaller models where low-bit serving is most beneficial. We identify a fundamental cause of this gap: quantization error is highly input-dependent and varies substantially across tokens, while existing post-quantization compensation methods are static and apply identical corrections to all inputs. As a result, easy tokens are over-corrected while hard tokens remain under-corrected. We present SPEAR, a system for post-quantization error-adaptive recovery that improves low-bit LLM serving. SPEAR introduces lightweight Error Compensators (ECs) modulated by per-token gates and places them only at the most error-sensitive layers identified through a CKA-guided entropy-aware diagnostic. This focuses a small parameter budget where it is most effective. Efficient deployment of ECs presents several systems challenges, including additional computation, tensor-parallel synchronization caused by input-dependent gating, and latency instability across configurations. SPEAR addresses these issues through adaptive kernel-fusion dispatch, combining an epilogue-integrated peer-reduction kernel with P2P dual-write to fuse the post-EC computation into low-bit GEMMs, and an SLO-constrained EC-aware scheduler for predictable serving performance. Across challenging per-channel quantization settings, SPEAR recovers 56-75% of the perplexity gap between W4 and FP16 while adding less than 1% model memory overhead and maintaining latency comparable to a widely used 4-bit serving deployment.
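The idea of a per-token gated compensator can be sketched in a few lines of NumPy. The abstract does not specify the EC architecture, so everything below is an assumption made for illustration: the low-rank form of the correction, the sigmoid gate, the SVD-based initialization, and all names (`quantize_per_channel`, `gated_ec_forward`, `w_gate`, `b_gate`). The gate here is random and untrained; it only shows how an input-dependent scalar can scale the correction token by token, in contrast to a static correction applied identically to all inputs.

```python
import numpy as np


def quantize_per_channel(w, bits=4):
    """Symmetric round-to-nearest quantization with one scale per output row."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for 4-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)         # guard all-zero rows
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


def gated_ec_forward(x, w_q, a, b, w_gate, b_gate):
    """Quantized linear layer plus a gated low-rank correction (hypothetical form).

    x: (tokens, d_in), w_q: (d_out, d_in), a: (r, d_in), b: (d_out, r),
    w_gate: (d_in,), b_gate: scalar. Returns (tokens, d_out).
    """
    base = x @ w_q.T                                  # stands in for the low-bit GEMM
    correction = (x @ a.T) @ b.T                      # rank-r estimate of the lost signal
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate + b_gate)))  # one value in (0, 1) per token
    return base + gate[:, None] * correction


rng = np.random.default_rng(0)
tokens, d_in, d_out, rank = 8, 32, 16, 4

w = rng.normal(size=(d_out, d_in))
w_q = quantize_per_channel(w, bits=4)

# Assumed initialization: best rank-r approximation of the weight residual.
u, s, vt = np.linalg.svd(w - w_q, full_matrices=False)
b = u[:, :rank] * s[:rank]                            # (d_out, r)
a = vt[:rank]                                         # (r, d_in)

# Untrained, random gate parameters (placeholders for learned ones).
w_gate = rng.normal(size=d_in) * 0.1
b_gate = 0.0

x = rng.normal(size=(tokens, d_in))
y_ref = x @ w.T                                       # full-precision reference
y_base = x @ w_q.T                                    # quantized, no compensation
y_ec = gated_ec_forward(x, w_q, a, b, w_gate, b_gate)

print("per-token error, W4 only :", np.linalg.norm(y_ref - y_base, axis=1).round(3))
print("per-token error, W4 + EC :", np.linalg.norm(y_ref - y_ec, axis=1).round(3))
```

With this particular initialization, any gate value between 0 and 1 can only shrink the residual components along the top singular directions, so the compensated error never exceeds the uncompensated one for any token. A trained gate would additionally decide how much correction each token receives, which is the behaviour the abstract attributes to SPEAR.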
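Abstract (reference translation): Serving large language models (LLMs) efficiently is increasingly limited by deployment cost. Quantization is a key technique for reducing serving cost, but even state-of-the-art 4-bit quantizers show a noticeable quality gap from FP16, especially on smaller models, where low-bit serving is most useful. Quantization error is highly input-dependent and varies widely across tokens, whereas existing post-quantization compensation methods are static and apply the same correction to every input. As a result, easy tokens are over-corrected while hard tokens remain under-corrected. This paper proposes SPEAR, a post-quantization error-adaptive recovery system that improves low-bit LLM serving. SPEAR introduces lightweight Error Compensators (ECs) modulated by per-token gates and places them only at the most error-sensitive layers, identified by a CKA-guided entropy-aware diagnostic. This concentrates a small parameter budget where it is most effective. Deploying ECs efficiently raises several systems challenges, including extra computation, tensor-parallel synchronization caused by input-dependent gating, and latency instability across configurations. SPEAR addresses these through adaptive kernel-fusion dispatch, combining an epilogue-integrated peer-reduction kernel with P2P dual-write to fuse the post-EC computation into low-bit GEMMs, and an SLO-constrained EC-aware scheduler for predictable serving performance. Across challenging per-channel quantization settings, SPEAR recovers 56-75% of the perplexity gap between W4 and FP16 while adding less than 1% model memory overhead and keeping latency comparable to a widely used 4-bit serving deployment.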
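The "56-75% of the perplexity gap" figure can be read as a ratio of perplexity differences. The abstract does not state the exact formula, so the definition below is an assumption (the most natural reading), and the function name `gap_recovered` and the perplexity values are made up purely to illustrate the arithmetic.

```python
def gap_recovered(ppl_fp16, ppl_w4, ppl_recovered):
    """Fraction of the W4-to-FP16 perplexity gap closed by a recovery method.

    Assumed definition: (PPL_W4 - PPL_recovered) / (PPL_W4 - PPL_FP16).
    0.0 means no improvement over plain W4; 1.0 means FP16 quality is matched.
    """
    gap = ppl_w4 - ppl_fp16
    if gap <= 0:
        raise ValueError("expected the W4 perplexity to exceed the FP16 perplexity")
    return (ppl_w4 - ppl_recovered) / gap


# Illustrative numbers only, not results from the paper.
example = gap_recovered(ppl_fp16=10.0, ppl_w4=12.0, ppl_recovered=10.8)
print(f"gap recovered: {example:.0%}")
```

Under this reading, a model whose perplexity drops from 12.0 to 10.8 against an FP16 baseline of 10.0 has closed about 60% of the gap, which would fall inside the reported range.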

