Fugu-MT Paper Translation (Summary): BWLA: Breaking the Barrier of W1AX Post-Training Quantization for LLMs

Paper summary: BWLA: Breaking the Barrier of W1AX Post-Training Quantization for LLMs

arxiv url: http://arxiv.org/abs/2605.00422v1
Date: Fri, 01 May 2026 05:42:57 GMT
Status: Translation complete
Last updated in system: 2026-05-04 17:43:28.860843
Title: BWLA: Breaking the Barrier of W1AX Post-Training Quantization for LLMs
Title (reference translation): BWLA: Breaking the Barrier of W1AX Post-Training Quantization for LLMs
Authors: Zhixiong Zhao, Zukang Xu, Dawei Yang,
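Abstract summary: Large language models (LLMs) have driven major progress in NLP, yet their substantial memory and compute demands still hinder practical deployment. We propose BWLA, a post-training quantization framework that preserves high accuracy while achieving 1-bit weight quantization. On Qwen3-32B, BWLA reaches a Wikitext2 perplexity of 11.92 under 6-bit activations, improves five zero-shot tasks by more than 70%, and delivers a 3.26x inference speedup.
Reference score (independently computed attention metric): 10.07268309735318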
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) have driven major progress in NLP, yet their substantial memory and compute demands still hinder practical deployment. Binarization can compress weights to 1 bit, fundamentally lowering compute and bandwidth cost. However, existing methods cannot address activation heavy tails and thus must keep activations in high precision, preventing true end-to-end acceleration. To overcome this limitation, we propose BWLA (Binarized Weights and Low-bit Activations), the first post-training quantization framework that preserves high accuracy while achieving 1-bit weight quantization together with low-bit activations (e.g., 6 bits). The Orthogonal-Kronecker Transformation (OKT) learns an orthogonal mapping via EM minimization, converting unimodal weights into symmetric bimodal forms while suppressing activation tails and incoherence. The Proximal SVD Projection (PSP) then performs lightweight low-rank refinement through proximal SVD projection, further enhancing quantizability with minimal overhead. On Qwen3-32B, BWLA reaches a Wikitext2 perplexity of 11.92 under 6-bit activations (vs. 38 from SOTA), improves five zero-shot tasks by more than 70%, and delivers 3.26 times inference speedup, demonstrating strong potential for real-world LLM compression and acceleration.
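Abstract (reference translation): Large language models (LLMs) have driven major progress in NLP, yet their substantial memory and compute demands still hinder practical deployment. Binarization can compress weights to 1 bit, fundamentally lowering compute and bandwidth cost. However, existing methods cannot handle heavy-tailed activations and therefore must keep activations in high precision, which prevents true end-to-end acceleration. To overcome this limitation, we propose BWLA (Binarized Weights and Low-bit Activations). The Orthogonal-Kronecker Transformation (OKT) learns an orthogonal mapping via EM minimization, converting unimodal weights into symmetric bimodal forms while suppressing activation tails and incoherence. The Proximal SVD Projection (PSP) then performs lightweight low-rank refinement, further enhancing quantizability with minimal overhead. On Qwen3-32B, BWLA reaches a Wikitext2 perplexity of 11.92 under 6-bit activations (vs. 38 from SOTA), improves five zero-shot tasks by more than 70%, and delivers a 3.26x inference speedup, demonstrating strong potential for real-world LLM compression and acceleration.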
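
The W1AX setting in the abstract (1-bit weights, low-bit activations, and an orthogonal transform applied before quantization) can be illustrated with a small NumPy sketch. This is not the authors' OKT or PSP. It assumes plain sign binarization with a per-row scale, symmetric uniform activation quantization, and a random orthogonal matrix standing in for a learned mapping. All function names (`binarize_rows`, `quantize_activations`, `random_orthogonal`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def binarize_rows(W):
    """Sign binarization with one scale per output row (assumed scheme).

    For fixed signs B = sign(W), alpha = mean(|W|) per row minimizes
    ||W - alpha * B||^2.
    """
    B = np.where(W >= 0, 1.0, -1.0)
    alpha = np.abs(W).mean(axis=1, keepdims=True)
    return alpha, B


def quantize_activations(x, bits=6):
    """Symmetric uniform fake-quantization with a single per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.abs(x).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale


def random_orthogonal(n, rng):
    """Random orthogonal matrix via QR; a stand-in, not a learned mapping."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q


rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))   # toy weight matrix
x = rng.standard_normal(16)        # toy activation vector
x[3] = 25.0                        # artificial heavy-tail outlier

Q = random_orthogonal(16, rng)
W_rot, x_rot = W @ Q, Q.T @ x

# The rotation is exact in full precision: (W Q)(Q^T x) = W x.
assert np.allclose(W @ x, W_rot @ x_rot)

# Quantize in the rotated basis: 1-bit weights, 6-bit activations.
alpha, B = binarize_rows(W_rot)
y_w1a6 = (alpha * B) @ quantize_activations(x_rot, bits=6)
```

Because the rotation cancels in full precision, any accuracy difference comes only from how well the rotated weights and activations tolerate quantization. That is the property a learned transform would try to improve.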


Related papers