Fugu-MT Paper Translation (Abstract): WINDQuant: Weight-Informed Neural Decision-Making for Global Mixed-Precision LLM Quantization

Paper overview: WINDQuant: Weight-Informed Neural Decision-Making for Global Mixed-Precision LLM Quantization

arxiv url: http://arxiv.org/abs/2605.26660v2
Date: Sun, 31 May 2026 09:30:47 GMT
Status: Translation complete
Last updated in system: 2026-06-03 00:57:58.889971
Title: WINDQuant: Weight-Informed Neural Decision-Making for Global Mixed-Precision LLM Quantization
Authors: Phong Nam Huu Nguyen, Khoi M. Le, Cong-Duy T Nguyen, Anh Tuan Luu, Thong Thanh Nguyen, Tho Quan
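Abstract summary: WINDQuant is a reinforcement-learning-based allocation controller for ultra-low-bit LLM quantization. It learns how to assign bit-widths and quantization treatments to fine-grained column chunks under a global storage budget. Experiments on LLaMA models show that WINDQuant achieves competitive performance in ultra-low-bit settings.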
Reference score (independently computed attention metric): 40.655670203062805
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Quantization is an effective approach to reduce the memory footprint and inference cost of large language models (LLMs), yet maintaining performance in the ultra-low-bit regime remains challenging. Existing post-training methods often suffer from severe accuracy degradation, while quantization-aware training requires costly retraining and additional resources. Moreover, most mixed-precision strategies rely on coarse-grained or heuristic sensitivity analysis that overlooks fine-grained variations within weight matrices. We propose WINDQuant, a reinforcement-learning-based allocation controller for ultra-low-bit LLM quantization. Rather than introducing another low-level quantization operator, WINDQuant learns how to assign bit-widths and quantization treatments to fine-grained column chunks under a global storage budget. By operating at the column-chunk level, WINDQuant enables flexible and fine-grained precision assignment within layers under a global target bit-width. The implementation combines PPO with activation-aware calibration, lightweight per-unit quantizer fitting, and explicit effective-bit accounting of the learned mixed-precision plan. Experiments on LLaMA models demonstrate that WINDQuant achieves competitive performance in ultra-low-bit settings while reducing optimization overhead relative to retraining-based approaches, highlighting reinforcement learning as a practical controller for adaptive mixed-precision quantization.
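The abstract mentions explicit effective-bit accounting of a mixed-precision plan under a global target bit-width. As a rough illustration only, the sketch below computes an average stored bit-width over column chunks and checks it against a budget. The `Chunk` class, the `overhead_bits` field (standing in for per-chunk metadata such as scales), and the example numbers are all assumptions made for this sketch; the paper's actual accounting and its PPO controller are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical description of one column chunk in a quantization plan."""
    num_weights: int        # number of weights stored in this chunk
    bits: int               # bit-width assigned to the chunk
    overhead_bits: int = 0  # assumed per-chunk metadata cost (e.g. scales)


def effective_bits(plan):
    """Average stored bits per weight, counting the assumed per-chunk overhead."""
    total_weights = sum(c.num_weights for c in plan)
    total_bits = sum(c.num_weights * c.bits + c.overhead_bits for c in plan)
    return total_bits / total_weights


def within_budget(plan, target_bits):
    """True if the plan's effective bit-width does not exceed the global target."""
    return effective_bits(plan) <= target_bits


# Illustrative plan: three chunks with different assigned bit-widths.
plan = [
    Chunk(num_weights=4096, bits=2, overhead_bits=32),
    Chunk(num_weights=4096, bits=3, overhead_bits=32),
    Chunk(num_weights=8192, bits=1, overhead_bits=32),
]

print(effective_bits(plan))      # average bits per weight for this plan
print(within_budget(plan, 2.0))  # budget check against a 2-bit global target
```

Counting metadata in the average is one plausible reading of "effective" bits: a plan whose nominal bit-widths average below the target can still exceed the storage budget once per-chunk overhead is included.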

