Fugu-MT Paper Translation (Abstract): Multi-Bitwidth Quantization for LLMs Using Additive Codebooks

Paper summary: Multi-Bitwidth Quantization for LLMs Using Additive Codebooks

arxiv url: http://arxiv.org/abs/2606.12876v1
Date: Thu, 11 Jun 2026 04:06:02 GMT
Status: Translation complete
Last updated in system: 2026-06-12 15:55:27.57212
Title: Multi-Bitwidth Quantization for LLMs Using Additive Codebooks
Title (reference translation): Multi-Bit Quantization of LLMs Using Additive Codebooks
Authors: Liza Babaoglu, Shuangyi Chen, Ashish Khisti
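Abstract summary: Large language models (LLMs) are increasingly deployed across heterogeneous hardware with varying resource constraints. This work proposes Drop-by-Drop, a new post-training quantization framework that enables inference-time precision control over LLM weights from a single trained model.
Reference score (independently computed attention score): 12.237109162791091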
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As large language models (LLMs) are increasingly deployed across heterogeneous hardware with varying resource constraints, the ability to adaptively manage the trade-off between performance and efficiency without retraining is critical. We propose Drop-by-Drop, a novel multi-bitwidth post-training quantization framework that enables inference-time precision control over LLM weights from a single trained model. Our method is theoretically grounded in information theory and successive refinement. We establish that LLM weights, which commonly follow a Gaussian distribution, can be optimally reconstructed with increasing fidelity as additional bits are incorporated, under a weighted mean squared error distortion motivated by LLM loss functions. To realize this in practice, Drop-by-Drop incorporates Matryoshka-style supervision into the loss function, exploiting the structure of additive codebooks. Drop-by-Drop produces a single model where ordered subsets of codebooks yield accurate partial reconstructions at each precision level. This approach significantly reduces storage and memory overhead by allowing a single checkpoint to serve multiple bitwidths, while maintaining competitive perplexity and accuracy across major architectures, such as Qwen, LLaMA, Gemma, and Mistral.
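Abstract (reference translation): As large language models (LLMs) are increasingly deployed across heterogeneous hardware with varying resource constraints, the ability to adaptively manage the trade-off between performance and efficiency without retraining is essential. This work proposes Drop-by-Drop, a new post-training quantization framework that enables inference-time precision control over LLM weights from a single trained model. The method is grounded in information theory and successive refinement. It shows that LLM weights, which commonly follow a Gaussian distribution, can be optimally reconstructed with increasing fidelity as additional bits are incorporated, under a weighted mean squared error distortion motivated by LLM loss functions. To realize this, Drop-by-Drop incorporates Matryoshka-style supervision into the loss function and exploits the structure of additive codebooks. Drop-by-Drop produces a single model in which ordered subsets of codebooks yield accurate partial reconstructions at each precision level. This approach significantly reduces storage and memory overhead by letting a single checkpoint serve multiple bitwidths, while maintaining competitive perplexity and accuracy across major architectures such as Qwen, LLaMA, Gemma, and Mistral.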
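
The abstract's central idea, that ordered subsets of additive codebooks give progressively better reconstructions, can be illustrated with a small NumPy sketch. This is not the paper's implementation: the function names (`reconstruct`, `weighted_mse`, `matryoshka_loss`, `greedy_encode`), the array shapes, the per-level weights, and the greedy residual encoder are all assumptions made for illustration, and the abstract does not say how Drop-by-Drop assigns codes or learns its codebooks.

```python
import numpy as np


def reconstruct(codebooks, codes, num_books):
    """Sum the codewords selected from the first `num_books` codebooks.

    codebooks: (M, K, d) array, M codebooks of K codewords of dimension d.
    codes:     (n, M) integer array, one codeword index per group per codebook.
    """
    out = np.zeros((codes.shape[0], codebooks.shape[2]))
    for m in range(num_books):
        out += codebooks[m][codes[:, m]]
    return out


def weighted_mse(w, w_hat, importance):
    """Weighted mean squared error with one importance value per coordinate."""
    return float(np.mean(importance * (w - w_hat) ** 2))


def matryoshka_loss(w, codebooks, codes, importance, level_weights=None):
    """Sum the distortion of every nested prefix of codebooks (1, 2, ..., M).

    Supervising each prefix, rather than only the full sum, is the
    Matryoshka-style idea; the uniform default weights are an assumption.
    """
    num_levels = codebooks.shape[0]
    if level_weights is None:
        level_weights = np.ones(num_levels)
    total = 0.0
    for m in range(1, num_levels + 1):
        w_hat = reconstruct(codebooks, codes, m)
        total += level_weights[m - 1] * weighted_mse(w, w_hat, importance)
    return total


def greedy_encode(w, codebooks):
    """Assign codes one codebook at a time by nearest codeword to the residual.

    A simple stand-in encoder (residual quantization), chosen only so the
    example runs end to end.
    """
    residual = w.copy()
    codes = np.zeros((w.shape[0], codebooks.shape[0]), dtype=np.int64)
    for m, book in enumerate(codebooks):
        # Squared distance from every residual vector to every codeword.
        dists = ((residual[:, None, :] - book[None, :, :]) ** 2).sum(axis=-1)
        codes[:, m] = dists.argmin(axis=1)
        residual -= book[codes[:, m]]
    return codes
```

Under these assumptions, each codebook of K codewords over groups of d weights adds log2(K)/d bits per weight, so reading only the first m codebooks of one stored checkpoint corresponds to a lower-bitwidth model, and `reconstruct(codebooks, codes, m)` is the partial reconstruction at that level.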

Related papers