Fugu-MT Paper Translation (Abstract): NanoQuant: Efficient Sub-1-Bit Quantization of Large Language Models

Paper overview: NanoQuant: Efficient Sub-1-Bit Quantization of Large Language Models

arxiv url: http://arxiv.org/abs/2602.06694v1
Date: Fri, 06 Feb 2026 13:26:44 GMT
Status: Translation complete
Last updated in system: 2026-02-09 22:18:26.40888
Title: NanoQuant: Efficient Sub-1-Bit Quantization of Large Language Models
Title (reference translation): NanoQuant: Efficient Sub-1-Bit Quantization of Large Language Models
Authors: Hyochan Chong, Dongkyu Kim, Changdong Kim, Minseop Choi,
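Abstract summary: NanoQuant formulates quantization as a low-rank binary factorization problem. It compresses full-precision weights into low-rank binary matrices and scales. It achieves state-of-the-art accuracy even at sub-1-bit compression rates.
Reference score (independently computed attention score): 0.7349727826230863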
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Weight-only quantization has become a standard approach for efficiently serving large language models (LLMs). However, existing methods fail to efficiently compress models to binary (1-bit) levels, as they either require large amounts of data and compute or incur additional storage. In this work, we propose NanoQuant, the first post-training quantization (PTQ) method to compress LLMs to both binary and sub-1-bit levels. NanoQuant formulates quantization as a low-rank binary factorization problem, and compresses full-precision weights to low-rank binary matrices and scales. Specifically, it utilizes an efficient alternating direction method of multipliers (ADMM) to precisely initialize latent binary matrices and scales, and then tunes the initialized parameters through a block and model reconstruction process. Consequently, NanoQuant establishes a new Pareto frontier in low-memory post-training quantization, achieving state-of-the-art accuracy even at sub-1-bit compression rates. NanoQuant makes large-scale deployment feasible on consumer hardware. For example, it compresses Llama2-70B by 25.8$\times$ in just 13 hours on a single H100, enabling a 70B model to operate on a consumer 8 GB GPU.
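As a rough sanity check on the figures quoted in the abstract, the sketch below converts the 25.8$\times$ ratio into an effective bits-per-weight value and a weight-memory footprint. The 16-bit baseline, the nominal 70e9 parameter count, and the function names are assumptions for illustration; the abstract does not state them, and activations and KV cache are ignored.

```python
# Back-of-the-envelope check of the abstract's compression figures.
# ASSUMPTIONS (not stated in the abstract): the uncompressed baseline is
# 16 bits per weight, and "70B" is taken as exactly 70e9 parameters.

def effective_bits_per_weight(baseline_bits: float, ratio: float) -> float:
    """Average bits per weight after compressing by `ratio`."""
    return baseline_bits / ratio


def compressed_size_gb(n_params: float, baseline_bits: float, ratio: float) -> float:
    """Compressed weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    baseline_bytes = n_params * baseline_bits / 8
    return baseline_bytes / ratio / 1e9


bits = effective_bits_per_weight(baseline_bits=16, ratio=25.8)
size = compressed_size_gb(n_params=70e9, baseline_bits=16, ratio=25.8)
print(f"effective bits per weight: {bits:.3f}")
print(f"compressed weights: {size:.2f} GB")
```

Under these assumptions the result is roughly 0.62 bits per weight and about 5.4 GB of weights, which is consistent with the claim that the model fits on an 8 GB GPU.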
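Abstract (reference translation): Weight-only quantization has become a standard approach for serving large language models (LLMs) efficiently. However, existing methods cannot efficiently compress models to the binary (1-bit) level, because they require either large amounts of data and compute or additional storage. This work proposes NanoQuant, a post-training quantization (PTQ) method that compresses LLMs to both binary and sub-1-bit levels. NanoQuant formulates quantization as a low-rank binary factorization problem and compresses full-precision weights into low-rank binary matrices and scales. Specifically, it uses an efficient alternating direction method of multipliers (ADMM) to precisely initialize the latent binary matrices and scales, then tunes the initialized parameters through a block and model reconstruction process. As a result, NanoQuant establishes a new Pareto frontier in low-memory post-training quantization and achieves state-of-the-art accuracy even at sub-1-bit compression rates. NanoQuant makes large-scale deployment feasible on consumer hardware. For example, it compresses Llama2-70B by 25.8$\times$ in just 13 hours on a single H100, allowing a 70B model to run on a consumer 8 GB GPU.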
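To see why a low-rank binary factorization can go below 1 bit per weight, the sketch below counts storage for an m x n weight matrix approximated by two binary factors of rank r. The exact NanoQuant parameterization is not given in the abstract, so the layout here is an assumption: binary factors U (m x r) and V (n x r) at 1 bit per entry, plus one 16-bit scale per row and per column. The function names are hypothetical.

```python
# Storage cost of a rank-r binary factorization of an m x n weight matrix.
# ASSUMED layout (not taken from the paper): U is m x r and V is n x r with
# 1 bit per entry, plus one scale per row and one per column, each stored
# in `scale_bits` bits.

def bits_per_weight(m: int, n: int, r: int, scale_bits: int = 16) -> float:
    """Average stored bits per original weight."""
    factor_bits = r * (m + n)            # binary entries of U and V
    scales = scale_bits * (m + n)        # per-row and per-column scales
    return (factor_bits + scales) / (m * n)


def max_rank_below(m: int, n: int, target_bits: float = 1.0,
                   scale_bits: int = 16) -> int:
    """Largest rank whose cost stays strictly below `target_bits` (0 if none)."""
    best = 0
    for r in range(1, min(m, n) + 1):
        if bits_per_weight(m, n, r, scale_bits) < target_bits:
            best = r
        else:
            break  # cost grows monotonically with r
    return best


half_rank_cost = bits_per_weight(8192, 8192, 2048)
rank_limit = max_rank_below(8192, 8192)
print(half_rank_cost, rank_limit)
```

For a square 8192 x 8192 layer under this layout, rank 2048 costs about 0.5 bits per weight, and any rank up to 4079 stays under 1 bit. The factor cost scales as r(m+n)/(mn), so lowering the rank trades reconstruction accuracy for memory.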


Related papers