NanoQuant: Efficient Sub-1-Bit Quantization of Large Language Models
- URL: http://arxiv.org/abs/2602.06694v1
- Date: Fri, 06 Feb 2026 13:26:44 GMT
- Title: NanoQuant: Efficient Sub-1-Bit Quantization of Large Language Models
- Authors: Hyochan Chong, Dongkyu Kim, Changdong Kim, Minseop Choi,
- Abstract summary: NanoQuant formulates quantization as a low-rank binary factorization problem. It compresses full-precision weights to low-rank binary matrices and scales. It achieves state-of-the-art accuracy even at sub-1-bit compression rates.
- Score: 0.7349727826230863
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Weight-only quantization has become a standard approach for efficiently serving large language models (LLMs). However, existing methods fail to efficiently compress models to binary (1-bit) levels, as they either require large amounts of data and compute or incur additional storage. In this work, we propose NanoQuant, the first post-training quantization (PTQ) method to compress LLMs to both binary and sub-1-bit levels. NanoQuant formulates quantization as a low-rank binary factorization problem, and compresses full-precision weights to low-rank binary matrices and scales. Specifically, it uses an efficient alternating direction method of multipliers (ADMM) procedure to precisely initialize the latent binary matrices and scales, and then tunes the initialized parameters through a block and model reconstruction process. Consequently, NanoQuant establishes a new Pareto frontier in low-memory post-training quantization, achieving state-of-the-art accuracy even at sub-1-bit compression rates. NanoQuant makes large-scale deployment feasible on consumer hardware. For example, it compresses Llama2-70B by 25.8$\times$ in just 13 hours on a single H100, enabling a 70B model to operate on a consumer 8 GB GPU.
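The abstract fixes the representation (low-rank binary matrices plus scales) but not the ADMM updates or the reconstruction tuning, so the snippet below is only a minimal sketch of one natural reading of that factorization: an SVD warm start followed by an alternating least-squares loop that re-binarizes each factor with a per-column scale. The function name `binary_lowrank_factorize`, the rank `k`, and the scale rule are illustrative assumptions, not the authors' procedure.

```python
import numpy as np

def binary_lowrank_factorize(W, k, iters=20):
    """Approximate W (d_out x d_in) as (U * a) @ (V * b).T with
    U in {-1,+1}^(d_out x k), V in {-1,+1}^(d_in x k) and positive
    per-column scales a, b. Heuristic sketch only; NanoQuant's ADMM
    initialization and reconstruction tuning are not reproduced here."""
    # Warm start the two real-valued factors from a truncated SVD.
    U0, s, Vt0 = np.linalg.svd(W, full_matrices=False)
    A = U0[:, :k] * np.sqrt(s[:k])        # left factor, d_out x k
    B = Vt0[:k, :].T * np.sqrt(s[:k])     # right factor, d_in x k

    def binarize(M):
        # Sign matrix plus the per-column scale that minimizes
        # ||M[:, j] - scale_j * sign(M[:, j])|| for each column j.
        S = np.where(M >= 0, 1.0, -1.0)
        return S, np.abs(M).mean(axis=0)

    U, a = binarize(A)
    V, b = binarize(B)
    for _ in range(iters):
        # Fix the right factor, refit the left by least squares, re-binarize.
        R = V * b
        U, a = binarize(W @ np.linalg.pinv(R.T))
        # Fix the left factor, refit the right, re-binarize.
        L = U * a
        V, b = binarize((np.linalg.pinv(L) @ W).T)

    return U, a, V, b, (U * a) @ (V * b).T
```

Under this parameterization, a d_out x d_in layer costs about k(d_out + d_in) bits for the two sign matrices plus 2k higher-precision scales, i.e. roughly k(d_out + d_in)/(d_out * d_in) bits per original weight, which falls below 1 bit for small enough k. As a rough check on the headline numbers, Llama2-70B in FP16 occupies about 140 GB, and 140 GB / 25.8 ≈ 5.4 GB, consistent with fitting the compressed model on an 8 GB GPU.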
Related papers
- Learning Grouped Lattice Vector Quantizers for Low-Bit LLM Compression [57.54335545892155]
We introduce a Grouped Lattice Vector Quantization (GLVQ) framework that assigns each group of weights a customized lattice codebook. Our approach achieves a better trade-off between model size and accuracy compared to existing post-training quantization baselines.
arXiv Detail & Related papers (2025-10-23T20:19:48Z)
- Binary Quantization For LLMs Through Dynamic Grouping [13.578307208515819]
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of Natural Language Processing (NLP) tasks. Binary quantization, which compresses model weights from 16-bit Brain Float to 1-bit representations in {-1, 1}, offers significant reductions in storage and inference costs. We propose a novel optimization objective tailored for binary quantization, along with three algorithms designed to realize it effectively.
arXiv Detail & Related papers (2025-09-03T06:36:21Z)
- Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design [34.04231165571518]
Speculative decoding and quantization effectively accelerate memory-bound inference of large language models. Quantization achieves this by compressing weights and activations into lower bit-widths and also reduces computations via low-bit matrix multiplications. Experiments show that the memory benefits from 4-bit weight quantization are diminished by the computational load from speculative decoding.
arXiv Detail & Related papers (2025-05-28T09:55:08Z)
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models [50.525259103219256]
Quantization-aware training (QAT) offers a solution by reducing memory consumption through low-bit representations with minimal accuracy loss. We propose Efficient Quantization-Aware Training (EfficientQAT), a more feasible QAT algorithm. EfficientQAT involves two consecutive phases: block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP).
arXiv Detail & Related papers (2024-07-10T17:53:30Z)
- GPTQT: Quantize Large Language Models Twice to Push the Efficiency [1.3149617027696827]
This paper introduces a new post-training quantization method, GPTQT, to reduce memory usage and enhance processing speed.
Practice has shown that directly minimizing the quantization error of the weights is ineffective, as it leads to overfitting.
GPTQT employs a progressive two-step approach: it first quantizes the weights with linear quantization to a relatively high bit-width, and then converts the obtained integer weights to a lower-bit binary coding.
arXiv Detail & Related papers (2024-07-03T08:08:01Z)
- SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [63.118592279833656]
Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs). We propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths at the group-wise level. Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths.
arXiv Detail & Related papers (2024-05-23T16:21:48Z)
- OneBit: Towards Extremely Low-bit Large Language Models [66.29839811207617]
This paper boldly quantizes the weight matrices of LLMs to 1-bit, paving the way for the extremely low bit-width deployment of LLMs. Experiments indicate that OneBit achieves good performance (at least 81% of the non-quantized performance on LLaMA models) with robust training processes.
arXiv Detail & Related papers (2024-02-17T14:26:57Z)
- Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
arXiv Detail & Related papers (2024-01-11T18:54:44Z)
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [57.27101446992148]
Large language models (LLMs) have revolutionized natural language processing tasks.
Recent post-training quantization (PTQ) methods are effective in reducing memory footprint and improving the computational efficiency of LLMs.
We introduce an Omnidirectionally calibrated Quantization technique for LLMs, which achieves good performance in diverse quantization settings.
arXiv Detail & Related papers (2023-08-25T02:28:35Z)
- Neural Network Compression using Binarization and Few Full-Precision Weights [7.206962876422061]
Automatic Prune Binarization (APB) is a novel compression technique combining quantization with pruning.
APB enhances the representational capability of binary networks using a few full-precision weights.
APB delivers better accuracy/memory trade-off compared to state-of-the-art methods.
arXiv Detail & Related papers (2023-06-15T08:52:00Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
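As a concrete illustration of the lookup-table idea behind the DeepGEMM entry above (not its actual SIMD kernels): with ultra-low-precision weights there are only a few distinct weight values per code, so inner products can be computed by gathering precomputed partial products instead of multiplying. The 2-bit code width, the codebook, and the function name below are assumptions made for this sketch.

```python
import numpy as np

def lut_matvec_2bit(w_codes, codebook, x):
    """w_codes: (d_out, d_in) uint8 codes in {0, 1, 2, 3};
    codebook: (4,) float values the codes stand for;
    x: (d_in,) activations. Replaces multiplies with table lookups."""
    # One small table per input position: x[j] times each possible weight value.
    lut = np.outer(x, codebook)               # shape (d_in, 4)
    cols = np.arange(w_codes.shape[1])
    # For each output row, gather its partial products from the table and sum.
    return lut[cols, w_codes].sum(axis=1)     # shape (d_out,)
```

For instance, with `codebook = np.array([-1.5, -0.5, 0.5, 1.5])`, `lut_matvec_2bit(codes, codebook, x)` returns the same vector as `codebook[codes] @ x`, but the inner loop only gathers precomputed entries, which is the access pattern that SIMD shuffle/lookup instructions accelerate on CPUs.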
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.