Related papers: Boost Post-Training Quantization via Null Space Optimization for Large Language Models

Boost Post-Training Quantization via Null Space Optimization for Large Language Models

URL: http://arxiv.org/abs/2506.11044v3
Date: Sun, 26 Oct 2025 13:46:34 GMT
Title: Boost Post-Training Quantization via Null Space Optimization for Large Language Models
Authors: Jiaqi Zhao, Miao Zhang, Deng Xiang, Ming Li, Weili Guan, Liqiang Nie,
Abstract summary: Existing post-training quantization methods for large language models (LLMs) offer remarkable success.<n>The increasingly marginal performance gains suggest that existing quantization strategies are insufficient to support the development of more compressed models.<n>We argue that the quantization error can be effectively alleviated by constraining the post-quantization weight to lie within the null space of input activations.
Score: 66.73751310500656
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Existing post-training quantization methods for large language models (LLMs) offer remarkable success. However, the increasingly marginal performance gains suggest that existing quantization strategies are insufficient to support the development of more compressed models. To inspire new directions for future research, this paper introduces the concept of null space into LLMs quantization. We argue that the quantization error can be effectively alleviated by constraining the post-quantization weight perturbation to lie within the null space of input activations. To prove this idea, we propose a plug-and-play null space projection module for existing milestone PTQ baselines named Q2N. Specifically, we first design an efficient and accurate null space projection approximation method tailored to the characteristics of LLMs. Subsequently, we theoretically derive a closed-form solution for an equivalent vector of the obtained projection matrix, which satisfies practical inference condition while avoiding additional memory overhead. Extensive experiments are conducted on various state-of-the-art LLMs (LLaMA3, DeepSeek, Qwen3) and baselines, demonstrating the effectiveness of both our Q2N and the perspective of null space optimization for LLMs quantization. We view this paper the first step to further alleviate the quantization error based on the insights of null space, hoping it inspiring future researchers to design more advanced quantization methods. Codes are available at https://github.com/zjq0455/q2n.

Related papers

Rethinking Output Alignment For 1-bit Post-Training Quantization of Large Language Models [41.677469535447024]
Large Language Models (LLMs) deliver strong performance across a wide range of NLP tasks, but their massive sizes hinder deployment on resource-constrained devices.<n>Post-training quantization (PTQ) is widely adopted for its efficiency, as it requires no retraining and only a small dataset for calibration.<n>Recent advances for post-training quantization have demonstrated that even sub-4-bit methods can maintain most of the original model performance.
arXiv Detail & Related papers (2025-12-25T12:39:36Z)
MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation [74.34220141721231]
We present MPQ-DMv2, an improved textbfMixed textbfPrecision textbfQuantization framework for extremely low-bit textbfDiffusion textbfModels.
arXiv Detail & Related papers (2025-07-06T08:16:50Z)
Gradual Binary Search and Dimension Expansion : A general method for activation quantization in LLMs [1.4999444543328293]
Large language models (LLMs) have become pivotal in artificial intelligence, demonstrating strong capabilities in reasoning, understanding, and generating data.<n> Quantization is a widely used method to reduce memory usage and inference time, but LLMs present unique challenges due to the prevalence of outliers in their activations.<n>We demonstrate that Hadamard matrices are more effective in reducing outliers, which are a significant obstacle in achieving low-bit quantization.
arXiv Detail & Related papers (2025-04-18T13:46:58Z)
Quantizing Large Language Models for Code Generation: A Differentiated Replication [51.85505914274633]
Large Language Models (LLMs) have shown an impressive capability in code generation and, specifically, to automatically implement requirements described in natural language.<n>LLMs pose significant challenges related to their memory (and, consequently, carbon) footprint.<n>New frontier for LLM quantization is 4-bit precision, resulting in an average memory footprint reduction of 70%.
arXiv Detail & Related papers (2025-03-10T09:26:08Z)
Quantize What Counts: Bit Allocation Insights Informed by Spectral Gaps in Keys and Values [57.54443445583921]
We provide two novel theorems aimed at enhancing KV quantization methods.<n>Our first theorem, termed Key-Value Norm Disparity, states that the key weight matrices by nature carry richer information.<n>Our second theorem, Key-Driven Quantization, posits that prioritizing the quantization precision of keys over values induces significant improvements to the overall quantization performance.
arXiv Detail & Related papers (2025-02-20T22:24:27Z)
RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE)<n>RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers.<n>Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z)
Pushing the Limits of Large Language Model Quantization via the Linearity Theorem [71.3332971315821]
We present a "line theoremarity" establishing a direct relationship between the layer-wise $ell$ reconstruction error and the model perplexity increase due to quantization. This insight enables two novel applications: (1) a simple data-free LLM quantization method using Hadamard rotations and MSE-optimal grids, dubbed HIGGS, and (2) an optimal solution to the problem of finding non-uniform per-layer quantization levels.
arXiv Detail & Related papers (2024-11-26T15:35:44Z)
Pyramid Vector Quantization for LLMs [8.779688608449902]
Pyramid Vector Quantization (PVQ) for large language models.<n>PVQ uses a fixed integer lattice on the sphere by projecting points onto the 1-sphere, which allows for efficient encoding and decoding without requiring an explicit codebook in memory.<n>We achieve state-of-the-art quantization performance with pareto-optimal trade-off between performance and bits per weight and bits per activation, compared to compared methods.
arXiv Detail & Related papers (2024-10-22T11:57:32Z)
FlatQuant: Flatness Matters for LLM Quantization [58.28221892035609]
We propose FlatQuant, a new post-training quantization approach that enhances the flatness of weights and activations.<n>Our approach identifies optimal affine transformations for each linear layer, calibrated in hours via a lightweight objective.<n>It achieves less than 1% accuracy drop for W4A4 quantization on the LLaMA-3-70B model, surpassing SpinQuant by 7.5%.
arXiv Detail & Related papers (2024-10-12T08:10:28Z)
LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit [55.73370804397226]
Quantization, a key compression technique, can effectively mitigate these demands by compressing and accelerating large language models. We present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization. Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats.
arXiv Detail & Related papers (2024-05-09T11:49:05Z)
PB-LLM: Partially Binarized Large Language Models [14.244537605866864]
This paper explores network binarization, compressing model weights to a single bit, specifically for Large Language Models (LLMs) compression. We propose a novel approach, Partially-Binarized LLM (PB-LLM), which can achieve extreme low-bit quantization while maintaining the linguistic reasoning capacity of quantized LLMs.
arXiv Detail & Related papers (2023-09-29T14:35:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.