CDQuant: Greedy Coordinate Descent for Accurate LLM Quantization
- URL: http://arxiv.org/abs/2406.17542v3
- Date: Tue, 22 Oct 2024 18:51:01 GMT
- Title: CDQuant: Greedy Coordinate Descent for Accurate LLM Quantization
- Authors: Pranav Ajit Nair, Arun Sai Suggala
- Abstract summary: Large language models (LLMs) have recently demonstrated remarkable performance across diverse language tasks.
Quantization has emerged as a key technique for enabling the compression of large models with minimal impact on performance.
The GPTQ algorithm, a post-training quantization (PTQ) method, has proven highly effective for compressing LLMs.
We introduce CDQuant, a simple and scalable alternative to GPTQ with improved performance.
- Score: 8.92409376299856
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have recently demonstrated remarkable performance across diverse language tasks, but their deployment is often constrained by their substantial computational and storage requirements. Quantization has emerged as a key technique for addressing this challenge, enabling the compression of large models with minimal impact on performance. The recent GPTQ algorithm, a post-training quantization (PTQ) method, has proven highly effective for compressing LLMs, sparking a wave of research that leverages GPTQ as a core component. Recognizing the pivotal role of GPTQ in the PTQ landscape, we introduce CDQuant, a simple and scalable alternative to GPTQ with improved performance. CDQuant uses greedy coordinate descent to minimize the layer-wise reconstruction loss, yielding high-quality quantized weights. Our algorithm is easy to implement and scales efficiently to models with hundreds of billions of parameters. We perform extensive evaluation on the Gemma and PaLM2 model families, and demonstrate that CDQuant consistently outperforms GPTQ in 2-4 bit weight quantization. Moreover, CDQuant improves the performance of state-of-the-art PTQ techniques such as QuIP and FrameQuant when used as a replacement for their GPTQ component, resulting in further gains in quality.
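To make the approach concrete, here is a minimal NumPy sketch of greedy coordinate descent on the layer-wise objective for a single output column w: minimize (w_q - w)^T H (w_q - w) with H = X^T X, which equals the reconstruction loss ||Xw - Xw_q||^2. The uniform quantization grid, round-to-nearest initialization, and stopping rule are simplifying assumptions for illustration; this is not the authors' implementation.

```python
import numpy as np

def quantize_to_grid(x, scale):
    """Round to the nearest point of a uniform quantization grid."""
    return np.round(x / scale) * scale

def greedy_cd_quantize(w, H, scale, n_iters=None):
    """Greedy coordinate descent for one weight column (illustrative sketch).

    Minimizes (w_q - w)^T H (w_q - w), i.e. the layer-wise reconstruction loss
    ||X w - X w_q||^2 with H = X^T X, subject to w_q lying on a uniform grid.
    Assumes H has a strictly positive diagonal (in practice H is dampened).
    """
    d = w.shape[0]
    n_iters = n_iters or 2 * d
    w_q = quantize_to_grid(w, scale)       # start from round-to-nearest
    g = H @ (w_q - w)                      # half-gradient of the quadratic objective
    diag = np.diag(H)
    for _ in range(n_iters):
        # Best grid value for each coordinate, holding all the others fixed.
        target = w_q - g / diag            # unconstrained coordinate-wise minimizer
        cand = quantize_to_grid(target, scale)
        delta = cand - w_q
        # Decrease in the objective if coordinate i alone moves to cand[i].
        gain = -(delta ** 2 * diag + 2 * delta * g)
        i = int(np.argmax(gain))
        if gain[i] <= 0:                   # no single-coordinate move helps
            break
        g += H[:, i] * delta[i]            # rank-one update of the gradient
        w_q[i] = cand[i]
    return w_q
```

Where GPTQ visits coordinates in a fixed order with error feedback, greedy coordinate descent repeatedly picks the single coordinate whose update most reduces the loss, which is what the `argmax` over `gain` implements above.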
Related papers
- LRQ-DiT: Log-Rotation Post-Training Quantization of Diffusion Transformers for Text-to-Image Generation [34.14174796390669]
Post-training quantization (PTQ) is a promising solution to reduce memory usage and accelerate inference.
Existing PTQ methods suffer from severe performance degradation under extreme low-bit settings.
We propose LRQ-DiT, an efficient and accurate PTQ framework.
arXiv Detail & Related papers (2025-08-05T14:16:11Z) - The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm [52.89358421626026]
GPTQ emerged as one of the standard methods for one-shot post-training quantization at LLM scale.
We show that GPTQ is mathematically identical to Babai's nearest plane algorithm for the classical closest vector problem.
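For readers unfamiliar with the lattice-side view, here is a compact NumPy sketch of Babai's nearest plane algorithm for the closest vector problem. The basis `B` and target `t` are generic placeholders, not the specific lattice induced by the quantization problem in the paper.

```python
import numpy as np

def babai_nearest_plane(B, t):
    """Approximate the lattice point closest to t for the lattice spanned by the
    columns of B, using Babai's nearest plane algorithm."""
    n = B.shape[1]
    # Gram-Schmidt orthogonalization of the basis columns.
    B_star = B.astype(float).copy()
    for i in range(n):
        for j in range(i):
            mu = (B[:, i] @ B_star[:, j]) / (B_star[:, j] @ B_star[:, j])
            B_star[:, i] -= mu * B_star[:, j]
    coeffs = np.zeros(n, dtype=int)
    residual = t.astype(float)
    # Round one coordinate at a time, from the last basis vector back to the first.
    for i in reversed(range(n)):
        c = int(round(residual @ B_star[:, i] / (B_star[:, i] @ B_star[:, i])))
        coeffs[i] = c
        residual -= c * B[:, i]
    return B @ coeffs, coeffs
```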
arXiv Detail & Related papers (2025-07-24T16:22:18Z) - GPLQ: A General, Practical, and Lightning QAT Method for Vision Transformers [11.452135395287119]
Vision Transformers (ViTs) are essential in computer vision but are computationally intensive.
Model quantization aims to alleviate this difficulty, yet existing Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) methods exhibit significant limitations.
This paper introduces General, Practical, and Lightning Quantization (GPLQ), a novel framework for efficient ViT quantization.
arXiv Detail & Related papers (2025-06-13T13:45:17Z) - FIMA-Q: Post-Training Quantization for Vision Transformers by Fisher Information Matrix Approximation [55.12070409045766]
Post-training quantization (PTQ) has stood out as a cost-effective and promising model compression paradigm in recent years.
Current PTQ methods for Vision Transformers (ViTs) still suffer from significant accuracy degradation, especially under low-bit quantization.
arXiv Detail & Related papers (2025-06-13T07:57:38Z) - Quantization Error Propagation: Revisiting Layer-Wise Post-Training Quantization [0.0]
Post-training quantization has emerged as a widely used technique for compressing large language models (LLMs) without retraining.
The accumulation of quantization errors across layers significantly degrades performance, particularly in low-bit regimes.
We propose Quantization Error Propagation (QEP), a lightweight and general framework that enhances layer-wise PTQ by explicitly propagating the quantization error.
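The central mechanism, calibrating each layer against activations that have already passed through the quantized prefix of the network, can be sketched as follows. The round-to-nearest placeholder quantizer and the `X @ W` layer model are assumptions for illustration, not the exact QEP formulation.

```python
import numpy as np

def rtn_quantize(W, bits=4):
    """Placeholder round-to-nearest weight quantizer (stands in for GPTQ, CDQuant, ...)."""
    scale = np.abs(W).max() / (2 ** (bits - 1) - 1) + 1e-12
    q = np.clip(np.round(W / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q * scale

def quantize_with_error_propagation(layers, X):
    """Layer-wise PTQ that feeds the *quantized* activations forward, so each layer
    sees the error accumulated by the layers quantized before it."""
    X_q = X.copy()
    quantized = []
    for W in layers:              # each layer modeled as X @ W
        W_q = rtn_quantize(W)     # a data-aware quantizer would calibrate on X_q here
        quantized.append(W_q)
        X_q = X_q @ W_q           # propagate the quantization error downstream
    return quantized
```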
arXiv Detail & Related papers (2025-04-13T15:56:00Z) - Task-Circuit Quantization: Leveraging Knowledge Localization and Interpretability for Compression [55.323397702682506]
Post-training quantization (PTQ) reduces a model's memory footprint by mapping full precision weights into low bit weights without costly retraining.
We develop a new mixed-precision PTQ approach, Task-Circuit Quantization (TaCQ), that draws parallels to automated circuit discovery.
arXiv Detail & Related papers (2025-04-10T02:19:03Z) - APHQ-ViT: Post-Training Quantization with Average Perturbation Hessian Based Reconstruction for Vision Transformers [71.2294205496784]
We propose APHQ-ViT, a novel PTQ approach based on importance estimation with Average Perturbation Hessian (APH).
We show that APHQ-ViT using linear quantizers outperforms existing PTQ methods by substantial margins in 3-bit and 4-bit across different vision tasks.
arXiv Detail & Related papers (2025-04-03T11:48:56Z) - AHCPTQ: Accurate and Hardware-Compatible Post-Training Quantization for Segment Anything Model [0.6827423171182154]
The Segment Anything Model (SAM) has demonstrated strong versatility across various visual tasks.
Post-training quantization (PTQ) has emerged as an effective strategy for efficient deployment.
We propose AHCPTQ, an accurate and hardware-efficient PTQ method for SAM.
arXiv Detail & Related papers (2025-03-05T01:04:45Z) - Mixed-Precision Graph Neural Quantization for Low Bit Large Language Models [13.709080134204326]
Post-Training Quantization (PTQ) is pivotal for deploying large language models within resource-limited settings.
We introduce a Mixed-precision Graph Neural PTQ (MG-PTQ) approach, employing a graph neural network (GNN) module to capture dependencies among weights.
Our method more effectively captures dependencies among target weights, leading to a more accurate assessment of weight importance.
arXiv Detail & Related papers (2025-01-30T05:39:01Z) - GWQ: Gradient-Aware Weight Quantization for Large Language Models [61.17678373122165]
Gradient-aware weight quantization (GWQ) is the first low-bit weight quantization approach that leverages gradients to localize outliers.
GWQ retains the weights corresponding to the top 1% of outliers at FP16 precision, while the remaining non-outlier weights are stored in a low-bit format.
In the zero-shot task, GWQ quantized models have higher accuracy compared to other quantization methods.
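A minimal sketch of the outlier split described above, assuming outliers are simply the weights with the largest gradient magnitude (the paper's localization procedure may differ):

```python
import numpy as np

def gwq_split(W, grad, bits=4, outlier_frac=0.01):
    """Split W into a low-bit dense part and a sparse FP16 outlier correction.

    Outliers = top ~1% of positions by |grad|; everything else is quantized
    with round-to-nearest to `bits` bits. Illustrative sketch only.
    """
    scores = np.abs(grad).ravel()
    k = max(1, int(outlier_frac * scores.size))
    mask = np.zeros(W.size, dtype=bool)
    mask[np.argpartition(scores, -k)[-k:]] = True      # top-k outlier positions
    mask = mask.reshape(W.shape)

    # Low-bit round-to-nearest for the non-outlier weights.
    scale = np.abs(W[~mask]).max() / (2 ** (bits - 1) - 1) + 1e-12
    W_low = np.clip(np.round(W / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1) * scale
    W_low[mask] = 0.0

    W_outlier = np.where(mask, W, 0.0).astype(np.float16)  # kept at FP16
    return W_low, W_outlier, mask
```

The dequantized approximation is then `W_low + W_outlier.astype(W.dtype)`.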
arXiv Detail & Related papers (2024-10-30T11:16:04Z) - A Quantum Circuit-Based Compression Perspective for Parameter-Efficient Learning [19.178352290785153]
We introduce Quantum Parameter Adaptation (QPA) in the framework of quantum parameter generation.
QPA integrates QNNs with a classical multi-layer perceptron mapping model to generate parameters for fine-tuning methods.
Using Gemma-2 and GPT-2 as case studies, QPA demonstrates significant parameter reduction for parameter-efficient fine-tuning methods.
arXiv Detail & Related papers (2024-10-13T14:09:29Z) - ERQ: Error Reduction for Post-Training Quantization of Vision Transformers [48.740630807085566]
Post-training quantization (PTQ) for vision transformers (ViTs) has garnered significant attention due to its efficiency in compressing models.
We propose ERQ, a two-step PTQ approach meticulously crafted to sequentially reduce the quantization error arising from activation and weight quantization.
ERQ surpasses the state-of-the-art GPTQ by 22.36% in accuracy for W3A4 ViT-S.
arXiv Detail & Related papers (2024-07-09T12:06:03Z) - GPTQT: Quantize Large Language Models Twice to Push the Efficiency [1.3149617027696827]
This paper introduces a new post-training quantization method, GPTQT, to reduce memory usage and enhance processing speed.
Practice has shown that merely minimizing the quantization error of weights is ineffective and leads to overfitting.
GPTQT employs a progressive two-step approach: it initially quantizes weights to a relatively high bit width using linear quantization, and then converts the obtained integer weights to a lower-bit binary coding.
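The progressive two-step structure can be sketched as below; the second stage here is plain linear re-scaling of the integer codes, standing in for the paper's binary coding scheme, so this illustrates the shape of the method rather than GPTQT itself.

```python
import numpy as np

def linear_quantize(x, bits):
    """Symmetric uniform quantization; returns integer codes and the scale."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1) + 1e-12
    q = np.clip(np.round(x / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1).astype(np.int32)
    return q, scale

def two_step_quantize(W, high_bits=8, low_bits=3):
    """Step 1: quantize to a relatively high bit width.
    Step 2: re-encode the resulting integer codes at a lower bit width."""
    q_high, s_high = linear_quantize(W, high_bits)
    q_low, s_low = linear_quantize(q_high.astype(float), low_bits)
    return q_low, s_high * s_low      # dequantize as q_low * (s_high * s_low)
```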
arXiv Detail & Related papers (2024-07-03T08:08:01Z) - WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
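As a point of reference for what converting tensors into low-bit integers involves, here is a generic asymmetric uniform quantizer of the kind commonly used for weights and KV-cache entries; WKVQuant's actual scheme adds components specific to jointly handling weights and the key/value cache.

```python
import numpy as np

def asym_quantize(x, bits=4):
    """Asymmetric uniform quantization to unsigned integers.
    Returns integer codes plus the (scale, zero_point) needed to dequantize."""
    qmax = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / qmax + 1e-12
    zero_point = np.round(-lo / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

# e.g. compress a cached key tensor of shape (heads, seq_len, head_dim):
# k_int, s, z = asym_quantize(k_cache); k_approx = dequantize(k_int, s, z)
```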
arXiv Detail & Related papers (2024-02-19T11:33:21Z) - Norm Tweaking: High-performance Low-bit Quantization of Large Language Models [21.855106896725598]
We introduce a technique called norm tweaking, which can be used as a plugin in current PTQ methods to achieve high precision.
Our method demonstrates significant improvements in both weight-only quantization and joint quantization of weights and activations.
Our simple and effective approach makes it more practical for real-world applications.
arXiv Detail & Related papers (2023-09-06T06:51:15Z) - Gradient-Based Post-Training Quantization: Challenging the Status Quo [23.1120983784623]
Quantization has become a crucial step for the efficient deployment of deep neural networks.
In this work, we show that the process is, to a certain extent, robust to a number of variables.
We derive a number of best practices for designing more efficient and scalable GPTQ methods.
arXiv Detail & Related papers (2023-08-15T09:25:11Z) - Weight Re-Mapping for Variational Quantum Algorithms [54.854986762287126]
We introduce the concept of weight re-mapping for variational quantum circuits (VQCs).
We employ seven distinct weight re-mapping functions to assess their impact on eight classification datasets.
Our results indicate that weight re-mapping can enhance the convergence speed of the VQC.
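To illustrate what weight re-mapping means in this context: unbounded trainable weights are passed through a fixed function that squashes them into a bounded rotation-angle range before they parameterize the circuit. The specific functions below are illustrative examples, not necessarily the seven evaluated in the paper.

```python
import numpy as np

# Candidate re-mapping functions that squash unbounded trainable weights into a
# bounded range of rotation-gate angles (illustrative examples only).
REMAPPINGS = {
    "identity": lambda w: w,
    "tanh":     lambda w: np.pi * np.tanh(w),   # (-pi, pi)
    "arctan":   lambda w: 2.0 * np.arctan(w),   # (-pi, pi)
    "sin":      lambda w: np.pi * np.sin(w),    # [-pi, pi]
}

def remapped_angles(weights, name="tanh"):
    """Map raw VQC weights to rotation angles before building the circuit,
    e.g. angle_i feeds an RY gate on qubit i."""
    return REMAPPINGS[name](np.asarray(weights, dtype=float))
```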
arXiv Detail & Related papers (2023-06-09T09:42:21Z) - PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models [52.09865918265002]
We propose a novel "quantize before fine-tuning" framework, PreQuant.
PreQuant is compatible with various quantization strategies, with outlier-aware fine-tuning incorporated to correct the induced quantization error.
We demonstrate the effectiveness of PreQuant on the GLUE benchmark using BERT, RoBERTa, and T5.
arXiv Detail & Related papers (2023-05-30T08:41:33Z) - Benchmarking the Reliability of Post-training Quantization: a Particular Focus on Worst-case Performance [53.45700148820669]
Post-training quantization (PTQ) is a popular method for compressing deep neural networks (DNNs) without modifying their original architecture or training procedures.
Despite its effectiveness and convenience, the reliability of PTQ methods in the presence of extreme cases such as distribution shift and data noise remains largely unexplored.
This paper first investigates this problem on various commonly-used PTQ methods.
arXiv Detail & Related papers (2023-03-23T02:55:50Z) - RepQ-ViT: Scale Reparameterization for Post-Training Quantization of Vision Transformers [2.114921680609289]
We propose RepQ-ViT, a novel PTQ framework for vision transformers (ViTs).
RepQ-ViT decouples the quantization and inference processes.
It can outperform existing strong baselines and encouragingly improve the accuracy of 4-bit PTQ of ViTs to a usable level.
arXiv Detail & Related papers (2022-12-16T02:52:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.