CLAQ: Pushing the Limits of Low-Bit Post-Training Quantization for LLMs
- URL: http://arxiv.org/abs/2405.17233v2
- Date: Mon, 3 Jun 2024 02:46:53 GMT
- Title: CLAQ: Pushing the Limits of Low-Bit Post-Training Quantization for LLMs
- Authors: Haoyu Wang, Bei Liu, Hang Shao, Bo Xiao, Ke Zeng, Guanglu Wan, Yanmin Qian,
- Abstract summary: Column-Level Adaptive weight Quantization (CLAQ) is a novel and effective framework for Large Language Models (LLMs) quantization.
In this paper, we present a novel and effective CLAQ framework by introducing three different types of adaptive strategies for LLM quantization.
Experiments on various mainstream open source LLMs including LLaMA-1, LLaMA-2 and Yi demonstrate that our methods achieve the state-of-the-art results across different bit settings.
- Score: 44.03692512352445
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Parameter quantization for Large Language Models (LLMs) has attracted increasing attentions recently in reducing memory costs and improving computational efficiency. Early approaches have been widely adopted. However, the existing methods suffer from poor performance in low-bit (such as 2 to 3 bits) scenarios. In this paper, we present a novel and effective Column-Level Adaptive weight Quantization (CLAQ) framework by introducing three different types of adaptive strategies for LLM quantization. Firstly, a K-Means clustering based algorithm is proposed that allows dynamic generation of quantization centroids for each column of a parameter matrix. Secondly, we design an outlier-guided adaptive precision search strategy which can dynamically assign varying bit-widths to different columns. Finally, a dynamic outlier reservation scheme is developed to retain some parameters in their original float point precision, in trade off of boosted model performance. Experiments on various mainstream open source LLMs including LLaMA-1, LLaMA-2 and Yi demonstrate that our methods achieve the state-of-the-art results across different bit settings, especially in extremely low-bit scenarios. Code is available at https://github.com/fayuge/CLAQ.
Related papers
- SpaLLM: Unified Compressive Adaptation of Large Language Models with Sketching [32.4599581528901]
"Two-tower" architecture is used for compressing pre-trained LLM parameters into compact representations and fine-tuning the additive full-precision adapter.
We propose SpaLLM (Sketched Adapting of LLMs), a novel compressive adaptation approach for LLMs.
We show that SpaLLM sketches pre-trained LLM weights into lookup tables and directly fine-tunes the values in these tables.
arXiv Detail & Related papers (2024-10-08T20:58:24Z) - Search for Efficient Large Language Models [52.98684997131108]
Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research.
Weight pruning, quantization, and distillation have been embraced to compress LLMs, targeting memory reduction and inference acceleration.
Most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures.
arXiv Detail & Related papers (2024-09-25T21:32:12Z) - Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning on Large-Language Models.
We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU.
arXiv Detail & Related papers (2024-06-15T09:31:03Z) - OneBit: Towards Extremely Low-bit Large Language Models [66.29839811207617]
This paper boldly quantizes the weight matrices of LLMs to 1-bit, paving the way for the extremely low bit-width deployment of LLMs.
Experiments indicate that OneBit achieves good performance (at least 81% of the non-quantized performance on LLaMA models) with robust training processes.
arXiv Detail & Related papers (2024-02-17T14:26:57Z) - Scaling Sparse Fine-Tuning to Large Language Models [67.59697720719672]
Large Language Models (LLMs) are difficult to fully fine-tune due to their sheer number of parameters.
We propose SpIEL, a novel sparse finetuning method which maintains an array of parameter indices and the deltas of these parameters relative to their pretrained values.
We show that SpIEL is superior to popular parameter-efficient fine-tuning methods like LoRA in terms of performance and comparable in terms of run time.
arXiv Detail & Related papers (2024-01-29T18:43:49Z) - Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
arXiv Detail & Related papers (2024-01-11T18:54:44Z) - QuantEase: Optimization-based Quantization for Language Models [17.333778751252392]
This work introduces Quantization (PTQ) of various quantization layers from recent advances of Large Language Models (LLMs)
Our CD-based approach features straightforward updates, relying solely on vector operations.
We also explore an outlier approach, allowing for retaining significant weights (outoutliers) with complete precision.
arXiv Detail & Related papers (2023-09-05T01:39:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.