PB-LLM: Partially Binarized Large Language Models
- URL: http://arxiv.org/abs/2310.00034v2
- Date: Tue, 7 Nov 2023 20:41:22 GMT
- Title: PB-LLM: Partially Binarized Large Language Models
- Authors: Yuzhang Shang, Zhihang Yuan, Qiang Wu, Zhen Dong
- Abstract summary: This paper explores network binarization, compressing model weights to a single bit, specifically for compressing Large Language Models (LLMs).
We propose a novel approach, Partially-Binarized LLM (PB-LLM), which can achieve extreme low-bit quantization while maintaining the linguistic reasoning capacity of quantized LLMs.
- Score: 14.244537605866864
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper explores network binarization, a radical form of quantization,
compressing model weights to a single bit, specifically for compressing Large
Language Models (LLMs). Because previous binarization methods cause LLMs to
collapse, we propose a novel approach, Partially-Binarized LLM (PB-LLM), which can
achieve extreme low-bit quantization while maintaining the linguistic reasoning
capacity of quantized LLMs. Specifically, our exploration first uncovers the
ineffectiveness of naive applications of existing binarization algorithms and
highlights the imperative role of salient weights in achieving low-bit
quantization. Thus, PB-LLM filters a small ratio of salient weights during
binarization, allocating them to higher-bit storage, i.e., partial
binarization. PB-LLM is further extended to recover the capacities of quantized
LLMs from the perspectives of post-training quantization
(PTQ) and quantization-aware training (QAT). Under PTQ, combining the concepts
from GPTQ, we reconstruct the binarized weight matrix guided by the Hessian
matrix and successfully recover the reasoning capacity of PB-LLM in low-bit.
Under QAT, we freeze the salient weights during training, explore the
derivation of optimal scaling factors crucial for minimizing the quantization
error, and propose a scaling mechanism based on this derived scaling strategy
for residual binarized weights. Those explorations and the developed
methodologies significantly contribute to rejuvenating the performance of
low-bit quantized LLMs and present substantial advancements in the field of
network binarization for LLMs. The code is available at
https://github.com/hahnyuan/BinaryLLM.
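To make the partial binarization described above concrete, here is a minimal PyTorch sketch, not the released PB-LLM implementation. It assumes a magnitude-based salience criterion, keeps the salient fraction in 8-bit, and binarizes the remaining weights with the closed-form scale alpha = mean(|w|), which minimizes ||w - alpha * sign(w)||^2 for the residual weights. The salience criterion, ratio, and 8-bit format are illustrative assumptions; see the linked repository for the actual method.
```python
# Minimal sketch of partial binarization (illustrative; not the official PB-LLM code).
import torch

def partially_binarize(w: torch.Tensor, salient_ratio: float = 0.05):
    """Keep the top `salient_ratio` weights (by magnitude) in 8-bit, binarize the rest."""
    k = max(1, int(salient_ratio * w.numel()))

    # 1. Filter salient weights; magnitude is used as the salience proxy here.
    salient_idx = torch.topk(w.abs().flatten(), k).indices
    salient_mask = torch.zeros(w.numel(), dtype=torch.bool, device=w.device)
    salient_mask[salient_idx] = True
    salient_mask = salient_mask.view_as(w)

    # 2. Store salient weights in higher precision (8-bit absmax quantization here).
    scale_8bit = w[salient_mask].abs().max() / 127.0
    salient_q = torch.round(w[salient_mask] / scale_8bit).clamp(-127, 127) * scale_8bit

    # 3. Binarize the residual weights with the optimal per-tensor scaling factor.
    residual = w[~salient_mask]
    alpha = residual.abs().mean()            # alpha = mean(|w|) minimizes the L2 binarization error
    residual_q = alpha * torch.sign(residual)

    w_q = torch.empty_like(w)
    w_q[salient_mask] = salient_q
    w_q[~salient_mask] = residual_q
    return w_q, salient_mask

# Example: partially binarize one linear layer's weight matrix.
w = torch.randn(4096, 4096)
w_q, mask = partially_binarize(w, salient_ratio=0.05)
print(f"binarized fraction: {(~mask).float().mean().item():.3f}, "
      f"MSE: {(w - w_q).pow(2).mean().item():.5f}")
```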
Related papers
- Scaling laws for post-training quantized large language models [41.78467383320145]
Generalization abilities of well-trained large language models (LLMs) are known to scale predictably as a function of model size.
The quality of LLMs after post-training compression remains highly unpredictable, often requiring case-by-case validation in practice.
arXiv Detail & Related papers (2024-10-15T23:34:22Z)
- LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices [41.17378536966264]
Low-Rank Quantization (LRQ) is a simple yet effective post-training weight quantization method for large language models.
Thanks to parameter sharing via its low-rank structure, LRQ only needs to learn significantly fewer parameters while still enabling individual scaling of weights.
We show the superiority of LRQ over prior LLM PTQ works under (i) 8-bit weight and per-tensor activation quantization, (ii) 4-bit weight and 8-bit per-token activation quantization, and (iii) low-bit weight-only quantization schemes (a rough sketch of the low-rank scaling idea follows this entry).
arXiv Detail & Related papers (2024-07-16T09:32:07Z)
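As a rough illustration of the low-rank weight-scaling idea summarized above, the scale for every weight can be produced by two small learnable factors, so only out_dim*rank + rank*in_dim parameters are trained instead of a full out_dim*in_dim scaling matrix. This is an assumption-laden sketch, not LRQ's actual algorithm: the rank, the simple weight-reconstruction objective, and the straight-through rounding are illustrative choices made only for brevity.
```python
# Illustrative low-rank weight-scaling sketch (assumptions only; not LRQ's exact method).
import torch

def ste_round(x: torch.Tensor) -> torch.Tensor:
    # Straight-through estimator: round on the forward pass, identity gradient backward.
    return (x.round() - x).detach() + x

def learn_lowrank_scales(w: torch.Tensor, rank: int = 8, n_bits: int = 4, steps: int = 200):
    out_dim, in_dim = w.shape
    qmax = 2 ** (n_bits - 1) - 1
    base = (w.abs().max() / qmax).item()          # per-tensor base step size
    # Low-rank factors: out_dim*rank + rank*in_dim learnable parameters in total.
    a = torch.ones(out_dim, rank, requires_grad=True)
    b = torch.full((rank, in_dim), 1.0 / rank, requires_grad=True)
    opt = torch.optim.Adam([a, b], lr=1e-3)
    for _ in range(steps):
        step = base * (a @ b)                     # an individual step size for every weight
        w_q = ste_round(w / step).clamp(-qmax - 1, qmax) * step
        loss = (w - w_q).pow(2).mean()            # toy weight-reconstruction objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (base * (a @ b)).detach()              # learned per-weight scales

scales = learn_lowrank_scales(torch.randn(512, 512))
```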
- SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [67.67135738642547]
Post-training quantization (PTQ) is a powerful compression technique that has been widely investigated for large language models (LLMs).
Existing PTQ methods are not ideal in terms of accuracy and efficiency, especially at bit-widths below 4 bits.
This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM.
arXiv Detail & Related papers (2024-05-23T16:21:48Z)
- WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers (a generic sketch of this conversion follows this entry).
arXiv Detail & Related papers (2024-02-19T11:33:21Z)
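To illustrate the low-bit integer conversion mentioned in the WKVQuant summary above, here is a generic sketch of asymmetric uniform quantization. This is a standard scheme, not the paper's specific method: a floating-point tensor is mapped to 8-bit unsigned integers plus a scale and zero point, and activations or KV-cache tensors can be treated the same way.
```python
# Generic asymmetric uniform quantization sketch (standard technique, not WKVQuant itself).
import torch

def quantize_uniform(x: torch.Tensor, n_bits: int = 8):
    """Map a float tensor to n_bits unsigned integers plus a scale and zero point."""
    qmin, qmax = 0, 2 ** n_bits - 1
    scale = (x.max() - x.min()).clamp(min=1e-8) / (qmax - qmin)
    zero_point = torch.round(-x.min() / scale)
    x_int = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return x_int.to(torch.uint8), scale, zero_point

def dequantize(x_int: torch.Tensor, scale: torch.Tensor, zero_point: torch.Tensor):
    return (x_int.float() - zero_point) * scale

# 8-bit integer storage is 4x smaller than fp32.
w = torch.randn(1024, 1024)
w_int, s, z = quantize_uniform(w)
print(f"max abs error: {(w - dequantize(w_int, s, z)).abs().max().item():.4f}")
```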
- BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves, for the first time, high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z)
- Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study [90.34226812493083]
This work investigates the impact of quantization on emergent abilities, which are important characteristics that distinguish LLMs from small language models.
Our empirical experiments show that these emergent abilities still exist in 4-bit quantization models, while 2-bit models encounter severe performance degradation.
To improve the performance of low-bit models, we conduct two special experiments: (1) fine-grained impact analysis that studies which components (or substructures) are more sensitive to quantization, and (2) performance compensation through model fine-tuning.
arXiv Detail & Related papers (2023-07-16T15:11:01Z)
- Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization [27.79783067245817]
Large language models (LLMs) face challenges in fine-tuning and deployment due to their high memory demands and computational costs.
This paper presents Parameter-Efficient and Quantization-aware Adaptation (PEQA), a simple yet effective method that combines the advantages of parameter-efficient fine-tuning (PEFT) with quantized LLMs.
arXiv Detail & Related papers (2023-05-23T15:20:01Z)
- Mixed Precision Low-bit Quantization of Neural Network Language Models for Speech Recognition [67.95996816744251]
State-of-the-art language models (LMs), represented by long short-term memory recurrent neural networks (LSTM-RNNs) and Transformers, are becoming increasingly complex and expensive for practical applications.
Current quantization methods are based on uniform precision and fail to account for the varying performance sensitivity at different parts of LMs to quantization errors.
Novel mixed precision neural network LM quantization methods are proposed in this paper.
arXiv Detail & Related papers (2021-11-29T12:24:02Z)
- Low-bit Quantization of Recurrent Neural Network Language Models Using Alternating Direction Methods of Multipliers [67.688697838109]
This paper presents a novel method to train quantized RNNLMs from scratch using the alternating direction method of multipliers (ADMM).
Experiments on two tasks suggest the proposed ADMM quantization achieved a model size compression factor of up to 31 times over the full precision baseline RNNLMs.
arXiv Detail & Related papers (2021-11-29T09:30:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.