PB-LLM: Partially Binarized Large Language Models
- URL: http://arxiv.org/abs/2310.00034v2
- Date: Tue, 7 Nov 2023 20:41:22 GMT
- Title: PB-LLM: Partially Binarized Large Language Models
- Authors: Yuzhang Shang, Zhihang Yuan, Qiang Wu, Zhen Dong
- Abstract summary: This paper explores network binarization, compressing model weights to a single bit, specifically for Large Language Models (LLMs) compression.
We propose a novel approach, Partially-Binarized LLM (PB-LLM), which can achieve extreme low-bit quantization while maintaining the linguistic reasoning capacity of quantized LLMs.
- Score: 14.244537605866864
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper explores network binarization, a radical form of quantization,
compressing model weights to a single bit, specifically for Large Language
Models (LLMs) compression. Because previous binarization methods collapse
LLMs, we propose a novel approach, Partially-Binarized LLM (PB-LLM), which can
achieve extreme low-bit quantization while maintaining the linguistic reasoning
capacity of quantized LLMs. Specifically, our exploration first uncovers the
ineffectiveness of naive applications of existing binarization algorithms and
highlights the imperative role of salient weights in achieving low-bit
quantization. Thus, PB-LLM filters out a small ratio of salient weights during
binarization, allocating them to higher-bit storage, i.e., partial
binarization. PB-LLM is further extended to recover the capacities of
quantized LLMs, analyzed from the perspectives of post-training quantization
(PTQ) and quantization-aware training (QAT). Under PTQ, drawing on the concepts
of GPTQ, we reconstruct the binarized weight matrix guided by the Hessian
matrix and successfully recover the reasoning capacity of PB-LLM at low bit-widths.
Under QAT, we freeze the salient weights during training, explore the
derivation of optimal scaling factors crucial for minimizing the quantization
error, and propose a scaling mechanism based on this derived scaling strategy
for residual binarized weights. These explorations and the developed
methodologies significantly contribute to rejuvenating the performance of
low-bit quantized LLMs and represent substantial advancements in the field of
network binarization for LLMs. The code is available at
https://github.com/hahnyuan/BinaryLLM.
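The core recipe above admits a compact illustration. Below is a minimal, self-contained PyTorch sketch of partial binarization, not the authors' implementation (see the linked repository for that): it selects a small fraction of salient weights by magnitude, keeps them at higher precision, and binarizes the remaining weights with a per-row scaling factor alpha = mean(|w|), which minimizes the L2 error of sign-based binarization. The 10% salient ratio, the magnitude-based salience criterion, and the full-precision storage of salient weights are illustrative assumptions rather than the paper's exact recipe.

```python
import torch


def partially_binarize(w: torch.Tensor, salient_ratio: float = 0.1):
    """Binarize a weight matrix except for a small fraction of salient weights.

    Returns the dequantized reconstruction and the salient-weight mask.
    """
    # 1. Select the salient weights; here, simply the largest-magnitude entries.
    k = max(1, int(salient_ratio * w.numel()))
    threshold = w.abs().flatten().kthvalue(w.numel() - k).values
    salient_mask = w.abs() > threshold

    # 2. Binarize the remaining (residual) weights with a per-row scaling
    #    factor alpha = mean(|w|), the optimal scale for sign-based binarization.
    residual = w * (~salient_mask)
    counts = (~salient_mask).sum(dim=1, keepdim=True).clamp(min=1)
    alpha = residual.abs().sum(dim=1, keepdim=True) / counts
    binarized = alpha * torch.sign(residual)

    # 3. Keep the salient weights at higher precision (full precision here for
    #    simplicity; lower-bit storage such as 8-bit is used in practice).
    return binarized + w * salient_mask, salient_mask


if __name__ == "__main__":
    w = torch.randn(1024, 1024)
    w_q, mask = partially_binarize(w, salient_ratio=0.1)
    print(f"salient fraction: {mask.float().mean():.3f}, "
          f"reconstruction MSE: {(w - w_q).pow(2).mean():.5f}")
```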
Related papers
- PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models [64.84734437930362]
Large Language Models (LLMs) suffer severe performance degradation when facing extremely low-bit (sub-2-bit) quantization.
We propose an extremely low-bit PTQ method called PTQ1.61, which enables weight quantization to 1.61-bit for the first time.
Experiments indicate our PTQ1.61 achieves state-of-the-art performance in extremely low-bit quantization.
arXiv Detail & Related papers (2025-02-18T08:04:58Z)
- RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [95.32315448601241]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE).
RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers.
Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z)
- Scaling Laws for Post Training Quantized Large Language Models [41.78467383320145]
Generalization abilities of well-trained large language models (LLMs) are known to scale predictably as a function of model size.
The quality of LLMs after post-training compression remains highly unpredictable, often requiring case-by-case validation in practice.
arXiv Detail & Related papers (2024-10-15T23:34:22Z)
- LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices [41.17378536966264]
Low-Rank Quantization (LRQ) reconstructs the outputs of an intermediate Transformer block by leveraging low-rank weight-scaling matrices.
Thanks to parameter sharing via the low-rank structure, LRQ only needs to learn significantly fewer parameters while enabling the individual scaling of weights (a brief illustrative sketch of this low-rank scaling idea appears after this list).
We show the superiority of LRQ over prior LLM PTQ works under (i) 8-bit weight and per-tensor activation quantization, (ii) 4-bit weight and 8-bit per-token activation quantization, and (iii) low-bit weight-only quantization schemes.
arXiv Detail & Related papers (2024-07-16T09:32:07Z)
- SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [67.67135738642547]
Post-training quantization (PTQ) is a powerful compression technique widely investigated for large language models (LLMs).
Existing PTQ methods, however, are not ideal in terms of accuracy and efficiency, especially at bit-widths below 4.
This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM.
arXiv Detail & Related papers (2024-05-23T16:21:48Z)
- WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
arXiv Detail & Related papers (2024-02-19T11:33:21Z)
- BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves, for the first time, high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z)
- Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study [90.34226812493083]
This work aims to investigate the impact of quantization on emergent abilities, which are important characteristics that distinguish LLMs from small language models.
Our empirical experiments show that these emergent abilities still exist in 4-bit quantization models, while 2-bit models encounter severe performance degradation.
To improve the performance of low-bit models, we conduct two special experiments: (1) a fine-grained impact analysis that studies which components (or substructures) are more sensitive to quantization, and (2) performance compensation through model fine-tuning.
arXiv Detail & Related papers (2023-07-16T15:11:01Z)
- Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization [27.79783067245817]
Large language models (LLMs) face challenges in fine-tuning and deployment due to their high memory demands and computational costs.
This paper presents Parameter-Efficient and Quantization-aware Adaptation (PEQA), a simple yet effective method that combines the advantages of parameter-efficient fine-tuning (PEFT) with quantized LLMs.
arXiv Detail & Related papers (2023-05-23T15:20:01Z)
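As referenced in the LRQ entry above, the following is a minimal, hypothetical sketch, not LRQ's actual implementation, of the general low-rank weight-scaling idea: the per-weight scaling matrix is parameterized as the product of two small factors, so only (out_dim + in_dim) * rank parameters would need to be learned instead of one scale per weight. The rank, the 4-bit round-to-nearest quantizer, and the identity initialization are illustrative assumptions.

```python
import torch


def lowrank_scaled_quantize(w: torch.Tensor, rank: int = 8, n_bits: int = 4):
    """Quantize weights under a low-rank-parameterized per-weight scaling matrix."""
    out_dim, in_dim = w.shape

    # Low-rank factors parameterizing the scaling matrix S = A @ B.
    # In a real method, A and B would be learned by minimizing the error of the
    # quantized block's outputs; here they are initialized so that S == 1.
    a = torch.full((out_dim, rank), 1.0 / rank)
    b = torch.ones(rank, in_dim)
    scale = a @ b  # (out_dim, in_dim), yet only (out_dim + in_dim) * rank parameters

    # Simple symmetric round-to-nearest quantization of the scaled weights.
    w_scaled = w * scale
    qmax = 2 ** (n_bits - 1) - 1
    step = (w_scaled.abs().amax(dim=1, keepdim=True) / qmax).clamp(min=1e-8)
    w_int = torch.clamp(torch.round(w_scaled / step), -qmax, qmax)

    # Dequantize and undo the scaling to approximate the original weights.
    return (w_int * step) / scale
```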