DB-LLM: Accurate Dual-Binarization for Efficient LLMs
- URL: http://arxiv.org/abs/2402.11960v1
- Date: Mon, 19 Feb 2024 09:04:30 GMT
- Title: DB-LLM: Accurate Dual-Binarization for Efficient LLMs
- Authors: Hong Chen, Chengtao Lv, Liang Ding, Haotong Qin, Xiabin Zhou, Yifu
Ding, Xuebo Liu, Min Zhang, Jinyang Guo, Xianglong Liu, Dacheng Tao
- Abstract summary: Large language models (LLMs) have significantly advanced the field of natural language processing.
Existing ultra-low-bit quantization always causes severe accuracy drops.
We propose a novel Dual-Binarization method for LLMs, namely DB-LLM.
- Score: 83.70686728471547
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have significantly advanced the field of natural
language processing, while the expensive memory and computation consumption
impede their practical deployment. Quantization emerges as one of the most
effective methods for improving the computational efficiency of LLMs. However,
existing ultra-low-bit quantization always causes severe accuracy drops. In
this paper, we empirically relieve the micro and macro characteristics of
ultra-low bit quantization and present a novel Dual-Binarization method for
LLMs, namely DB-LLM. For the micro-level, we take both the accuracy advantage
of 2-bit-width and the efficiency advantage of binarization into account,
introducing Flexible Dual Binarization (FDB). By splitting 2-bit quantized
weights into two independent sets of binaries, FDB ensures the accuracy of
representations and introduces flexibility, utilizing the efficient bitwise
operations of binarization while retaining the inherent high sparsity of
ultra-low bit quantization. For the macro-level, we find the distortion that
exists in the prediction of LLM after quantization, which is specified as the
deviations related to the ambiguity of samples. We propose the Deviation-Aware
Distillation (DAD) method, enabling the model to focus differently on various
samples. Comprehensive experiments show that our DB-LLM not only significantly
surpasses the current State-of-The-Art (SoTA) in ultra-low bit quantization
(eg, perplexity decreased from 9.64 to 7.23), but also achieves an additional
20\% reduction in computational consumption compared to the SOTA method under
the same bit-width. Our code will be released soon.
Related papers
- TernaryLLM: Ternarized Large Language Model [29.29122031050894]
Large language models (LLMs) have achieved remarkable performance on Natural Language Processing (NLP) tasks.
We introduce Dual Learnable Ternarization (DLT), which enables both scales and shifts to be learnable.
We also propose Outlier-Friendly Feature Knowledge Distillation (OFF) to recover the information lost in extremely low-bit quantization.
arXiv Detail & Related papers (2024-06-11T11:40:12Z) - BiSup: Bidirectional Quantization Error Suppression for Large Language Models [13.042992673384466]
We introduce BiSup, a Bi-directional quantization error Suppression method.
We show that BiSup can improve performance over two state-of-the-art methods.
arXiv Detail & Related papers (2024-05-24T08:39:27Z) - SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [67.67135738642547]
Post-training quantization (PTQ) is a powerful compression technique investigated in large language models (LLMs)
Existing PTQ methods are not ideal in terms of accuracy and efficiency, especially with below 4 bit-widths.
This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM.
arXiv Detail & Related papers (2024-05-23T16:21:48Z) - Characterizing the Accuracy - Efficiency Trade-off of Low-rank Decomposition in Language Models [1.530997923234786]
Large language models (LLMs) have emerged and presented their general problem-solving capabilities with one model.
We formalize the low-rank decomposition design space and show that the decomposition design space is enormous.
Results show that we can achieve a 9% model size reduction with minimal accuracy drops.
arXiv Detail & Related papers (2024-05-10T17:40:02Z) - OneBit: Towards Extremely Low-bit Large Language Models [66.29839811207617]
This paper boldly quantizes the weight matrices of LLMs to 1-bit, paving the way for the extremely low bit-width deployment of LLMs.
Experiments indicate that OneBit achieves good performance (at least 81% of the non-quantized performance on LLaMA models) with robust training processes.
arXiv Detail & Related papers (2024-02-17T14:26:57Z) - BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves for the first time high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLMs families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z) - Mixed Precision Low-bit Quantization of Neural Network Language Models
for Speech Recognition [67.95996816744251]
State-of-the-art language models (LMs) represented by long-short term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming increasingly complex and expensive for practical applications.
Current quantization methods are based on uniform precision and fail to account for the varying performance sensitivity at different parts of LMs to quantization errors.
Novel mixed precision neural network LM quantization methods are proposed in this paper.
arXiv Detail & Related papers (2021-11-29T12:24:02Z) - Mixed Precision of Quantization of Transformer Language Models for
Speech Recognition [67.95996816744251]
State-of-the-art neural language models represented by Transformers are becoming increasingly complex and expensive for practical applications.
Current low-bit quantization methods are based on uniform precision and fail to account for the varying performance sensitivity at different parts of the system to quantization errors.
The optimal local precision settings are automatically learned using two techniques.
Experiments conducted on Penn Treebank (PTB) and a Switchboard corpus trained LF-MMI TDNN system.
arXiv Detail & Related papers (2021-11-29T09:57:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.