DB-LLM: Accurate Dual-Binarization for Efficient LLMs
- URL: http://arxiv.org/abs/2402.11960v1
- Date: Mon, 19 Feb 2024 09:04:30 GMT
- Title: DB-LLM: Accurate Dual-Binarization for Efficient LLMs
- Authors: Hong Chen, Chengtao Lv, Liang Ding, Haotong Qin, Xiabin Zhou, Yifu Ding, Xuebo Liu, Min Zhang, Jinyang Guo, Xianglong Liu, Dacheng Tao
- Abstract summary: Large language models (LLMs) have significantly advanced the field of natural language processing.
Existing ultra-low-bit quantization always causes severe accuracy drops.
We propose a novel Dual-Binarization method for LLMs, namely DB-LLM.
- Score: 83.70686728471547
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have significantly advanced the field of natural
language processing, while the expensive memory and computation consumption
impede their practical deployment. Quantization emerges as one of the most
effective methods for improving the computational efficiency of LLMs. However,
existing ultra-low-bit quantization always causes severe accuracy drops. In
this paper, we empirically reveal the micro and macro characteristics of
ultra-low bit quantization and present a novel Dual-Binarization method for
LLMs, namely DB-LLM. For the micro-level, we take both the accuracy advantage
of 2-bit-width and the efficiency advantage of binarization into account,
introducing Flexible Dual Binarization (FDB). By splitting 2-bit quantized
weights into two independent sets of binaries, FDB ensures the accuracy of
representations and introduces flexibility, utilizing the efficient bitwise
operations of binarization while retaining the inherent high sparsity of
ultra-low bit quantization. For the macro-level, we identify a distortion in the
predictions of the quantized LLM, which manifests as deviations related to the
ambiguity of samples. We propose the Deviation-Aware
Distillation (DAD) method, enabling the model to focus differently on various
samples. Comprehensive experiments show that our DB-LLM not only significantly
surpasses the current state-of-the-art (SoTA) in ultra-low bit quantization
(e.g., perplexity decreased from 9.64 to 7.23), but also achieves an additional
20% reduction in computational consumption compared to the SoTA method under
the same bit-width. Our code will be released soon.
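The abstract outlines two mechanisms: at the micro level, FDB decomposes each 2-bit weight into two scaled binary components; at the macro level, DAD reweights the distillation loss according to sample ambiguity. The sketch below illustrates one plausible realization in PyTorch; the per-channel scales, the residual-based second binarization, and the entropy-based ambiguity proxy are assumptions made for illustration, not the authors' released implementation.

```python
# Hedged sketch of the two ideas in the abstract (FDB and DAD).
# Shapes, names, and the entropy-based weighting are illustrative assumptions.
import torch
import torch.nn.functional as F


def flexible_dual_binarize(w: torch.Tensor, alpha1: torch.Tensor, alpha2: torch.Tensor) -> torch.Tensor:
    """Approximate W as alpha1 * B1 + alpha2 * B2 with B1, B2 in {-1, +1}.

    A generic dual-binary decomposition; the exact FDB parameterization in the paper may differ."""
    b1 = torch.sign(w)                # first binary set captures the sign pattern
    residual = w - alpha1 * b1        # what the first scaled binary misses
    b2 = torch.sign(residual)         # second binary set refines the residual
    return alpha1 * b1 + alpha2 * b2


def deviation_aware_distill_loss(student_logits: torch.Tensor,
                                 teacher_logits: torch.Tensor,
                                 tau: float = 1.0) -> torch.Tensor:
    """Distillation loss that focuses differently on samples, weighting each one
    by the teacher's predictive entropy as a proxy for its ambiguity (an assumption)."""
    t_prob = F.softmax(teacher_logits / tau, dim=-1)
    s_logp = F.log_softmax(student_logits / tau, dim=-1)
    kl = F.kl_div(s_logp, t_prob, reduction="none").sum(dim=-1)      # per-sample KL
    entropy = -(t_prob * t_prob.clamp_min(1e-9).log()).sum(dim=-1)   # teacher ambiguity
    weights = entropy / entropy.mean().clamp_min(1e-9)               # emphasize ambiguous samples
    return (weights * kl).mean()


# Toy usage: dual-binarize a random weight matrix and check the reconstruction error.
w = torch.randn(512, 512)
alpha1 = w.abs().mean(dim=1, keepdim=True)                           # per-row scale guesses
alpha2 = (w - alpha1 * torch.sign(w)).abs().mean(dim=1, keepdim=True)
w_q = flexible_dual_binarize(w, alpha1, alpha2)
print("mean |W - W_q|:", (w - w_q).abs().mean().item())
```

Because both components take values in {-1, +1}, they could in principle be packed as bit matrices and combined with bitwise kernels at inference time, which is consistent with the abstract's point about exploiting the efficient bitwise operations of binarization.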
Related papers
- Progressive Mixed-Precision Decoding for Efficient LLM Inference [49.05448842542558]
We introduce Progressive Mixed-Precision Decoding (PMPD) to address the memory-boundedness of decoding.
PMPD achieves a 1.4-12.2× speedup in matrix-vector multiplications over fp16 models.
Our approach delivers a throughput gain of 3.8-8.0× over fp16 models and up to 1.54× over uniform quantization approaches.
arXiv Detail & Related papers (2024-10-17T11:46:33Z)
- STBLLM: Breaking the 1-Bit Barrier with Structured Binary LLMs [28.70239743254508]
We present the first structural binarization method for LLM compression to less than 1-bit precision.
We observe that some weights in binarized LLMs can be randomly flipped without significant performance degradation.
Our approach performs better than other compressed binarization methods while significantly reducing memory requirements.
arXiv Detail & Related papers (2024-08-03T15:07:44Z)
- TernaryLLM: Ternarized Large Language Model [29.29122031050894]
Large language models (LLMs) have achieved remarkable performance on Natural Language Processing (NLP) tasks.
We introduce Dual Learnable Ternarization (DLT), which enables both scales and shifts to be learnable (a brief sketch of this idea appears after this list).
We also propose Outlier-Friendly Feature Knowledge Distillation (OFF) to recover the information lost in extremely low-bit quantization.
arXiv Detail & Related papers (2024-06-11T11:40:12Z)
- BiSup: Bidirectional Quantization Error Suppression for Large Language Models [13.042992673384466]
We introduce BiSup, a Bi-directional quantization error Suppression method.
We show that BiSup can improve performance over two state-of-the-art methods.
arXiv Detail & Related papers (2024-05-24T08:39:27Z)
- SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [67.67135738642547]
Post-training quantization (PTQ) is a powerful compression technique investigated for large language models (LLMs).
Existing PTQ methods are not ideal in terms of accuracy and efficiency, especially at bit-widths below 4.
This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM.
arXiv Detail & Related papers (2024-05-23T16:21:48Z)
- BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
For the first time, it achieves high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z)
- Mixed Precision Low-bit Quantization of Neural Network Language Models for Speech Recognition [67.95996816744251]
State-of-the-art language models (LMs) represented by long short-term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming increasingly complex and expensive for practical applications.
Current quantization methods are based on uniform precision and fail to account for the varying performance sensitivity of different parts of LMs to quantization errors.
Novel mixed precision neural network LM quantization methods are proposed in this paper.
arXiv Detail & Related papers (2021-11-29T12:24:02Z)
- Mixed Precision Quantization of Transformer Language Models for Speech Recognition [67.95996816744251]
State-of-the-art neural language models represented by Transformers are becoming increasingly complex and expensive for practical applications.
Current low-bit quantization methods are based on uniform precision and fail to account for the varying performance sensitivity of different parts of the system to quantization errors.
The optimal local precision settings are automatically learned using two techniques.
Experiments were conducted on Penn Treebank (PTB) and on an LF-MMI TDNN system trained on the Switchboard corpus.
arXiv Detail & Related papers (2021-11-29T09:57:00Z)
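As a point of comparison with the dual-binarization above, the Dual Learnable Ternarization summarized in the TernaryLLM entry makes both the scale and the shift of the ternary weights learnable. The following is a minimal, hypothetical sketch of a linear layer with that property; the magnitude threshold, the per-channel parameterization, and the straight-through estimator are illustrative assumptions, not that paper's implementation.

```python
# Hedged sketch of ternarization with a learnable scale and shift, in the
# spirit of the Dual Learnable Ternarization (DLT) summary above.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LearnableTernaryLinear(nn.Module):
    """Linear layer whose weights are ternarized to {-1, 0, +1} codes,
    then rescaled and shifted by learnable per-output-channel parameters."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.scale = nn.Parameter(self.weight.detach().abs().mean(dim=1, keepdim=True))
        self.shift = nn.Parameter(torch.zeros(out_features, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Ternary codes in {-1, 0, +1}: keep only large-magnitude weights (threshold is a guess).
        thr = 0.7 * self.weight.abs().mean(dim=1, keepdim=True)
        codes = torch.sign(self.weight) * (self.weight.abs() > thr).float()
        # Straight-through estimator: use the codes in the forward pass but let
        # gradients flow to the latent full-precision weights as if it were identity.
        codes = self.weight + (codes - self.weight).detach()
        w_q = self.scale * codes + self.shift   # learnable scale and shift stay differentiable
        return F.linear(x, w_q)


# Toy usage
layer = LearnableTernaryLinear(16, 8)
out = layer(torch.randn(2, 16))
print(out.shape)  # torch.Size([2, 8])
```

Training such a layer end to end would let both the scale and the shift adapt to the weight distribution, which is the property the DLT summary highlights.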