BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via
Self-Distillation
- URL: http://arxiv.org/abs/2402.10631v1
- Date: Fri, 16 Feb 2024 12:27:15 GMT
- Title: BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via
Self-Distillation
- Authors: Dayou Du, Yijia Zhang, Shijie Cao, Jiaqi Guo, Ting Cao, Xiaowen Chu,
Ningyi Xu
- Abstract summary: BitDistiller is a framework that synergizes Quantization-Aware Training (QAT) with Knowledge Distillation (KD) to boost the performance of Large Language Models (LLMs)
Specifically, BitDistiller first incorporates a tailored asymmetric quantization and clipping technique to maximally preserve the fidelity of quantized weights, and then proposes a novel Confidence-Aware Kullback-Leibler Divergence (CAKLD) objective.
Empirical evaluations demonstrate that BitDistiller significantly surpasses existing methods in both 3-bit and 2-bit configurations on general language understanding and complex reasoning benchmarks.
- Score: 13.262366437264188
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The upscaling of Large Language Models (LLMs) has yielded impressive advances
in natural language processing, yet it also poses significant deployment
challenges. Weight quantization has emerged as a widely embraced solution to
reduce memory and computational demands. This paper introduces BitDistiller, a
framework that synergizes Quantization-Aware Training (QAT) with Knowledge
Distillation (KD) to boost the performance of LLMs at ultra-low precisions
(sub-4-bit). Specifically, BitDistiller first incorporates a tailored
asymmetric quantization and clipping technique to maximally preserve the
fidelity of quantized weights, and then proposes a novel Confidence-Aware
Kullback-Leibler Divergence (CAKLD) objective, which is employed in a
self-distillation manner to enable faster convergence and superior model
performance. Empirical evaluations demonstrate that BitDistiller significantly
surpasses existing methods in both 3-bit and 2-bit configurations on general
language understanding and complex reasoning benchmarks. Notably, BitDistiller
is shown to be more cost-effective, demanding fewer data and training
resources. The code is available at https://github.com/DD-DuDa/BitDistiller.
Related papers
- BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration [7.774285511386959]
Large language models (LLMs) have demonstrated remarkable performance across various machine learning tasks.
Yet the substantial memory footprint of LLMs significantly hinders their deployment.
We improve the accessibility of LLMs through BitMoD, an algorithm- hardware co-design solution.
arXiv Detail & Related papers (2024-11-18T17:16:58Z) - Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization [0.6445087473595953]
Large language models (LLMs) demonstrate outstanding performance in various tasks in machine learning.
deploying LLM inference poses challenges due to the high compute and memory requirements.
We present Tender, an algorithm-hardware co-design solution that enables efficient deployment of LLM inference at low precision.
arXiv Detail & Related papers (2024-06-16T09:51:55Z) - LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit [55.73370804397226]
Quantization, a key compression technique, can effectively mitigate these demands by compressing and accelerating large language models.
We present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization.
Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats.
arXiv Detail & Related papers (2024-05-09T11:49:05Z) - Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens [15.566726645722657]
We propose a novel framework specifically designed for speculative sampling.
Within this framework, we introduce a lightweight draft model that effectively utilizes previously generated tokens to predict subsequent words.
We demonstrate impressive results, achieving an average latency speedup ratio of 2.7x compared to the vanilla auto-regressive decoding approach.
arXiv Detail & Related papers (2024-02-24T08:10:39Z) - DB-LLM: Accurate Dual-Binarization for Efficient LLMs [83.70686728471547]
Large language models (LLMs) have significantly advanced the field of natural language processing.
Existing ultra-low-bit quantization always causes severe accuracy drops.
We propose a novel Dual-Binarization method for LLMs, namely DB-LLM.
arXiv Detail & Related papers (2024-02-19T09:04:30Z) - BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves for the first time high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLMs families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z) - Compressing LLMs: The Truth is Rarely Pure and Never Simple [90.05366363633568]
Knowledge-Intensive Compressed LLM BenchmarK aims to redefine the evaluation protocol for compressed Large Language Models.
LLM-KICK unveils many favorable merits and unfortunate plights of current SoTA compression methods.
LLM-KICK is designed to holistically access compressed LLMs' ability for language understanding, reasoning, generation, in-context retrieval, in-context summarization, etc.
arXiv Detail & Related papers (2023-10-02T17:42:37Z) - SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
Main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z) - CodeGen2: Lessons for Training LLMs on Programming and Natural Languages [116.74407069443895]
We unify encoder and decoder-based models into a single prefix-LM.
For learning methods, we explore the claim of a "free lunch" hypothesis.
For data distributions, the effect of a mixture distribution and multi-epoch training of programming and natural languages on model performance is explored.
arXiv Detail & Related papers (2023-05-03T17:55:25Z) - KDLSQ-BERT: A Quantized Bert Combining Knowledge Distillation with
Learned Step Size Quantization [1.9786767260073905]
transformer-based language models such as BERT have shown tremendous performance improvement for a range of natural language processing tasks.
We propose a novel quantization method named KDLSQ-BERT that combines knowledge distillation (KD) with learned step size quantization (LSQ) for language model quantization.
arXiv Detail & Related papers (2021-01-15T02:21:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.