KDLSQ-BERT: A Quantized Bert Combining Knowledge Distillation with
Learned Step Size Quantization
- URL: http://arxiv.org/abs/2101.05938v1
- Date: Fri, 15 Jan 2021 02:21:28 GMT
- Title: KDLSQ-BERT: A Quantized Bert Combining Knowledge Distillation with
Learned Step Size Quantization
- Authors: Jing Jin, Cai Liang, Tiancheng Wu, Liqin Zou, Zhiliang Gan
- Abstract summary: Transformer-based language models such as BERT have shown tremendous performance improvements for a range of natural language processing tasks.
We propose a novel quantization method named KDLSQ-BERT that combines knowledge distillation (KD) with learned step size quantization (LSQ) for language model quantization.
- Score: 1.9786767260073905
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recently, transformer-based language models such as BERT have shown
tremendous performance improvements for a range of natural language processing
tasks. However, these language models are usually computationally expensive and
memory intensive during inference. As a result, it is difficult to deploy them
on resource-restricted devices. To improve inference performance and reduce the
model size while maintaining accuracy, we propose a novel quantization method
named KDLSQ-BERT that combines knowledge distillation (KD) with learned step
size quantization (LSQ) for language model quantization. The main idea of our
method is to leverage KD to transfer knowledge from a "teacher" model to a
"student" model while LSQ quantizes that "student" model during quantization
training. Extensive experimental results on the GLUE benchmark and SQuAD
demonstrate that our proposed KDLSQ-BERT not only performs effectively across
different bit widths (e.g. 2-bit $\sim$ 8-bit), but also outperforms existing
BERT quantization methods, and even achieves performance comparable to the
full-precision baseline model while obtaining a 14.9x compression ratio. Our
code will be publicly available.
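To make the training objective concrete, below is a minimal sketch, assuming a PyTorch-style implementation, of an LSQ fake-quantizer with a learned step size plus a logit-level KD loss between a full-precision teacher and the quantized student. The step-size initialization and gradient scaling follow the original LSQ formulation; the exact distillation terms used in KDLSQ-BERT (e.g. hidden-state or attention distillation) and the names below are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of the KDLSQ-BERT training idea: an LSQ fake-quantizer with a
# learned step size applied to the student, trained jointly with a knowledge-
# distillation loss against a full-precision teacher. Illustrative only.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class LSQQuantizer(nn.Module):
    """Symmetric LSQ fake-quantizer with a learned step size."""

    def __init__(self, num_bits: int = 8):
        super().__init__()
        self.qn = -(2 ** (num_bits - 1))      # e.g. -128 for 8-bit
        self.qp = 2 ** (num_bits - 1) - 1     # e.g. +127 for 8-bit
        self.step = nn.Parameter(torch.tensor(1.0))
        self.initialized = False

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.initialized:
            # LSQ step-size initialization heuristic: 2 * mean(|x|) / sqrt(Qp).
            self.step.data = 2 * x.detach().abs().mean() / math.sqrt(self.qp)
            self.initialized = True
        # Gradient scale from the LSQ paper keeps step-size updates well-behaved.
        g = 1.0 / math.sqrt(x.numel() * self.qp)
        step = self.step * g + (self.step - self.step * g).detach()
        # Straight-through estimator: round only in the forward pass.
        q = torch.clamp(x / step, self.qn, self.qp)
        q = q + (q.round() - q).detach()
        return q * step                        # de-quantized ("fake-quantized") tensor


def kd_lsq_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.7):
    """Soft KD term (teacher -> student) combined with the hard-label task loss."""
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    task = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * task
```

In a quantization-aware training loop, LSQQuantizer would wrap the student's weights (and optionally activations), and kd_lsq_loss would be backpropagated through both the network weights and the learned step sizes.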
Related papers
- GWQ: Gradient-Aware Weight Quantization for Large Language Models [61.17678373122165]
Gradient-aware weight quantization (GWQ) is the first low-bit weight quantization approach that leverages gradients to localize outliers.
GWQ preferentially retains the weights corresponding to the top 1% of outliers at FP16 precision, while the remaining non-outlier weights are stored in a low-bit format; see the sketch after this list.
On zero-shot tasks, GWQ-quantized models achieve higher accuracy than other quantization methods.
arXiv Detail & Related papers (2024-10-30T11:16:04Z) - EfficientQAT: Efficient Quantization-Aware Training for Large Language Models [50.525259103219256]
Quantization-aware training (QAT) offers a solution by reducing memory consumption through low-bit representations with minimal accuracy loss.
We propose Efficient Quantization-Aware Training (EfficientQAT), a more feasible QAT algorithm.
EfficientQAT involves two consecutive phases: block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP).
arXiv Detail & Related papers (2024-07-10T17:53:30Z) - The Languini Kitchen: Enabling Language Modelling Research at Different
Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z) - Norm Tweaking: High-performance Low-bit Quantization of Large Language
Models [21.855106896725598]
We introduce a technique called norm tweaking, which can be used as a plugin in current post-training quantization (PTQ) methods to achieve high precision.
Our method demonstrates significant improvements in both weight-only quantization and joint quantization of weights and activations.
Our simple and effective approach makes it more practical for real-world applications.
arXiv Detail & Related papers (2023-09-06T06:51:15Z) - Self-Distilled Quantization: Achieving High Compression Rates in
Transformer-Based Language Models [6.936564049727831]
We present a new method called self-distilled quantization (SDQ) that minimizes accumulative quantization errors and outperforms baselines.
We apply SDQ to multilingual models XLM-R-Base and InfoXLM-Base and demonstrate that both models can be reduced from 32-bit floating point weights to 8-bit integer weights.
arXiv Detail & Related papers (2023-07-12T07:38:24Z) - LLM-QAT: Data-Free Quantization Aware Training for Large Language Models [38.76165207636793]
We propose a data-free distillation method that leverages generations produced by the pre-trained model.
In addition to quantizing weights and activations, we also quantize the KV cache, which is critical for increasing throughput.
We experiment with LLaMA models of sizes 7B, 13B, and 30B, at quantization levels down to 4-bits.
arXiv Detail & Related papers (2023-05-29T05:22:11Z) - RAND: Robustness Aware Norm Decay For Quantized Seq2seq Models [14.07649230604283]
We propose low complexity changes to the quantization aware training (QAT) process to improve model accuracy.
With the improved accuracy, it becomes possible to exploit some of the other benefits of noise-based QAT.
arXiv Detail & Related papers (2023-05-24T19:45:56Z) - MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided
Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z) - Automatic Mixed-Precision Quantization Search of BERT [62.65905462141319]
Pre-trained language models such as BERT have shown remarkable effectiveness in various natural language processing tasks.
These models usually contain millions of parameters, which prevents them from practical deployment on resource-constrained devices.
We propose an automatic mixed-precision quantization framework designed for BERT that can simultaneously conduct quantization and pruning at a subgroup-wise level.
arXiv Detail & Related papers (2021-12-30T06:32:47Z) - An Investigation on Different Underlying Quantization Schemes for
Pre-trained Language Models [33.49417100179159]
We implement k-means quantization and compare its performance with linear quantization for fixed-precision quantization of BERT.
We also compare the two quantization schemes on ALBERT models to explore the robustness differences between different pre-trained models.
arXiv Detail & Related papers (2020-10-14T14:05:06Z) - The Right Tool for the Job: Matching Model and Instance Complexities [62.95183777679024]
As NLP models become larger, executing a trained model requires significant computational resources, incurring monetary and environmental costs.
We propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) "exit" from neural network calculations for simple instances.
We test our proposed modification on five different datasets in two tasks: three text classification datasets and two natural language inference benchmarks.
arXiv Detail & Related papers (2020-04-16T04:28:08Z)
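As noted in the GWQ entry above, the sketch below illustrates the outlier-splitting idea in that summary: weights flagged as outliers (here, by gradient magnitude) are kept at FP16, while the remaining weights are fake-quantized to a low bit-width. The thresholding, per-tensor scaling, and function name are simplifying assumptions rather than the GWQ authors' implementation.

```python
# Minimal sketch of gradient-guided outlier splitting for weight quantization.
# Top ~1% of weights (by gradient magnitude) are kept at FP16; the rest are
# fake-quantized with a symmetric per-tensor low-bit scheme. Illustrative only.
import torch


def split_and_quantize(weight: torch.Tensor,
                       grad: torch.Tensor,
                       num_bits: int = 4,
                       outlier_frac: float = 0.01) -> torch.Tensor:
    """Return a mixed-precision reconstruction of `weight`."""
    # Flag roughly the top 1% of weights by gradient magnitude as outliers.
    flat_grad = grad.abs().flatten()
    k = max(1, int(outlier_frac * flat_grad.numel()))
    threshold = torch.topk(flat_grad, k).values.min()
    outlier_mask = grad.abs() >= threshold

    # Symmetric per-tensor quantization for the remaining (non-outlier) weights.
    qmax = 2 ** (num_bits - 1) - 1
    inliers = weight[~outlier_mask]
    scale = inliers.abs().max().clamp(min=1e-8) / qmax
    quantized = torch.clamp((inliers / scale).round(), -qmax - 1, qmax) * scale

    mixed = torch.empty_like(weight)
    mixed[outlier_mask] = weight[outlier_mask].half().float()  # stored at FP16
    mixed[~outlier_mask] = quantized                           # low-bit elsewhere
    return mixed
```

In a real deployment the low-bit weights would be packed and dequantized on the fly; this sketch only models the numerical effect of the mixed-precision split.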
This list is automatically generated from the titles and abstracts of the papers on this site.