An Investigation on Different Underlying Quantization Schemes for
Pre-trained Language Models
- URL: http://arxiv.org/abs/2010.07109v1
- Date: Wed, 14 Oct 2020 14:05:06 GMT
- Title: An Investigation on Different Underlying Quantization Schemes for
Pre-trained Language Models
- Authors: Zihan Zhao, Yuncong Liu, Lu Chen, Qi Liu, Rao Ma and Kai Yu
- Abstract summary: We implement k-means quantization and compare its performance on fixed-precision quantization of BERT with that of linear quantization.
We also compare the two quantization schemes on ALBERT models to explore the robustness differences between different pre-trained models.
- Score: 33.49417100179159
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, pre-trained language models like BERT have shown promising
performance on multiple natural language processing tasks. However, the
application of these models has been limited due to their huge size. A popular
and efficient way to reduce model size is quantization. Nevertheless, most
works on BERT quantization adopt simple linear clustering as the quantization
scheme, and few attempt to upgrade it, which significantly limits quantization
performance. In this paper, we implement k-means quantization and compare its
performance on fixed-precision quantization of BERT against linear
quantization. Through this comparison, we verify that the effect of upgrading
the underlying quantization scheme is underestimated and that k-means
quantization has substantial development potential. We also compare the two
quantization schemes on ALBERT models to explore the robustness differences
between different pre-trained models.
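For intuition, the following is a minimal sketch (not the authors' implementation) of the two underlying schemes compared above: uniform linear quantization, which places 2^b evenly spaced levels between a tensor's minimum and maximum, and k-means quantization, which fits 2^b centroids to the weight distribution. NumPy and scikit-learn's KMeans are assumed purely for illustration; the bit width, seed, and toy weight matrix are arbitrary.

```python
# Minimal sketch of the two fixed-precision schemes discussed in the paper.
# NOT the authors' code: numpy / scikit-learn are assumed for illustration only.
import numpy as np
from sklearn.cluster import KMeans

def linear_quantize(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Uniform (linear) quantization: 2^bits evenly spaced levels over [min, max]."""
    lo, hi = w.min(), w.max()
    n_levels = 2 ** bits
    scale = (hi - lo) / (n_levels - 1)
    codes = np.round((w - lo) / scale)          # integer code per weight
    return codes * scale + lo                   # dequantized weights

def kmeans_quantize(w: np.ndarray, bits: int = 4, seed: int = 0) -> np.ndarray:
    """k-means quantization: 2^bits centroids fitted to the weight distribution."""
    flat = w.reshape(-1, 1)
    km = KMeans(n_clusters=2 ** bits, n_init=10, random_state=seed).fit(flat)
    centroids = km.cluster_centers_.ravel()
    return centroids[km.labels_].reshape(w.shape)

if __name__ == "__main__":
    w = np.random.randn(768, 768).astype(np.float32)   # toy BERT-sized weight matrix
    for name, q in [("linear", linear_quantize(w)), ("k-means", kmeans_quantize(w))]:
        print(f"{name:8s} 4-bit MSE: {np.mean((w - q) ** 2):.6f}")
```

Because k-means places its centroids where the weights are dense, it typically reconstructs roughly bell-shaped weight matrices with lower error than evenly spaced linear levels at the same bit width, which is the kind of gap the paper measures on downstream tasks.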
Related papers
- Scaling Laws for Mixed quantization in Large Language Models [10.912306313183972]
Post-training quantization of Large Language Models (LLMs) has proven effective in reducing the computational requirements for running inference on these models.
In this study, we focus on a straightforward question: when aiming for a specific accuracy or perplexity target for low-precision quantization, how many high-precision numbers or calculations need to be preserved as we scale LLMs to larger sizes?
arXiv Detail & Related papers (2024-10-09T09:45:01Z)
- Advancing Multimodal Large Language Models with Quantization-Aware Scale Learning for Efficient Adaptation [70.22782550540714]
We introduce a Quantization-aware Scale LeArning method based on multimodal Warmup, termed QSLAW.
arXiv Detail & Related papers (2024-08-07T12:42:09Z)
- When Quantization Affects Confidence of Large Language Models? [4.338589334157708]
We show that quantization with GPTQ to 4 bits results in a decrease in confidence regarding true labels, with varying impacts observed among different language models.
We propose an explanation for quantization loss based on confidence levels, indicating that quantization disproportionately affects samples where the full model exhibited low confidence levels in the first place.
arXiv Detail & Related papers (2024-05-01T16:58:28Z)
- WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers (a generic sketch of this integer mapping appears after this list).
arXiv Detail & Related papers (2024-02-19T11:33:21Z) - Do Emergent Abilities Exist in Quantized Large Language Models: An
Empirical Study [90.34226812493083]
This work aims to investigate the impact of quantization on emergent abilities, which are important characteristics that distinguish LLMs from small language models.
Our empirical experiments show that these emergent abilities still exist in 4-bit quantization models, while 2-bit models encounter severe performance degradation.
To improve the performance of low-bit models, we conduct two special experiments: (1) a fine-grained impact analysis that studies which components (or substructures) are more sensitive to quantization, and (2) performance compensation through model fine-tuning.
arXiv Detail & Related papers (2023-07-16T15:11:01Z)
- PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models [52.09865918265002]
We propose a novel "quantize before fine-tuning" framework, PreQuant.
PreQuant is compatible with various quantization strategies, with outlier-aware fine-tuning incorporated to correct the induced quantization error.
We demonstrate the effectiveness of PreQuant on the GLUE benchmark using BERT, RoBERTa, and T5.
arXiv Detail & Related papers (2023-05-30T08:41:33Z)
- Automatic Mixed-Precision Quantization Search of BERT [62.65905462141319]
Pre-trained language models such as BERT have shown remarkable effectiveness in various natural language processing tasks.
These models usually contain millions of parameters, which prevents them from practical deployment on resource-constrained devices.
We propose an automatic mixed-precision quantization framework designed for BERT that can simultaneously conduct quantization and pruning at a subgroup-wise level.
arXiv Detail & Related papers (2021-12-30T06:32:47Z)
- Zero-shot Adversarial Quantization [11.722728148523366]
We propose a zero-shot adversarial quantization (ZAQ) framework, facilitating effective discrepancy estimation and knowledge transfer.
This is achieved by a novel two-level discrepancy modeling to drive a generator to synthesize informative and diverse data examples.
We conduct extensive experiments on three fundamental vision tasks, demonstrating the superiority of ZAQ over the strong zero-shot baselines.
arXiv Detail & Related papers (2021-03-29T01:33:34Z)
- Adaptive Quantization of Model Updates for Communication-Efficient Federated Learning [75.45968495410047]
Communication of model updates between client nodes and the central aggregating server is a major bottleneck in federated learning.
Gradient quantization is an effective way of reducing the number of bits required to communicate each model update.
We propose an adaptive quantization strategy called AdaFL that aims to achieve communication efficiency as well as a low error floor.
arXiv Detail & Related papers (2021-02-08T19:14:21Z)
- KDLSQ-BERT: A Quantized Bert Combining Knowledge Distillation with Learned Step Size Quantization [1.9786767260073905]
Transformer-based language models such as BERT have shown tremendous performance improvement for a range of natural language processing tasks.
We propose a novel quantization method named KDLSQ-BERT that combines knowledge distillation (KD) with learned step size quantization (LSQ) for language model quantization.
arXiv Detail & Related papers (2021-01-15T02:21:28Z)
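Several of the related papers above, most explicitly the WKVQuant summary, describe quantization as converting model parameters and activations into low-bit integers. The sketch below is a generic, hypothetical illustration of that mapping (asymmetric, or affine, integer quantization); it is not taken from any of the listed papers, and the bit width and toy tensor shape are arbitrary assumptions.

```python
# Generic sketch of asymmetric (affine) low-bit integer quantization, the kind of
# weight/activation-to-integer mapping several related papers above refer to.
# Hypothetical illustration only; not taken from any of the listed papers.
import numpy as np

def affine_quantize(x: np.ndarray, bits: int = 8):
    """Map floats to unsigned integers in [0, 2^bits - 1] via a scale and zero-point (assumes bits <= 8)."""
    qmax = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = int(round(-lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point

def affine_dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover an approximation of the original floats from the integer codes."""
    return (q.astype(np.float32) - zero_point) * scale

if __name__ == "__main__":
    kv_cache = np.random.randn(4, 128, 64).astype(np.float32)  # toy key/value tensor
    q, s, zp = affine_quantize(kv_cache, bits=8)
    err = np.abs(kv_cache - affine_dequantize(q, s, zp)).max()
    print(f"8-bit affine quantization, max abs error: {err:.5f}")
```

Only the integer codes plus the scale and zero-point need to be stored, which is where the memory savings come from.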