PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language
Models
- URL: http://arxiv.org/abs/2306.00014v1
- Date: Tue, 30 May 2023 08:41:33 GMT
- Title: PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language
Models
- Authors: Zhuocheng Gong, Jiahao Liu, Qifan Wang, Yang Yang, Jingang Wang, Wei
Wu, Yunsen Xian, Dongyan Zhao, Rui Yan
- Abstract summary: We propose a novel ``quantize before fine-tuning'' framework, PreQuant.
PreQuant is compatible with various quantization strategies, with outlier-aware fine-tuning incorporated to correct the induced quantization error.
We demonstrate the effectiveness of PreQuant on the GLUE benchmark using BERT, RoBERTa, and T5.
- Score: 52.09865918265002
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While transformer-based pre-trained language models (PLMs) have dominated a
number of NLP applications, these models are heavy to deploy and expensive to
use. Therefore, effectively compressing large-scale PLMs becomes an
increasingly important problem. Quantization, which represents high-precision
tensors with a low-bit fixed-point format, is a viable solution. However, most
existing quantization methods are task-specific, requiring customized training
and quantization with a large number of trainable parameters on each individual
task. Inspired by the observation that the over-parameterized nature of PLMs
makes it possible to freeze most of the parameters during the fine-tuning
stage, in this work, we propose a novel ``quantize before fine-tuning''
framework, PreQuant, that differs from both quantization-aware training and
post-training quantization. PreQuant is compatible with various quantization
strategies, with outlier-aware parameter-efficient fine-tuning incorporated to
correct the induced quantization error. We demonstrate the effectiveness of
PreQuant on the GLUE benchmark using BERT, RoBERTa, and T5. We also provide an
empirical investigation into the workflow of PreQuant, which sheds light on its
efficacy.
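For intuition, the Python sketch below illustrates the ``quantize before fine-tuning'' idea described in the abstract: the pre-trained weights are quantized once, task-agnostically, and only a small set of outlier parameters is then updated during fine-tuning to absorb the quantization error. The uniform per-tensor quantizer, the magnitude-based outlier selection, and all sizes are illustrative assumptions, not the paper's exact procedure.

import torch

def uniform_quantize(w, bits=4):
    # Symmetric per-tensor uniform quantization (an assumed scheme).
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.clamp(torch.round(w / scale), -qmax, qmax) * scale

# 1) Task-agnostic step: quantize the pre-trained weight once and freeze it.
w_fp = torch.randn(768, 768)            # stands in for one PLM weight matrix
w_q = uniform_quantize(w_fp, bits=4)    # low-bit weights shared by all tasks

# 2) Outlier-aware, parameter-efficient step: mark the ~0.1% largest-magnitude
#    weights as "outliers" and learn a sparse correction only for them.
k = max(1, int(0.001 * w_fp.numel()))
thresh = w_fp.abs().flatten().topk(k).values.min()
mask = (w_fp.abs() >= thresh).float()
delta = torch.nn.Parameter(torch.zeros_like(w_fp))  # the only trainable tensor

def effective_weight():
    # Frozen quantized weights plus the trainable sparse outlier correction.
    return w_q + mask * delta

# During downstream fine-tuning an optimizer such as
# torch.optim.AdamW([delta], lr=1e-4) updates `delta` only; w_q never changes.

In this sketch the per-task storage cost reduces to the sparse correction delta, while the quantized backbone is shared across tasks.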
Related papers
- Channel-Wise Mixed-Precision Quantization for Large Language Models [47.00361921910259]
Large Language Models (LLMs) have demonstrated remarkable success across a wide range of language tasks.
Weight-only quantization presents a promising solution to reduce the memory footprint of LLMs.
We introduce Channel-Wise Mixed-Precision Quantization (CMPQ), a novel mixed-precision quantization method.
arXiv Detail & Related papers (2024-10-16T21:34:41Z) - PikeLPN: Mitigating Overlooked Inefficiencies of Low-Precision Neural Networks [4.827161693957252]
- PikeLPN: Mitigating Overlooked Inefficiencies of Low-Precision Neural Networks [4.827161693957252]
Non-quantized elementwise operations dominate the inference cost of low-precision models.
The PikeLPN model addresses these issues by applying quantization to both elementwise operations and multiply-accumulate operations.
arXiv Detail & Related papers (2024-03-29T18:23:34Z) - WKVQuant: Quantizing Weight and Key/Value Cache for Large Language
Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
arXiv Detail & Related papers (2024-02-19T11:33:21Z) - RepQuant: Towards Accurate Post-Training Quantization of Large
- RepQuant: Towards Accurate Post-Training Quantization of Large Transformer Models via Scale Reparameterization [8.827794405944637]
Post-training quantization (PTQ) is a promising solution for compressing large transformer models.
Existing PTQ methods typically exhibit non-trivial performance loss.
We propose RepQuant, a novel PTQ framework with a quantization-inference decoupling paradigm.
arXiv Detail & Related papers (2024-02-08T12:35:41Z) - Norm Tweaking: High-performance Low-bit Quantization of Large Language
Models [21.855106896725598]
We introduce a technique called norm tweaking, which can be used as a plugin in current PTQ methods to achieve high precision.
Our method demonstrates significant improvements in both weight-only quantization and joint quantization of weights and activations.
Our simple and effective approach makes it more practical for real-world applications.
arXiv Detail & Related papers (2023-09-06T06:51:15Z) - OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [57.27101446992148]
Large language models (LLMs) have revolutionized natural language processing tasks.
Recent post-training quantization (PTQ) methods are effective in reducing memory footprint and improving the computational efficiency of LLMs.
We introduce an Omnidirectionally calibrated Quantization technique for LLMs, which achieves good performance in diverse quantization settings.
arXiv Detail & Related papers (2023-08-25T02:28:35Z) - Gradient-Based Post-Training Quantization: Challenging the Status Quo [23.1120983784623]
Quantization has become a crucial step for the efficient deployment of deep neural networks.
In this work, we show that the gradient-based post-training quantization process is, to a certain extent, robust to a number of design variables.
We derive a number of best practices for designing more efficient and scalable GPTQ methods.
arXiv Detail & Related papers (2023-08-15T09:25:11Z) - Automatic Mixed-Precision Quantization Search of BERT [62.65905462141319]
Pre-trained language models such as BERT have shown remarkable effectiveness in various natural language processing tasks.
These models usually contain millions of parameters, which prevents them from practical deployment on resource-constrained devices.
We propose an automatic mixed-precision quantization framework designed for BERT that can simultaneously conduct quantization and pruning at a subgroup-wise level.
arXiv Detail & Related papers (2021-12-30T06:32:47Z) - An Investigation on Different Underlying Quantization Schemes for
Pre-trained Language Models [33.49417100179159]
We implement k-means quantization and compare its performance with that of linear quantization for fixed-precision quantization of BERT.
We also compare the two quantization schemes on ALBERT models to explore the robustness differences between different pre-trained models.
arXiv Detail & Related papers (2020-10-14T14:05:06Z) - Gradient $\ell_1$ Regularization for Quantization Robustness [70.39776106458858]
- Gradient $\ell_1$ Regularization for Quantization Robustness [70.39776106458858]
We derive a simple regularization scheme that improves robustness against post-training quantization.
By training quantization-ready networks, our approach enables storing a single set of weights that can be quantized on-demand to different bit-widths.
arXiv Detail & Related papers (2020-02-18T12:31:34Z)