Towards Efficient Post-training Quantization of Pre-trained Language Models
- URL: http://arxiv.org/abs/2109.15082v1
- Date: Thu, 30 Sep 2021 12:50:06 GMT
- Title: Towards Efficient Post-training Quantization of Pre-trained Language Models
- Authors: Haoli Bai, Lu Hou, Lifeng Shang, Xin Jiang, Irwin King, Michael R. Lyu
- Abstract summary: We study post-training quantization (PTQ) of PLMs and propose module-wise reconstruction error minimization (MREM), an efficient solution that mitigates the slow training, large memory overhead, and data security issues of quantization-aware training.
Experiments on GLUE and SQuAD benchmarks show that our proposed PTQ solution not only performs close to QAT, but also enjoys significant reductions in training time, memory overhead, and data consumption.
- Score: 85.68317334241287
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Network quantization has gained increasing attention with the rapid growth of
large pre-trained language models (PLMs). However, most existing quantization
methods for PLMs follow quantization-aware training (QAT), which requires
end-to-end training with full access to the entire dataset. Therefore, they
suffer from slow training, large memory overhead, and data security issues. In
this paper, we study post-training quantization (PTQ) of PLMs, and propose
module-wise reconstruction error minimization (MREM), an efficient solution to
mitigate these issues. By partitioning the PLM into multiple modules, we
minimize the reconstruction error incurred by quantization for each module. In
addition, we design a new model parallel training strategy such that each
module can be trained locally on separate computing devices without waiting for
preceding modules, which brings nearly the theoretical training speed-up (e.g.,
$4\times$ on $4$ GPUs). Experiments on GLUE and SQuAD benchmarks show that our
proposed PTQ solution not only performs close to QAT, but also enjoys
significant reductions in training time, memory overhead, and data consumption.
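To make the module-wise idea concrete, here is a minimal, hypothetical PyTorch-style sketch of reconstruction-error calibration. It assumes the PLM has already been partitioned into a list of frozen full-precision blocks with matching fake-quantized copies, that each block maps a hidden-state tensor to a hidden-state tensor, and that a small calibration batch is available. The helper name `mrem_calibrate`, the optimizer choice, and the hyperparameters are illustrative, not taken from the paper's implementation.

```python
import torch

def mrem_calibrate(fp_modules, q_modules, calib_hidden, steps=100, lr=1e-4):
    """Sketch of module-wise reconstruction error minimization (MREM).

    fp_modules   -- frozen full-precision blocks of the partitioned PLM
    q_modules    -- matching blocks with fake-quantized weights/activations
    calib_hidden -- hidden states of a small calibration batch entering block 0
    """
    x_fp = calib_hidden
    for fp_mod, q_mod in zip(fp_modules, q_modules):
        # Target: the full-precision module's output on full-precision inputs.
        with torch.no_grad():
            y_fp = fp_mod(x_fp)

        # Tune only this module's parameters to reconstruct the target output.
        opt = torch.optim.Adam(q_mod.parameters(), lr=lr)
        for _ in range(steps):
            y_q = q_mod(x_fp)                     # quantized forward pass
            loss = torch.mean((y_q - y_fp) ** 2)  # reconstruction (MSE) error
            opt.zero_grad()
            loss.backward()
            opt.step()

        # The next module's inputs are full-precision activations, so each
        # module's calibration does not depend on the quantized modules before
        # it and could, in principle, run on a separate device in parallel.
        x_fp = y_fp
    return q_modules
```

Because each quantized block is fitted against cached full-precision activations rather than against the outputs of the quantized blocks before it, the per-module problems are independent; this is the kind of decoupling that the paper's model-parallel strategy exploits to train modules on separate GPUs without waiting for preceding ones.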
Related papers
- Patch-Level Training for Large Language Models [69.67438563485887]
This paper introduces patch-level training for Large Language Models (LLMs).
During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch.
Following this, the model continues token-level training on the remaining training data to align with the inference mode.
arXiv Detail & Related papers (2024-07-17T15:48:39Z)
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models [50.525259103219256]
Quantization-aware training (QAT) offers a solution by reducing memory consumption through low-bit representations with minimal accuracy loss.
We propose Efficient Quantization-Aware Training (EfficientQAT), a more feasible QAT algorithm.
EfficientQAT involves two consecutive phases: block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP).
arXiv Detail & Related papers (2024-07-10T17:53:30Z)
- DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models [3.3484462092188005]
We introduce a lazy asynchronous multi-level approach that takes advantage of the fact that the tensors making up the model and state shards remain immutable for extended periods of time.
The results show up to 48$\times$ faster checkpointing and 2.2$\times$ faster end-to-end training compared with state-of-the-art checkpointing approaches.
arXiv Detail & Related papers (2024-06-15T18:30:40Z)
- Low-Rank Quantization-Aware Training for LLMs [8.535254310145005]
Large language models (LLMs) are omnipresent; however, their practical deployment is challenging due to their ever-increasing computational and memory demands.
We propose LR-QAT -- a lightweight and memory-efficient QAT algorithm for LLMs.
Our method outperforms common post-training quantization (PTQ) approaches and reaches the same model performance as full-model QAT at a fraction of its memory usage.
arXiv Detail & Related papers (2024-06-10T15:44:22Z)
- One QuantLLM for ALL: Fine-tuning Quantized LLMs Once for Efficient Deployments [43.107261545706415]
Large Language Models (LLMs) have advanced rapidly but face significant memory demands.
Current methods typically require lengthy training to alleviate the performance degradation from quantization loss.
We make an initial attempt to extend the once-for-all framework to large language models.
arXiv Detail & Related papers (2024-05-30T16:05:15Z)
- Unsupervised Pre-training with Language-Vision Prompts for Low-Data Instance Segmentation [105.23631749213729]
We propose a novel method for unsupervised pre-training in low-data regimes.
Inspired by the recently successful prompting technique, we introduce a new method, Unsupervised Pre-training with Language-Vision Prompts.
We show that our method can converge faster and perform better than CNN-based models in low-data regimes.
arXiv Detail & Related papers (2024-05-22T06:48:43Z)
- LLM-QAT: Data-Free Quantization Aware Training for Large Language Models [38.76165207636793]
We propose a data-free distillation method that leverages generations produced by the pre-trained model.
In addition to quantizing weights and activations, we also quantize the KV cache, which is critical for increasing throughput.
We experiment with LLaMA models of sizes 7B, 13B, and 30B, at quantization levels down to 4 bits.
arXiv Detail & Related papers (2023-05-29T05:22:11Z)
- Decouple Graph Neural Networks: Train Multiple Simple GNNs Simultaneously Instead of One [60.5818387068983]
Graph neural networks (GNNs) suffer from severe training inefficiency.
We propose to decouple a multi-layer GNN as multiple simple modules for more efficient training.
We show that the proposed framework is highly efficient with reasonable performance.
arXiv Detail & Related papers (2023-04-20T07:21:32Z)
- Modular Quantization-Aware Training for 6D Object Pose Estimation [52.9436648014338]
Edge applications demand efficient 6D object pose estimation on resource-constrained embedded platforms.
We introduce Modular Quantization-Aware Training (MQAT), an adaptive and mixed-precision quantization-aware training strategy.
MQAT guides a systematic gradated modular quantization sequence and determines module-specific bit precisions, leading to quantized models that outperform those produced by state-of-the-art uniform and mixed-precision quantization techniques.
arXiv Detail & Related papers (2023-03-12T21:01:54Z)