Related papers: SAMP: A Model Inference Toolkit of Post-Training Quantization for Text Processing via Self-Adaptive Mixed-Precision

SAMP: A Model Inference Toolkit of Post-Training Quantization for Text Processing via Self-Adaptive Mixed-Precision

URL: http://arxiv.org/abs/2209.09130v2
Date: Sun, 17 Dec 2023 16:54:55 GMT
Title: SAMP: A Model Inference Toolkit of Post-Training Quantization for Text Processing via Self-Adaptive Mixed-Precision
Authors: Rong Tian, Zijing Zhao, Weijie Liu, Haoyan Liu, Weiquan Mao, Zhe Zhao, Kan Zhou
Abstract summary: We develop a toolkit for users to easily quantize their models for inference. Self-Adaptive Mixed-Precision (SAMP) is proposed to automatically control quantization rate by a mixed-precision architecture. Experimental results show that our SAMP toolkit has a higher speedup than PyTorch and FasterTransformer.
Score: 8.746249050302058
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The latest industrial inference engines, such as FasterTransformer and TurboTransformers, have verified that half-precision floating point (FP16) and 8-bit integer (INT8) quantization can greatly improve model inference speed. However, the existing INT8 quantization methods are too complicated, and improper usage will lead to model performance damage greatly. In this paper, we develop a toolkit for users to easily quantize their models for inference, in which Self-Adaptive Mixed-Precision (SAMP) is proposed to automatically control quantization rate by a mixed-precision architecture to balance model accuracy and efficiency. Experimental results show that our SAMP toolkit has a higher speedup than PyTorch and FasterTransformer while ensuring the required accuracy. In addition, SAMP is based on a modular design, decoupling the tokenizer, embedding, encoder and target layers, which allows users to handle various downstream tasks and can be seamlessly integrated into PyTorch.

Related papers

MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations. Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality. No previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter count-dominated tasks such as language modeling.
arXiv Detail & Related papers (2024-05-25T03:24:32Z)
Test-Time Model Adaptation with Only Forward Passes [68.11784295706995]
Test-time adaptation has proven effective in adapting a given trained model to unseen test samples with potential distribution shifts. We propose a test-time Forward-Optimization Adaptation (FOA) method. FOA runs on quantized 8-bit ViT, outperforms gradient-based TENT on full-precision 32-bit ViT, and achieves an up to 24-fold memory reduction on ImageNet-C.
arXiv Detail & Related papers (2024-04-02T05:34:33Z)
On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks [52.97107229149988]
We propose an On-Chip Hardware-Aware Quantization framework, performing hardware-aware mixed-precision quantization on deployed edge devices. For efficiency metrics, we built an On-Chip Quantization Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of the quantization operator. For accuracy metrics, we propose Mask-Guided Quantization Estimation technology to effectively estimate the accuracy impact of operators in the on-chip scenario.
arXiv Detail & Related papers (2023-09-05T04:39:34Z)
Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference [18.308180927492643]
ToP is a ranking-distilled token distillation technique, which distills effective token rankings from the final layer of unpruned models to early layers of pruned models. ToP reduces the average FLOPs of BERT by 8.1x while achieving competitive accuracy on GLUE, and provides a real latency speedup of up to 7.4x on an Intel CPU.
arXiv Detail & Related papers (2023-06-26T03:06:57Z)
Augmenting Hessians with Inter-Layer Dependencies for Mixed-Precision Post-Training Quantization [7.392278887917975]
We propose a mixed-precision post training quantization approach that assigns different numerical precisions to tensors in a network based on their specific needs. Our experiments demonstrate latency reductions compared to a 16-bit baseline of $25.48%$, $21.69%$, and $33.28%$ respectively.
arXiv Detail & Related papers (2023-06-08T02:18:58Z)
Infor-Coef: Information Bottleneck-based Dynamic Token Downsampling for Compact and Efficient language model [0.0]
Excessive overhead leads to large latency and computational costs. We propose a model accelaration approaches for large language models. Our model achieves an 18x FLOPs speedup with an accuracy degradation of less than 8% compared to BERT.
arXiv Detail & Related papers (2023-05-21T13:30:56Z)
Patch-wise Mixed-Precision Quantization of Vision Transformer [2.3104000011280403]
Vision Transformers (ViTs) require complex self-attention computation to guarantee the learning of powerful feature representations. We propose a novel patch-wise mixed-precision quantization (PMQ) for efficient inference of ViTs.
arXiv Detail & Related papers (2023-05-11T04:34:10Z)
Automatic Mixed-Precision Quantization Search of BERT [62.65905462141319]
Pre-trained language models such as BERT have shown remarkable effectiveness in various natural language processing tasks. These models usually contain millions of parameters, which prevents them from practical deployment on resource-constrained devices. We propose an automatic mixed-precision quantization framework designed for BERT that can simultaneously conduct quantization and pruning in a subgroup-wise level.
arXiv Detail & Related papers (2021-12-30T06:32:47Z)
Mixed Precision of Quantization of Transformer Language Models for Speech Recognition [67.95996816744251]
State-of-the-art neural language models represented by Transformers are becoming increasingly complex and expensive for practical applications. Current low-bit quantization methods are based on uniform precision and fail to account for the varying performance sensitivity at different parts of the system to quantization errors. The optimal local precision settings are automatically learned using two techniques. Experiments conducted on Penn Treebank (PTB) and a Switchboard corpus trained LF-MMI TDNN system.
arXiv Detail & Related papers (2021-11-29T09:57:00Z)
MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to large parameter capacity, but also lead to huge computation cost. We explore to accelerate large-model inference by conditional computation based on the sparse activation phenomenon. We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
AxFormer: Accuracy-driven Approximation of Transformers for Faster, Smaller and more Accurate NLP Models [4.247712017691596]
AxFormer is a framework that applies accuracy-driven approximations to create optimized transformer models for a given downstream task. Our experiments show that AxFormer models are up to 4.5% more accurate, while also being up to 2.5X faster and up to 3.2X smaller than conventional fine-tuned models.
arXiv Detail & Related papers (2020-10-07T23:29:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.