Genetic Quantization-Aware Approximation for Non-Linear Operations in Transformers
- URL: http://arxiv.org/abs/2403.19591v2
- Date: Fri, 29 Mar 2024 14:13:11 GMT
- Title: Genetic Quantization-Aware Approximation for Non-Linear Operations in Transformers
- Authors: Pingcheng Dong, Yonghao Tan, Dong Zhang, Tianwei Ni, Xuejiao Liu, Yu Liu, Peng Luo, Luhong Liang, Shih-Yang Liu, Xijie Huang, Huaiyu Zhu, Yun Pan, Fengwei An, Kwang-Ting Cheng,
- Abstract summary: Non-linear functions are prevalent in Transformers and their lightweight variants, incurring substantial and frequently underestimated hardware costs.
Previous state-of-the-art works optimize these operations by piece-wise linear approximation and store the parameters in look-up tables (LUT)
This paper proposed a genetic LUT-Approximation algorithm namely GQA-LUT that can automatically determine the parameters with quantization awareness.
- Score: 26.62171477561166
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Non-linear functions are prevalent in Transformers and their lightweight variants, incurring substantial and frequently underestimated hardware costs. Previous state-of-the-art works optimize these operations by piece-wise linear approximation and store the parameters in look-up tables (LUT), but most of them require unfriendly high-precision arithmetics such as FP/INT 32 and lack consideration of integer-only INT quantization. This paper proposed a genetic LUT-Approximation algorithm namely GQA-LUT that can automatically determine the parameters with quantization awareness. The results demonstrate that GQA-LUT achieves negligible degradation on the challenging semantic segmentation task for both vanilla and linear Transformer models. Besides, proposed GQA-LUT enables the employment of INT8-based LUT-Approximation that achieves an area savings of 81.3~81.7% and a power reduction of 79.3~80.2% compared to the high-precision FP/INT 32 alternatives. Code is available at https:// github.com/PingchengDong/GQA-LUT.
Related papers
- Progressive Mixed-Precision Decoding for Efficient LLM Inference [49.05448842542558]
We introduce Progressive Mixed-Precision Decoding (PMPD) to address the memory-boundedness of decoding.
PMPD achieves 1.4$-$12.2$times$ speedup in matrix-vector multiplications over fp16 models.
Our approach delivers a throughput gain of 3.8$-$8.0$times$ over fp16 models and up to 1.54$times$ over uniform quantization approaches.
arXiv Detail & Related papers (2024-10-17T11:46:33Z) - Toward Capturing Genetic Epistasis From Multivariate Genome-Wide Association Studies Using Mixed-Precision Kernel Ridge Regression [4.356528958652799]
We boost the performance of output accuracy-preserving mixed-precision computation of Genome-Wide Association Studies (GWAS) of 305K patients from the UK BioBank.
Tile-centric adaptive-precision linear algebraic techniques motivated by reducing data motion gain enhanced significance with low-precision GPU arithmetic.
We deploy a new four-precision Cholesky-based solver, which, at 1.805 mixed-precision ExaOp/s on a nearly full Alps system, outperforms the state-of-the-art CPU-only REGENIE GWAS software by five orders of magnitude.
arXiv Detail & Related papers (2024-09-03T08:50:42Z) - EfficientQAT: Efficient Quantization-Aware Training for Large Language Models [50.525259103219256]
quantization-aware training (QAT) offers a solution by reducing memory consumption through low-bit representations with minimal accuracy loss.
We propose Efficient Quantization-Aware Training (EfficientQAT), a more feasible QAT algorithm.
EfficientQAT involves two consecutive phases: Block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP)
arXiv Detail & Related papers (2024-07-10T17:53:30Z) - LATTE: Low-Precision Approximate Attention with Head-wise Trainable Threshold for Efficient Transformer [0.0]
We propose Low-Precision Approximate Attention with Head-wise Trainable Threshold for Efficient Transformer (LATTE)
LATTE employs a headwise threshold-based filter with the low-precision dot product to reduce the computation of Multi-Head Attention (MHA)
Experimental results indicate LATTE can smoothly adapt to both NLP and CV tasks, offering significant computation savings.
arXiv Detail & Related papers (2024-04-11T07:23:19Z) - FP8-BERT: Post-Training Quantization for Transformer [20.51143486483669]
Transformer-based models, such as BERT, require massive memory storage and inference cost when deployed in production.
New numeric format FP8 has been proposed and supported in commercial AI computing platforms such as H100.
We empirically validate the effectiveness of FP8 as a way to do Post-Training Quantization without significant loss of accuracy.
arXiv Detail & Related papers (2023-12-10T02:14:34Z) - Sparse Universal Transformer [64.78045820484299]
The Universal Transformer (UT) is a variant of the Transformer that shares parameters across its layers.
This paper proposes the Sparse Universal Transformer (SUT), which leverages Sparse Mixture of Experts (SMoE) and a new stick-breaking-based dynamic halting mechanism.
arXiv Detail & Related papers (2023-10-11T00:38:57Z) - LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale [80.86029795281922]
We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers.
A 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation.
arXiv Detail & Related papers (2022-08-15T17:08:50Z) - NN-LUT: Neural Approximation of Non-Linear Operations for Efficient
Transformer Inference [9.329021390526124]
Non-linear operations such as GELU, Layer normalization, and Softmax are essential yet costly building blocks of Transformer models.
This paper proposes an accurate and hardware-friendly approximation framework for efficient Transformer inference.
arXiv Detail & Related papers (2021-12-03T23:06:57Z) - Statistically Meaningful Approximation: a Case Study on Approximating
Turing Machines with Transformers [50.85524803885483]
This work proposes a formal definition of statistically meaningful (SM) approximation which requires the approximating network to exhibit good statistical learnability.
We study SM approximation for two function classes: circuits and Turing machines.
arXiv Detail & Related papers (2021-07-28T04:28:55Z) - Implicit differentiation of Lasso-type models for hyperparameter
optimization [82.73138686390514]
We introduce an efficient implicit differentiation algorithm, without matrix inversion, tailored for Lasso-type problems.
Our approach scales to high-dimensional data by leveraging the sparsity of the solutions.
arXiv Detail & Related papers (2020-02-20T18:43:42Z) - Learning Accurate Integer Transformer Machine-Translation Models [0.05184427980355132]
We describe a method for training accurate Transformer machine-translation models to run inference using 8-bit integer (INT8) hardware matrix multipliers.
Our approach converts all matrix-multiplication tensors from an existing FP32 model into INT8 tensors by automatically making range-precision trade-offs during training.
arXiv Detail & Related papers (2020-01-03T18:40:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.