Genetic Quantization-Aware Approximation for Non-Linear Operations in Transformers
- URL: http://arxiv.org/abs/2403.19591v2
- Date: Fri, 29 Mar 2024 14:13:11 GMT
- Title: Genetic Quantization-Aware Approximation for Non-Linear Operations in Transformers
- Authors: Pingcheng Dong, Yonghao Tan, Dong Zhang, Tianwei Ni, Xuejiao Liu, Yu Liu, Peng Luo, Luhong Liang, Shih-Yang Liu, Xijie Huang, Huaiyu Zhu, Yun Pan, Fengwei An, Kwang-Ting Cheng
- Abstract summary: Non-linear functions are prevalent in Transformers and their lightweight variants, incurring substantial and frequently underestimated hardware costs.
Previous state-of-the-art works optimize these operations with piece-wise linear approximation and store the parameters in look-up tables (LUTs).
This paper proposes a genetic LUT-approximation algorithm, GQA-LUT, that automatically determines the parameters with quantization awareness.
- Score: 26.62171477561166
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Non-linear functions are prevalent in Transformers and their lightweight variants, incurring substantial and frequently underestimated hardware costs. Previous state-of-the-art works optimize these operations with piece-wise linear approximation and store the parameters in look-up tables (LUTs), but most require unfriendly high-precision arithmetic such as FP/INT 32 and do not consider integer-only INT quantization. This paper proposes a genetic LUT-approximation algorithm, GQA-LUT, that automatically determines the parameters with quantization awareness. The results demonstrate that GQA-LUT achieves negligible degradation on the challenging semantic segmentation task for both vanilla and linear Transformer models. Moreover, the proposed GQA-LUT enables INT8-based LUT approximation, achieving area savings of 81.3-81.7% and power reductions of 79.3-80.2% compared to the high-precision FP/INT 32 alternatives. Code is available at https://github.com/PingchengDong/GQA-LUT.
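The core mechanism described here, piece-wise linear approximation of a non-linear operator with parameters kept in an integer LUT, can be illustrated with a minimal NumPy sketch. Everything below (segment boundaries, the two scale factors, the least-squares fit) is an invented placeholder for exposition; the actual GQA-LUT genetic search and its quantization-aware fitness criterion live in the linked repository and are not reproduced here.

```python
import numpy as np

# Illustrative piece-wise linear LUT approximation of GELU with INT8
# slope/intercept entries. Segment boundaries and scales are hand-picked
# placeholders; GQA-LUT searches such parameters genetically with
# quantization awareness, which is not reproduced here.

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

breakpoints = np.linspace(-4.0, 4.0, 9)        # 8 segments over [-4, 4]
slope_scale = 1.5 / 127                        # assumed per-table scales
bias_scale = 4.0 / 127

lut = []                                       # (int8 slope, int8 intercept) per segment
for lo, hi in zip(breakpoints[:-1], breakpoints[1:]):
    xs = np.linspace(lo, hi, 64)
    k, b = np.polyfit(xs, gelu(xs), 1)         # least-squares line per segment
    lut.append((int(np.clip(np.round(k / slope_scale), -128, 127)),
                int(np.clip(np.round(b / bias_scale), -128, 127))))

def pwl_gelu(x):
    """Evaluate the quantized piece-wise linear approximation at a scalar x."""
    idx = int(np.clip(np.searchsorted(breakpoints, x) - 1, 0, len(lut) - 1))
    k_q, b_q = lut[idx]
    return (k_q * slope_scale) * x + (b_q * bias_scale)

for v in (-2.0, -0.5, 0.0, 1.0, 3.0):
    print(f"x={v:+.1f}  approx={pwl_gelu(v):+.4f}  exact={gelu(v):+.4f}")
```

In the paper's setting, a genetic algorithm searches the segmentation and scales jointly so that the INT8 slope/intercept entries remain accurate after quantization; the sketch only shows the inference-time shape of such a LUT-based approximation.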
Related papers
- LiFT: Lightweight, FPGA-tailored 3D object detection based on LiDAR data [0.5461938536945721]
LiFT is a lightweight, fully quantized 3D object detection algorithm for LiDAR data optimized for real-time inference on FPGA platforms.
With a computational cost of just 20.73 GMACs, LiFT stands out as one of the few algorithms targeting minimal-complexity 3D object detection.
arXiv Detail & Related papers (2025-01-19T20:15:13Z)
- Transforming Indoor Localization: Advanced Transformer Architecture for NLOS Dominated Wireless Environments with Distributed Sensors [7.630782404476683]
We introduce a novel tokenization approach, referred to as Sensor Snapshot Tokenization (SST), which preserves variable-specific representations of the power delay profile (PDP).
We also propose a lightweight Swish-Gated Linear Unit-based Transformer (L-SwiGLU Transformer) model, designed to reduce computational complexity without compromising localization accuracy.
arXiv Detail & Related papers (2025-01-14T01:16:30Z)
- Progressive Mixed-Precision Decoding for Efficient LLM Inference [49.05448842542558]
We introduce Progressive Mixed-Precision Decoding (PMPD) to address the memory-boundedness of decoding.
PMPD achieves a 1.4x-12.2x speedup in matrix-vector multiplications over fp16 models.
Our approach delivers a throughput gain of 3.8x-8.0x over fp16 models and up to 1.54x over uniform quantization approaches.
arXiv Detail & Related papers (2024-10-17T11:46:33Z)
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models [50.525259103219256]
Quantization-aware training (QAT) offers a solution by reducing memory consumption through low-bit representations with minimal accuracy loss.
We propose Efficient Quantization-Aware Training (EfficientQAT), a more feasible QAT algorithm.
EfficientQAT involves two consecutive phases: block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP).
arXiv Detail & Related papers (2024-07-10T17:53:30Z)
- LATTE: Low-Precision Approximate Attention with Head-wise Trainable Threshold for Efficient Transformer [0.0]
We propose Low-Precision Approximate Attention with Head-wise Trainable Threshold for Efficient Transformer (LATTE).
LATTE employs a head-wise threshold-based filter with a low-precision dot product to reduce the computation of Multi-Head Attention (MHA).
Experimental results indicate LATTE can smoothly adapt to both NLP and CV tasks, offering significant computation savings.
arXiv Detail & Related papers (2024-04-11T07:23:19Z)
- FP8-BERT: Post-Training Quantization for Transformer [20.51143486483669]
Transformer-based models, such as BERT, require massive memory storage and inference cost when deployed in production.
The new FP8 numeric format has been proposed and is supported in commercial AI computing platforms such as the H100.
We empirically validate the effectiveness of FP8 as a way to do Post-Training Quantization without significant loss of accuracy.
arXiv Detail & Related papers (2023-12-10T02:14:34Z)
- Sparse Universal Transformer [64.78045820484299]
The Universal Transformer (UT) is a variant of the Transformer that shares parameters across its layers.
This paper proposes the Sparse Universal Transformer (SUT), which leverages Sparse Mixture of Experts (SMoE) and a new stick-breaking-based dynamic halting mechanism.
arXiv Detail & Related papers (2023-10-11T00:38:57Z)
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale [80.86029795281922]
We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers.
A 175B-parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation (a generic, simplified INT8-matmul sketch is given after this list).
arXiv Detail & Related papers (2022-08-15T17:08:50Z)
- Statistically Meaningful Approximation: a Case Study on Approximating Turing Machines with Transformers [50.85524803885483]
This work proposes a formal definition of statistically meaningful (SM) approximation which requires the approximating network to exhibit good statistical learnability.
We study SM approximation for two function classes: circuits and Turing machines.
arXiv Detail & Related papers (2021-07-28T04:28:55Z)
- Implicit differentiation of Lasso-type models for hyperparameter optimization [82.73138686390514]
We introduce an efficient implicit differentiation algorithm, without matrix inversion, tailored for Lasso-type problems.
Our approach scales to high-dimensional data by leveraging the sparsity of the solutions.
arXiv Detail & Related papers (2020-02-20T18:43:42Z)
- Learning Accurate Integer Transformer Machine-Translation Models [0.05184427980355132]
We describe a method for training accurate Transformer machine-translation models to run inference using 8-bit integer (INT8) hardware matrix multipliers.
Our approach converts all matrix-multiplication tensors from an existing FP32 model into INT8 tensors by automatically making range-precision trade-offs during training.
arXiv Detail & Related papers (2020-01-03T18:40:35Z)
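Several entries above (LLM.int8(), EfficientQAT, FP8-BERT, the integer machine-translation work) revolve around low-precision matrix multiplication in Transformer layers. As a generic, heavily simplified illustration of the idea, and not the procedure of any specific paper listed (LLM.int8(), for instance, additionally uses vector-wise scaling with a mixed-precision outlier decomposition), an absmax INT8 matmul with dequantization might be sketched as:

```python
import numpy as np

# Generic absmax INT8 matrix multiplication, dequantized back to float.
# Simplified illustration only; the scaling scheme is the plainest
# symmetric per-row/per-column variant, not any listed paper's method.

def quantize_absmax(x, axis):
    """Symmetric INT8 quantization along the given axis."""
    scale = np.max(np.abs(x), axis=axis, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)           # avoid division by zero
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(a, b):
    """C = A @ B computed with INT8 operands and an INT32 accumulator."""
    a_q, a_s = quantize_absmax(a, axis=1)              # per row of A
    b_q, b_s = quantize_absmax(b, axis=0)              # per column of B
    acc = a_q.astype(np.int32) @ b_q.astype(np.int32)  # integer accumulation
    return acc * a_s * b_s                             # dequantize to float

rng = np.random.default_rng(0)
a, b = rng.standard_normal((4, 64)), rng.standard_normal((64, 8))
err = np.abs(int8_matmul(a, b) - a @ b).max()
print(f"max abs error vs fp64 matmul: {err:.4f}")
```

Per-row and per-column absmax scaling is used here only to keep the sketch self-contained; the listed papers refine how such scales are chosen, trained, or decomposed.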
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.