EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for
the Acceleration of Lightweight LLMs on the Edge
- URL: http://arxiv.org/abs/2402.10787v1
- Date: Fri, 16 Feb 2024 16:10:38 GMT
- Title: EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for
the Acceleration of Lightweight LLMs on the Edge
- Authors: Xuan Shen, Zhenglun Kong, Changdi Yang, Zhaoyang Han, Lei Lu, Peiyan
Dong, Cheng Lyu, Chih-hsiang Li, Xuehang Guo, Zhihao Shu, Wei Niu, Miriam
Leeser, Pu Zhao, Yanzhi Wang
- Abstract summary: Post-Training Quantization (PTQ) methods degrade in quality when quantizing weights, activations, and KV cache together to below 8 bits.
Many Quantization-Aware Training (QAT) works quantize only the model weights, leaving the activations untouched, and therefore do not fully exploit the potential of quantization for inference acceleration on the edge.
We propose EdgeQAT, the Entropy and Distribution Guided QAT for the optimization of lightweight LLMs to achieve inference acceleration on Edge devices.
- Score: 40.85258685379659
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the remarkable strides of Large Language Models (LLMs) in various
fields, the wide applications of LLMs on edge devices are limited due to their
massive parameters and computations. To address this, quantization is commonly
adopted to generate lightweight LLMs with efficient computations and fast
inference. However, Post-Training Quantization (PTQ) methods dramatically
degrade in quality when quantizing weights, activations, and KV cache together
to below 8 bits. In addition, many Quantization-Aware Training (QAT) works quantize
only the model weights, leaving the activations untouched, which does not fully
exploit the potential of quantization for inference acceleration on the edge. In this
paper, we propose EdgeQAT, the Entropy and Distribution Guided QAT for the
optimization of lightweight LLMs to achieve inference acceleration on Edge
devices. We first identify that the performance drop of quantization primarily
stems from the information distortion in quantized attention maps, demonstrated
by the different distributions in quantized query and key of the self-attention
mechanism. Then, the entropy and distribution guided QAT is proposed to
mitigate the information distortion. Moreover, we design a token
importance-aware adaptive method to dynamically quantize the tokens with
different bit widths for further optimization and acceleration. Our extensive
experiments verify the substantial improvements with our framework across
various datasets. Furthermore, we achieve an on-device speedup of up to 2.37x
compared with the FP16 counterparts across multiple edge devices, signaling a
groundbreaking advancement.
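
To make the two mechanisms above more concrete, the snippet below is a minimal, hypothetical PyTorch sketch of (a) fake-quantizing query/key with a straight-through estimator while penalizing the distribution and entropy gap between the FP16 and quantized attention maps, and (b) assigning per-token bit widths from an attention-based importance proxy. The function names, the KL/entropy loss form, and the attention-sum importance heuristic are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical illustration only: symmetric fake quantization with a straight-through
# estimator (STE), an entropy/distribution-gap penalty on attention maps, and an
# attention-based token-importance proxy for adaptive bit widths.
import torch
import torch.nn.functional as F


def fake_quantize(x: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with an STE for gradients."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    x_q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
    return x + (x_q - x).detach()  # forward uses x_q, backward flows through x


def attention_entropy(attn: torch.Tensor) -> torch.Tensor:
    """Mean Shannon entropy of attention rows (rows sum to 1 over the last dim)."""
    return -(attn * attn.clamp(min=1e-12).log()).sum(dim=-1).mean()


def qat_attention_loss(q: torch.Tensor, k: torch.Tensor,
                       n_bits: int = 4, lam: float = 0.1) -> torch.Tensor:
    """Auxiliary QAT loss: match quantized attention to the FP reference (KL term)
    and penalize the entropy lost through quantization (information-distortion term)."""
    d = q.shape[-1]
    attn_fp = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    attn_q = F.softmax(fake_quantize(q, n_bits) @ fake_quantize(k, n_bits).transpose(-2, -1)
                       / d ** 0.5, dim=-1)
    kl = F.kl_div(attn_q.clamp(min=1e-12).log(), attn_fp, reduction="batchmean")
    ent_gap = (attention_entropy(attn_fp) - attention_entropy(attn_q)).clamp(min=0)
    return kl + lam * ent_gap


def adaptive_token_bits(attn: torch.Tensor, high_bits: int = 8,
                        low_bits: int = 4, top_frac: float = 0.25) -> torch.Tensor:
    """Assign a higher bit width to tokens that receive the most attention
    (an assumed importance proxy; the paper's exact criterion may differ)."""
    importance = attn.mean(dim=(0, 1, 2))            # (seq_len,) average attention received
    k = max(1, int(top_frac * importance.numel()))
    bits = torch.full_like(importance, float(low_bits))
    bits[importance.topk(k).indices] = float(high_bits)
    return bits


# Example usage with dummy tensors shaped (batch, heads, tokens, head_dim):
q = torch.randn(2, 8, 128, 64, requires_grad=True)
k = torch.randn(2, 8, 128, 64, requires_grad=True)
aux_loss = qat_attention_loss(q, k, n_bits=4)        # add to the task loss during QAT
token_bits = adaptive_token_bits(F.softmax(q @ k.transpose(-2, -1) / 8.0, dim=-1))
```

In a QAT loop, such an auxiliary loss would simply be added to the task loss, and the per-token bit widths would then drive the activation quantizers for more versus less important tokens.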
Related papers
- QSpec: Speculative Decoding with Complementary Quantization Schemes [37.007621357142725]
Quantization has been widely adopted to accelerate inference and reduce the memory consumption of large language models.
We propose a novel quantization paradigm called QSPEC, which seamlessly integrates two complementary quantization schemes for speculative decoding.
QSPEC empirically boosts token generation throughput by up to 1.80x without any quality compromise.
arXiv Detail & Related papers (2024-10-15T05:57:51Z)
- AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer [54.713778961605115]
Vision Transformer (ViT) has become one of the most prevailing fundamental backbone networks in the computer vision community.
We propose a novel non-uniform quantizer, dubbed the Adaptive Logarithm (AdaLog) quantizer.
arXiv Detail & Related papers (2024-07-17T18:38:48Z)
- OutlierTune: Efficient Channel-Wise Quantization for Large Language Models [24.645237670811476]
OutlierTune is an efficient per-channel post-training quantization method for the activations of large language models.
The proposed framework is easy to implement and hardware-efficient, introducing almost no additional computational overhead during inference.
arXiv Detail & Related papers (2024-06-27T02:02:26Z)
- PikeLPN: Mitigating Overlooked Inefficiencies of Low-Precision Neural Networks [4.827161693957252]
Non-quantized elementwise operations dominate the inference cost of low-precision models.
PikeLPN model addresses these issues by applying quantization to both elementwise operations and multiply-accumulate operations.
arXiv Detail & Related papers (2024-03-29T18:23:34Z)
- WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
arXiv Detail & Related papers (2024-02-19T11:33:21Z)
- Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge [45.690907522226794]
Large Language Models (LLMs) stand out for their impressive performance in intricate language modeling tasks.
Recent works show that 8-bit or lower weight quantization is feasible with minimal impact on end-to-end task performance.
We propose Agile-Quant, an activation-guided quantization framework for popular Large Language Models.
arXiv Detail & Related papers (2023-12-09T22:12:52Z)
- On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks [52.97107229149988]
We propose an On-Chip Hardware-Aware Quantization framework, performing hardware-aware mixed-precision quantization on deployed edge devices.
For efficiency metrics, we build an On-Chip Quantization Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of the quantization operator.
For accuracy metrics, we propose Mask-Guided Quantization Estimation technology to effectively estimate the accuracy impact of operators in the on-chip scenario.
arXiv Detail & Related papers (2023-09-05T04:39:34Z)
- QuantEase: Optimization-based Quantization for Language Models [17.333778751252392]
This work introduces QuantEase, a layer-wise Post-Training Quantization (PTQ) framework for Large Language Models (LLMs).
Our Coordinate Descent (CD)-based approach features straightforward updates, relying solely on vector operations.
We also explore an outlier-aware variant, which retains significant weights (outliers) at full precision.
arXiv Detail & Related papers (2023-09-05T01:39:09Z) - OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [57.27101446992148]
Large language models (LLMs) have revolutionized natural language processing tasks.
Recent post-training quantization (PTQ) methods are effective in reducing the memory footprint and improving the computational efficiency of LLMs.
We introduce an Omnidirectionally calibrated Quantization technique for LLMs, which achieves good performance in diverse quantization settings.
arXiv Detail & Related papers (2023-08-25T02:28:35Z) - PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language
Models [52.09865918265002]
We propose a novel "quantize before fine-tuning" framework, PreQuant.
PreQuant is compatible with various quantization strategies, with outlier-aware fine-tuning incorporated to correct the induced quantization error.
We demonstrate the effectiveness of PreQuant on the GLUE benchmark using BERT, RoBERTa, and T5.
arXiv Detail & Related papers (2023-05-30T08:41:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.