Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs
on the Edge
- URL: http://arxiv.org/abs/2312.05693v1
- Date: Sat, 9 Dec 2023 22:12:52 GMT
- Title: Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs
on the Edge
- Authors: Xuan Shen, Peiyan Dong, Lei Lu, Zhenglun Kong, Zhengang Li, Ming Lin,
Chao Wu, Yanzhi Wang
- Abstract summary: Large Language Models (LLMs) stand out for their impressive performance in intricate language modeling tasks.
Recent works show that 8-bit or lower weight quantization is feasible with minimal impact on end-to-end task performance.
We propose Agile-Quant, an activation-guided quantization framework for popular Large Language Models.
- Score: 45.690907522226794
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) stand out for their impressive performance in
intricate language modeling tasks. However, their demanding computational and
memory needs pose obstacles for broad use on edge devices. Quantization is then
introduced to boost LLMs' on-device efficiency. Recent works show that 8-bit or
lower weight quantization is feasible with minimal impact on end-to-end task
performance, while activations are still not quantized. On the other hand,
mainstream commodity edge devices still struggle to execute these sub-8-bit
quantized networks effectively. In this paper, we propose Agile-Quant, an
activation-guided quantization framework for popular Large Language Models
(LLMs), and implement an end-to-end accelerator on multiple edge devices for
faster inference. Considering the hardware profiling and activation analysis,
we first introduce a basic activation quantization strategy to balance the
trade-off of task performance and real inference speed. Then we leverage the
activation-aware token pruning technique to reduce outliers and their adverse
impact on attention. Finally, we utilize the SIMD-based 4-bit multiplier
and our efficient TRIP matrix multiplication to implement the accelerator for
LLMs on the edge. We apply our framework on different scales of LLMs including
LLaMA, OPT, and BLOOM with 4-bit or 8-bit for the activation and 4-bit for the
weight quantization. Experiments show that Agile-Quant achieves simultaneous
quantization of model weights and activations while maintaining task
performance comparable to existing weight-only quantization methods. Moreover,
in the 8- and 4-bit scenario, Agile-Quant achieves an on-device speedup of up
to 2.55x compared to its FP16 counterparts across multiple edge devices,
marking a pioneering advancement in this domain.
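To make the quantization setting described in the abstract concrete, the snippet below is a minimal NumPy sketch of symmetric per-token 8-bit activation quantization combined with per-channel 4-bit weight quantization, followed by integer accumulation and dequantization. It illustrates the generic W4A8 recipe only; it is not the authors' implementation and omits Agile-Quant's token pruning, TRIP matrix multiplication, and SIMD-based 4-bit kernels. All function names here are our own.

```python
# Minimal sketch of a W4A8 linear layer: symmetric per-channel 4-bit weight
# quantization and per-token 8-bit activation quantization, integer matmul,
# then dequantization. Illustrative only; not Agile-Quant's kernels.
import numpy as np

def quantize_symmetric(x, n_bits, axis):
    """Symmetric quantization along `axis`; returns integer codes and scales."""
    qmax = 2 ** (n_bits - 1) - 1                      # 7 for 4-bit, 127 for 8-bit
    scale = np.max(np.abs(x), axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)          # avoid division by zero
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def w4a8_linear(activations, weights):
    """Fake-quantized linear layer: 8-bit per-token activations, 4-bit per-output-channel weights."""
    a_q, a_scale = quantize_symmetric(activations, n_bits=8, axis=-1)  # one scale per token
    w_q, w_scale = quantize_symmetric(weights, n_bits=4, axis=1)       # one scale per output channel
    acc = a_q.astype(np.int32) @ w_q.astype(np.int32).T                # integer accumulation
    return acc * a_scale * w_scale.T                                   # rescale to floating point

x = np.random.randn(16, 768).astype(np.float32)    # 16 tokens, hidden size 768
w = np.random.randn(3072, 768).astype(np.float32)  # linear layer weight
y = w4a8_linear(x, w)
print(y.shape)  # (16, 3072)
```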
Related papers
- "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization [67.3213104337679]
We evaluate popular quantization formats across academic benchmarks and real-world tasks.
We find that W4A16 offers the best cost-efficiency for synchronous deployments, and for asynchronous deployment on mid-tier architectures.
arXiv Detail & Related papers (2024-11-04T18:21:59Z)
- MobileQuant: Mobile-friendly Quantization for On-device Language Models [31.75012542498791]
Large language models (LLMs) have revolutionized language processing, delivering outstanding results across multiple applications.
However, deploying LLMs on edge devices poses several challenges with respect to memory, energy, and compute costs.
We introduce a simple post-training quantization method, named MobileQuant, that extends previous work on weight-equivalent transformations.
arXiv Detail & Related papers (2024-08-25T20:41:22Z)
- OneBit: Towards Extremely Low-bit Large Language Models [66.29839811207617]
This paper boldly quantizes the weight matrices of LLMs to 1-bit, paving the way for the extremely low bit-width deployment of LLMs.
Experiments indicate that OneBit achieves good performance (at least 81% of the non-quantized performance on LLaMA models) with robust training processes.
arXiv Detail & Related papers (2024-02-17T14:26:57Z)
- EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge [40.85258685379659]
Post-Training Quantization (PTQ) methods degrade in quality when quantizing weights, activations, and KV cache together to below 8 bits.
Many Quantization-Aware Training (QAT) works quantize only the model weights, leaving the activations untouched, which does not fully exploit the potential of quantization for inference acceleration on the edge.
We propose EdgeQAT, the Entropy and Distribution Guided QAT for the optimization of lightweight LLMs to achieve inference acceleration on Edge devices.
arXiv Detail & Related papers (2024-02-16T16:10:38Z)
- Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization [12.655230451207956]
This paper focuses on post-training quantization (PTQ) in Large Language Models (LLMs).
We present two innovative techniques: activation-quantization-aware scaling (AQAS) and sequence-length-aware calibration (SLAC).
We demonstrate that our techniques significantly boost task accuracies to levels comparable with full-precision models.
arXiv Detail & Related papers (2023-11-09T06:19:51Z)
- QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models [57.04178959678024]
We show that the majority of inference computations for large generative models can be performed with both weights and activations being cast to 4 bits.
We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit.
We provide GPU kernels matching the QUIK format with highly-efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x.
arXiv Detail & Related papers (2023-10-13T17:15:05Z)
- Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM [6.85331857224501]
Large Language Models (LLMs) pose significant hardware challenges related to memory requirements and computational ability.
There are two mainstream quantization schemes for LLMs: coarse-grained (e.g., channel-wise) quantization and fine-grained (e.g., group-wise) quantization (a generic sketch contrasting the two granularities follows this list).
We introduce Dual Grained Quantization (DGQ), a novel A8W4 quantization for LLM that maintains superior performance while ensuring fast inference speed.
arXiv Detail & Related papers (2023-10-07T14:50:28Z)
- FPTQ: Fine-grained Post-Training Quantization for Large Language Models [28.11564378745513]
We propose a novel W4A8 post-training quantization method for the available open-sourced LLMs.
We obtain the state-of-the-art W4A8 quantized performance on BLOOM, LLaMA, and LLaMA-2 on standard benchmarks.
arXiv Detail & Related papers (2023-08-30T12:18:18Z)
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [57.27101446992148]
Large language models (LLMs) have revolutionized natural language processing tasks.
Recent post-training quantization (PTQ) methods are effective in reducing memory footprint and improving the computational efficiency of LLMs.
We introduce an Omnidirectionally calibrated Quantization technique for LLMs, which achieves good performance in diverse quantization settings.
arXiv Detail & Related papers (2023-08-25T02:28:35Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup-table-based approach for executing ultra-low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
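As a complement to the Dual Grained Quantization entry above, the sketch below contrasts the two weight-quantization granularities it mentions: coarse-grained (one scale per output channel) and fine-grained (one scale per group of contiguous weights). This is a generic illustration, not DGQ's algorithm; the 4-bit width and the group size of 128 are arbitrary choices of ours.

```python
# Generic contrast of coarse-grained (per-channel) vs. fine-grained (group-wise)
# symmetric weight quantization. Illustrative only; not the DGQ method itself.
import numpy as np

def quant_per_channel(w, n_bits=4):
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax       # one scale per output channel (row)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                                          # dequantized ("fake-quantized") weights

def quant_per_group(w, n_bits=4, group_size=128):
    qmax = 2 ** (n_bits - 1) - 1
    out_dim, in_dim = w.shape                                 # in_dim must be divisible by group_size
    wg = w.reshape(out_dim, in_dim // group_size, group_size) # split each row into groups
    scale = np.abs(wg).max(axis=2, keepdims=True) / qmax      # one scale per group
    q = np.clip(np.round(wg / scale), -qmax - 1, qmax)
    return (q * scale).reshape(out_dim, in_dim)

w = np.random.randn(256, 1024).astype(np.float32)
err_channel = np.abs(w - quant_per_channel(w)).mean()
err_group = np.abs(w - quant_per_group(w)).mean()
print(f"per-channel error: {err_channel:.4f}, group-wise error: {err_group:.4f}")
# Group-wise scales track local ranges, so reconstruction error is typically lower,
# at the cost of storing more scale factors.
```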