A Speed Odyssey for Deployable Quantization of LLMs
- URL: http://arxiv.org/abs/2311.09550v1
- Date: Thu, 16 Nov 2023 04:11:19 GMT
- Title: A Speed Odyssey for Deployable Quantization of LLMs
- Authors: Qingyuan Li, Ran Meng, Yiduo Li, Bo Zhang, Liang Li, Yifan Lu,
Xiangxiang Chu, Yerui Sun, Yuchen Xie
- Abstract summary: We introduce a hardware-centric approach in the construction of quantization algorithms.
Our method, OdysseyLLM, comes with a novel W4A8 kernel implementation called FastGEMM and a combined recipe of quantization strategies.
Experiments demonstrate the superiority of our W4A8 method, which delivers an actual speedup of up to 4$\times$ over Hugging Face FP16 inference and 2.23$\times$ over the state-of-the-art inference engine.
- Score: 19.12232212257625
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The large language model era urges faster and less costly inference. Prior
model compression work on LLMs tends to take a software-centric approach
focused primarily on simulated quantization performance. By neglecting the
feasibility of deployment, these approaches are often infeasible in real
practice: they either drastically push down the quantization bit width to
reduce computation, which mainstream hardware may not support, or they involve
sophisticated algorithms that introduce extra computation or memory-access
overhead. We argue that pursuing a hardware-centric approach in the
construction of quantization algorithms is crucial. In this regard, we are
driven to build our compression method on top of hardware awareness,
eliminating impractical algorithm choices while maximizing the benefit of
hardware acceleration. Our method, OdysseyLLM, comes with a novel W4A8 kernel
implementation called FastGEMM and a combined recipe of quantization
strategies. Extensive experiments demonstrate the superiority of our W4A8
method, which delivers an actual speedup of up to 4$\times$ over Hugging Face
FP16 inference, 2.23$\times$ over the state-of-the-art inference engine
TensorRT-LLM in FP16, and 1.45$\times$ over TensorRT-LLM in INT8, without
substantially harming model performance.
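To make the W4A8 scheme in the abstract concrete, below is a minimal NumPy sketch of simulated 4-bit-weight / 8-bit-activation quantization followed by an integer GEMM and a single dequantization step. This is a generic illustration of the numerics only, not the paper's FastGEMM kernel; the symmetric per-output-channel weight scales and per-tensor activation scale are assumptions made for clarity.

```python
# Minimal sketch of W4A8 numerics (simulated quantization), NOT FastGEMM.
import numpy as np

def quantize_symmetric(x, n_bits, axis=None):
    """Symmetric quantization: returns integer codes and the float scale."""
    qmax = 2 ** (n_bits - 1) - 1                       # 7 for INT4, 127 for INT8
    amax = np.max(np.abs(x), axis=axis, keepdims=True)
    scale = np.maximum(amax, 1e-8) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)).astype(np.float32)   # weight [out, in]
X = rng.standard_normal((8, 1024)).astype(np.float32)      # activations [tokens, in]

Wq, w_scale = quantize_symmetric(W, n_bits=4, axis=1)      # per-output-channel INT4
Xq, x_scale = quantize_symmetric(X, n_bits=8)              # per-tensor INT8

# Integer GEMM with int32 accumulation, then one dequantization step.
Y_int = Xq @ Wq.T
Y = Y_int.astype(np.float32) * x_scale * w_scale.T

err = np.abs(Y - X @ W.T).mean() / np.abs(X @ W.T).mean()
print(f"mean relative error of simulated W4A8 GEMM: {err:.4f}")
```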
Related papers
- COMET: Towards Practical W4A4KV4 LLMs Serving [37.30529940231099]
Quantization is a compression technology to reduce the overhead of serving large language models (LLMs) on terminal devices and in cloud data centers.
We propose a novel mixed-precision quantization algorithm (FMPQ) that compresses most activations into 4-bit with negligible accuracy loss.
We integrate the optimized W4Ax kernel into our inference framework, COMET, and provide efficient management to support popular LLMs.
arXiv Detail & Related papers (2024-10-16T02:16:53Z) - MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models [58.3342517278868]
This paper describes the design of Mixed-precision AutoRegressive LINear kernels.
It shows that batch sizes up to 16-32 can be supported with close to the maximum (4$\times$) quantization speedup.
MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling, and pipelining.
arXiv Detail & Related papers (2024-08-21T16:10:41Z) - ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models [9.444063879246242]
We introduce a novel arbitrary-bit quantization algorithm and inference framework, ABQ-LLM.
It achieves superior performance across various quantization settings and enables efficient arbitrary-precision quantized inference on the GPU.
arXiv Detail & Related papers (2024-08-16T06:39:08Z) - QQQ: Quality Quattuor-Bit Quantization for Large Language Models [22.61858069040346]
We present QQQ, a Quality Quattuor-bit Quantization method with 4-bit weights and 8-bit activations.
QQQ employs adaptive smoothing and Hessian-based compensation, significantly enhancing the performance of quantized models without extensive training.
Our specialized per-channel W4A8 GEMM and per-group W4A8 GEMM achieve speedups of 3.67$\times$ and 3.29$\times$, respectively, over FP16 GEMM.
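The QQQ summary above distinguishes per-channel from per-group W4A8 weight quantization. The sketch below illustrates that distinction: one scale per output channel versus one scale per contiguous group of input channels. The group size of 128 is an illustrative assumption, not a value taken from the paper.

```python
# Sketch of two weight-quantization granularities: per-channel vs. per-group INT4.
import numpy as np

def quantize_int4_per_channel(W):
    scale = np.abs(W).max(axis=1, keepdims=True) / 7            # [out, 1]
    Wq = np.clip(np.round(W / scale), -8, 7).astype(np.int8)
    return Wq, scale

def quantize_int4_per_group(W, group_size=128):
    out_ch, in_ch = W.shape
    Wg = W.reshape(out_ch, in_ch // group_size, group_size)
    scale = np.abs(Wg).max(axis=2, keepdims=True) / 7            # [out, groups, 1]
    Wq = np.clip(np.round(Wg / scale), -8, 7).astype(np.int8)
    return Wq.reshape(out_ch, in_ch), scale

W = np.random.default_rng(0).standard_normal((256, 1024)).astype(np.float32)
Wq_c, s_c = quantize_int4_per_channel(W)
Wq_g, s_g = quantize_int4_per_group(W)
print("per-channel scales:", s_c.shape, "per-group scales:", s_g.shape)
```

Finer groups track local weight ranges more closely (better accuracy) at the cost of more scale metadata and a slightly heavier dequantization step in the GEMM kernel.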
arXiv Detail & Related papers (2024-06-14T10:23:45Z) - Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup of up to 8.6x on CPUs for sparse-quantized LLaMA models.
arXiv Detail & Related papers (2024-05-06T16:03:32Z) - QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language
Models [57.04178959678024]
We show that the majority of inference computations for large generative models can be performed with both weights and activations being cast to 4 bits.
We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit.
We provide GPU kernels matching the QUIK format with highly-efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x.
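In the spirit of the QUIK summary above, the following sketch shows a hybrid "mostly 4-bit" layer: a small set of outlier input channels is kept in floating point while the rest of the weight matrix is quantized to INT4. The outlier-selection rule (largest mean activation magnitude) and the 1% outlier budget are illustrative assumptions, not QUIK's actual procedure.

```python
# Hybrid quantization sketch: INT4 bulk plus a few full-precision outlier columns.
import numpy as np

def split_outliers(W, X, outlier_frac=0.01):
    """Pick the input channels with the largest mean activation magnitude."""
    col_mag = np.abs(X).mean(axis=0)
    k = max(1, int(outlier_frac * W.shape[1]))
    mask = np.zeros(W.shape[1], dtype=bool)
    mask[np.argsort(col_mag)[-k:]] = True
    return mask

def int4_quant(W):
    scale = np.abs(W).max(axis=1, keepdims=True) / 7
    return np.clip(np.round(W / scale), -8, 7), scale

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 2048)).astype(np.float32)
X = rng.standard_normal((16, 2048)).astype(np.float32)

mask = split_outliers(W, X)
Wq, s = int4_quant(W[:, ~mask])                          # bulk: INT4 (simulated)
W_fp = W[:, mask]                                        # outliers: full precision
Y = X[:, ~mask] @ (Wq * s).T + X[:, mask] @ W_fp.T       # two partial GEMMs
print("quantized columns:", int((~mask).sum()), "fp outlier columns:", int(mask.sum()))
```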
arXiv Detail & Related papers (2023-10-13T17:15:05Z) - Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM [6.85331857224501]
Large Language Models (LLMs) pose significant hardware challenges related to memory requirements and computational ability.
There are two mainstream quantization schemes for LLMs: coarse-grained (e.g., channel-wise) quantization and fine-grained (e.g., group-wise) quantization.
We introduce Dual Grained Quantization (DGQ), a novel A8W4 quantization for LLM that maintains superior performance while ensuring fast inference speed.
arXiv Detail & Related papers (2023-10-07T14:50:28Z) - SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
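The Dense-and-Sparse decomposition described above can be sketched as follows: a tiny fraction of outlier weights is pulled out into a full-precision sparse matrix, while the remaining dense part is quantized to low precision. The 0.5% outlier threshold and the plain uniform 3-bit codebook here are illustrative assumptions, not SqueezeLLM's sensitivity-based non-uniform codebook; SciPy is used for the sparse storage.

```python
# Sketch of a dense-and-sparse weight decomposition (assumptions noted above).
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)).astype(np.float32)

# 1) Split: the top 0.5% of weights by magnitude become the sparse outlier part.
thresh = np.quantile(np.abs(W), 0.995)
outliers = np.where(np.abs(W) > thresh, W, 0.0)
W_sparse = csr_matrix(outliers)                  # kept in full precision
W_dense = W - outliers                           # remaining well-behaved weights

# 2) Quantize only the dense part (here: plain uniform 3-bit, per row).
scale = np.abs(W_dense).max(axis=1, keepdims=True) / 3    # signed 3-bit: [-4, 3]
W_dense_q = np.clip(np.round(W_dense / scale), -4, 3) * scale

# 3) Inference-time matvec = dense low-bit GEMV + sparse full-precision GEMV.
x = rng.standard_normal(1024).astype(np.float32)
y = W_dense_q @ x + W_sparse @ x
print("relative error:", np.linalg.norm(y - W @ x) / np.linalg.norm(W @ x))
```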
arXiv Detail & Related papers (2023-06-13T08:57:54Z) - DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures
using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
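The lookup-table idea summarized above can be illustrated with a small sketch: with 2-bit weights there are only four possible weight values, so every weight-activation product can be fetched from a tiny precomputed table instead of being multiplied. This is a scalar Python illustration of the principle only; the actual DeepGEMM kernels exploit SIMD table-lookup instructions, and the 2-bit codebook values below are assumptions.

```python
# Lookup-table dot product for ultra-low-precision weights (conceptual sketch).
import numpy as np

WEIGHT_LEVELS = np.array([-2, -1, 0, 1], dtype=np.int32)     # assumed 2-bit codebook
ACT_LEVELS = np.arange(-128, 128, dtype=np.int32)            # 8-bit activation values

# Precompute all 4 x 256 possible weight-activation products once.
LUT = WEIGHT_LEVELS[:, None] * ACT_LEVELS[None, :]           # shape (4, 256)

def dot_via_lut(w_codes, a_int8):
    """Dot product via table lookups: no multiplications in the inner loop."""
    return int(LUT[w_codes, a_int8 + 128].sum())

rng = np.random.default_rng(0)
w_codes = rng.integers(0, 4, size=1024)                      # 2-bit weight indices
a_int8 = rng.integers(-128, 128, size=1024).astype(np.int32) # int8 activations

ref = int((WEIGHT_LEVELS[w_codes] * a_int8).sum())           # multiply-based reference
assert dot_via_lut(w_codes, a_int8) == ref
print("LUT dot product matches the multiply-based reference:", ref)
```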
arXiv Detail & Related papers (2023-04-18T15:13:10Z) - Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and
Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
arXiv Detail & Related papers (2022-09-20T09:28:26Z)