INT-FlashAttention: Enabling Flash Attention for INT8 Quantization
- URL: http://arxiv.org/abs/2409.16997v2
- Date: Thu, 26 Sep 2024 06:13:04 GMT
- Title: INT-FlashAttention: Enabling Flash Attention for INT8 Quantization
- Authors: Shimao Chen, Zirui Liu, Zhiying Wu, Ce Zheng, Peizhuang Cong, Zihan Jiang, Yuhan Wu, Lei Su, Tong Yang
- Abstract summary: INT-FlashAttention is the first quantization architecture compatible with the forward workflow of FlashAttention.
We implement our INT-FlashAttention prototype with fully INT8 activations and general matrix-multiplication (GEMM) kernels.
Experimental results show INT-FlashAttention achieves 72% faster inference speed and 82% smaller quantization error compared to standard FlashAttention.
- Score: 16.920037999454625
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As the foundation of large language models (LLMs), the self-attention module faces the challenge of quadratic time and memory complexity with respect to sequence length. FlashAttention accelerates attention computation and reduces its memory usage by leveraging the GPU memory hierarchy. A promising research direction is to integrate FlashAttention with quantization methods. This paper introduces INT-FlashAttention, the first INT8 quantization architecture compatible with the forward workflow of FlashAttention, which significantly improves the inference speed of FlashAttention on Ampere GPUs. We implement our INT-FlashAttention prototype with fully INT8 activations and general matrix-multiplication (GEMM) kernels, making it the first attention operator with fully INT8 input. As a general token-level post-training quantization framework, INT-FlashAttention is also compatible with other data formats such as INT4. Experimental results show INT-FlashAttention achieves 72% faster inference speed and 82% smaller quantization error compared to standard FlashAttention with the FP16 and FP8 data formats.
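The sketch below illustrates, at a high level, how symmetric per-token INT8 quantization can be folded into a FlashAttention-style tiled forward pass with online softmax. It is only a NumPy approximation of the idea, not the authors' CUDA kernel: the tiling covers only the key/value dimension, V is left unquantized for simplicity, and the exact scaling scheme is an assumption based on the abstract's description of token-level post-training quantization with fully INT8 inputs to the QK^T GEMM.

```python
import numpy as np

def quantize_per_token(x):
    """Symmetric per-token (row-wise) INT8 quantization: int8 codes plus one FP32 scale per row."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0 + 1e-12
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def int8_tiled_attention(Q, K, V, block=64):
    """FlashAttention-style forward pass where Q@K^T runs as an INT8 GEMM with INT32
    accumulation and is dequantized with per-token scales; softmax statistics are kept
    online so the full attention matrix is never materialized. V stays in floating point."""
    n, d = Q.shape
    Qq, sq = quantize_per_token(Q)                     # [n, d] int8, [n, 1] fp32
    Kq, sk = quantize_per_token(K)
    O = np.zeros((n, d), dtype=np.float32)             # unnormalized output accumulator
    m = np.full((n, 1), -np.inf, dtype=np.float32)     # running row-wise max
    l = np.zeros((n, 1), dtype=np.float32)             # running softmax denominator
    inv_sqrt_d = 1.0 / np.sqrt(d)
    for j in range(0, n, block):                       # iterate over K/V tiles
        Kb, Vb = Kq[j:j + block], V[j:j + block]
        # INT8 x INT8 -> INT32 GEMM, then dequantize with the per-token scales
        S = (Qq.astype(np.int32) @ Kb.astype(np.int32).T).astype(np.float32)
        S = S * sq * sk[j:j + block].T * inv_sqrt_d
        m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
        P = np.exp(S - m_new)
        alpha = np.exp(m - m_new)                      # rescale previous partial results
        l = alpha * l + P.sum(axis=-1, keepdims=True)
        O = alpha * O + P @ Vb
        m = m_new
    return O / l
```

Comparing the output of this sketch with a plain FP32 softmax(QK^T / sqrt(d)) V baseline gives a concrete handle on the kind of token-level quantization error the experiments measure.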
Related papers
- "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization [67.3213104337679]
We evaluate popular quantization formats across academic benchmarks and real-world tasks.
We find that W4A16 offers the best cost-efficiency for synchronous deployments, and for asynchronous deployment on mid-tier architectures.
arXiv Detail & Related papers (2024-11-04T18:21:59Z)
- MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models [58.3342517278868]
This paper describes the design of Mixed-precision AutoRegressive LINear kernels.
It shows that batch sizes of up to 16-32 can be supported with close to the maximum ($4\times$) quantization speedup.
MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling, and pipelining.
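For readers unfamiliar with the W4A16 setting that MARLIN targets, the toy sketch below shows only what a mixed-precision linear layer computes (4-bit weight codes dequantized with per-group scales against FP16 activations); it does not capture the asynchronous memory access, scheduling, or pipelining the kernel relies on, and all names here are illustrative rather than taken from the paper.

```python
import numpy as np

def w4a16_linear(x_fp16, w_codes, scales, group_size=128):
    """Toy W4A16 linear layer: 4-bit weight codes (0..15, symmetric around 8) are
    dequantized with per-group scales, then multiplied with FP16 activations.
    A fused kernel performs the dequantization inside the GEMM instead of
    materializing the FP16 weight matrix as done here."""
    out_features, in_features = w_codes.shape
    w = w_codes.astype(np.float32) - 8.0                        # centre the 4-bit codes
    per_col_scale = np.repeat(scales, group_size, axis=1)[:, :in_features]
    w = (w * per_col_scale).astype(np.float16)                  # [out, in] dequantized weights
    return x_fp16 @ w.T                                         # FP16 activations x FP16 weights
```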
arXiv Detail & Related papers (2024-08-21T16:10:41Z)
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision [14.426543629408984]
Attention is the bottleneck for large language models and long-context applications.
We develop three main techniques to speed up attention on GPUs.
We demonstrate that our method, FlashAttention-3, achieves a 1.5-2.0$\times$ speedup on H100 GPUs, with FP16 reaching up to 740 TFLOPs/s (75% utilization) and FP8 reaching close to 1.2 PFLOPs/s.
arXiv Detail & Related papers (2024-07-11T15:44:48Z)
- A Speed Odyssey for Deployable Quantization of LLMs [19.12232212257625]
We introduce a hardware-centric approach in the construction of quantization algorithms.
Our method, OdysseyLLM, comes with a novel W4A8 kernel implementation called FastGEMM and a combined recipe of quantization strategies.
Experiments demonstrate the superiority of our W4A8 method, which delivers actual speedups of up to 4$\times$ compared to Hugging Face FP16 inference and 2.23$\times$ compared to the state-of-the-art inference engine.
arXiv Detail & Related papers (2023-11-16T04:11:19Z)
- ZeroQuant-HERO: Hardware-Enhanced Robust Optimized Post-Training Quantization Framework for W8A8 Transformers [38.03919998600518]
Quantization techniques are pivotal in reducing the memory and computational demands of deep neural network inference.
Existing solutions, such as ZeroQuant, offer dynamic quantization for models like BERT and GPT but overlook crucial memory-bounded operators and the complexities of per-token quantization.
We present a novel, fully hardware-enhanced robust optimized post-training W8A8 quantization framework, ZeroQuant-HERO.
arXiv Detail & Related papers (2023-10-26T18:34:41Z)
- DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training [82.06732962485754]
FlashAttention effectively reduces the quadratic peak memory usage to linear in training transformer-based large language models (LLMs) on a single GPU.
We introduce DISTFLASHATTN, a memory-efficient attention mechanism optimized for long-context LLMs training.
It achieves 1.67x and 1.26-1.88x speedups compared to the recent Ring Attention and DeepSpeed-Ulysses, respectively.
arXiv Detail & Related papers (2023-10-05T03:47:57Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
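To make the lookup-table idea concrete, here is a schematic NumPy illustration (not the paper's kernel): with 2-bit weights there are only four possible weight values, so their products with each activation can be precomputed once and gathered instead of multiplied. The real implementation maps these lookups onto SIMD byte-shuffle instructions and uses its own data layout, which this sketch does not attempt; the codebook values below are hypothetical.

```python
import numpy as np

def lut_matvec_2bit(w_codes, codebook, x):
    """Schematic LUT-based matvec for 2-bit weights: w_codes[i, j] in {0,1,2,3} indexes
    `codebook`; products of each code value with each activation are precomputed once
    and then gathered, replacing per-element multiplications with table lookups."""
    lut = np.outer(codebook, x)                  # lut[c, j] = codebook[c] * x[j], shape [4, n]
    cols = np.arange(x.shape[0])
    # Each output element is a sum of table lookups rather than multiplications
    return lut[w_codes, cols].sum(axis=1)        # shape [m]

# Tiny consistency check against a plain matvec
rng = np.random.default_rng(0)
codebook = np.array([-1.5, -0.5, 0.5, 1.5])      # hypothetical 2-bit quantization levels
w_codes = rng.integers(0, 4, size=(8, 16))
x = rng.standard_normal(16).astype(np.float32)
assert np.allclose(lut_matvec_2bit(w_codes, codebook, x), codebook[w_codes] @ x)
```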
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- FP8 versus INT8 for efficient deep learning inference [14.98281493168929]
We compare the performance of the FP8 and INT formats for efficient on-device inference.
We show that in dedicated hardware the FP formats are somewhere between 50% and 180% less efficient in terms of compute than the INT format.
We conclude that although the proposed FP8 format could be good for training, the results for inference do not warrant a dedicated implementation of FP8.
arXiv Detail & Related papers (2023-03-31T10:29:17Z)
- FP8 Formats for Deep Learning [49.54015320992368]
We propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings.
E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit-pattern for NaNs.
We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions.
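As a quick worked example of the E4M3 encoding described above (bias 7, no infinities, a single NaN mantissa pattern in the top exponent), the small decoder below reproduces the format's headline values; it was written for this summary under those assumptions and is not code from the paper.

```python
def decode_e4m3(byte):
    """Decode one E4M3 byte (1 sign, 4 exponent, 3 mantissa bits, bias 7): no infinities,
    and only exponent=1111 with mantissa=111 encodes NaN, so the top exponent is
    otherwise reused for ordinary finite values, extending the dynamic range."""
    s = (byte >> 7) & 0x1
    e = (byte >> 3) & 0xF
    m = byte & 0x7
    sign = -1.0 if s else 1.0
    if e == 0xF and m == 0x7:                     # the single NaN mantissa bit-pattern
        return float("nan")
    if e == 0:                                     # subnormals: exponent -6, no implicit leading 1
        return sign * (m / 8.0) * 2.0 ** -6
    return sign * (1.0 + m / 8.0) * 2.0 ** (e - 7)

print(decode_e4m3(0b0_1111_110))   # 448.0, the largest finite E4M3 value
print(decode_e4m3(0b0_0000_001))   # 2**-9 ~= 0.00195, the smallest subnormal
print(decode_e4m3(0b0_1111_111))   # nan
```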
arXiv Detail & Related papers (2022-09-12T17:39:55Z)
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness [80.3586155104237]
FlashAttention is an IO-aware exact attention algorithm for Transformers.
It reduces the number of memory reads/writes between GPU high-bandwidth memory (HBM) and GPU on-chip SRAM.
FlashAttention and block-sparse FlashAttention enable longer context in Transformers.
arXiv Detail & Related papers (2022-05-27T17:53:09Z)