FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs
- URL: http://arxiv.org/abs/2410.16663v1
- Date: Tue, 22 Oct 2024 03:29:33 GMT
- Title: FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs
- Authors: Haoran Lin, Xianzhi Yu, Kang Zhao, Lu Hou, Zongyuan Zhan, Stanislav Kamenev, Han Bao, Ting Hu, Mingkai Wang, Qixin Chang, Siyue Sui, Weihao Sun, Jiaxin Hu, Jun Yao, Zekun Yin, Cheng Qian, Ying Zhang, Yinfei Pan, Yu Yang, Weiguo Liu
- Abstract summary: The FlashAttention series has been widely applied in the inference of large language models (LLMs).
However, the FlashAttention series only supports high-end GPU architectures, e.g., Ampere and Hopper.
In this work, we propose FastAttention, which pioneers the adaptation of the FlashAttention series to NPUs and low-resource GPUs.
- Score: 25.016652939042824
- License:
- Abstract: The FlashAttention series has been widely applied in the inference of large language models (LLMs). However, it only supports high-end GPU architectures, e.g., Ampere and Hopper, is not easily transferable to NPUs and low-resource GPUs, and is inefficient in multi-NPU or multi-GPU inference scenarios. In this work, we propose FastAttention, which pioneers the adaptation of the FlashAttention series to NPUs and low-resource GPUs to boost LLM inference efficiency. Specifically, we take Ascend NPUs and Volta-based GPUs as representative targets when designing FastAttention. We migrate the FlashAttention series to Ascend NPUs by proposing a novel two-level tiling strategy for runtime speedup, a tiling-mask strategy for memory saving, and a tiling-AllReduce strategy for reducing communication overhead. In addition, we adapt FlashAttention to Volta-based GPUs by redesigning the operand layout in shared memory and introducing a simple yet effective CPU-GPU cooperative strategy for efficient memory utilization. On Ascend NPUs, FastAttention achieves a 10.7$\times$ speedup over the standard attention implementation, and Llama-7B with FastAttention reaches up to 5.16$\times$ higher throughput than with standard attention. On Volta-architecture GPUs, FastAttention yields a 1.43$\times$ speedup over its equivalents in \texttt{xformers}, and Pangu-38B with FastAttention brings a 1.46$\times$ end-to-end speedup using FasterTransformer. Coupled with the proposed CPU-GPU cooperative strategy, FastAttention supports a maximal input length of 256K on 8 V100 GPUs. All the code will be made available soon.
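The kernel being ported here is FlashAttention-style tiled attention with an online softmax: queries, keys, and values are processed in blocks that fit in fast on-chip memory, and the softmax normalization is maintained incrementally so the full attention matrix is never materialized. The sketch below is a minimal, framework-level illustration of that tiling pattern in PyTorch; the block sizes, single-head layout, and function name are assumptions chosen for readability, not the paper's NPU or Volta kernel.

```python
import torch

def tiled_attention(q, k, v, block_q=64, block_k=64):
    """Illustrative FlashAttention-style tiled attention with online softmax.

    q, k, v: [seq_len, head_dim] tensors for a single head. A readability
    sketch of the tiling idea only, not an optimized GPU/NPU kernel.
    """
    seq_len, head_dim = q.shape
    out = torch.empty_like(q)
    scale = head_dim ** -0.5

    for qs in range(0, seq_len, block_q):
        q_blk = q[qs:qs + block_q] * scale                     # [Bq, d]
        n_q = q_blk.shape[0]
        # Running statistics of the online softmax for this query block.
        row_max = torch.full((n_q,), float("-inf"), dtype=q.dtype, device=q.device)
        row_sum = torch.zeros(n_q, dtype=q.dtype, device=q.device)
        acc = torch.zeros_like(q_blk)

        for ks in range(0, seq_len, block_k):
            k_blk = k[ks:ks + block_k]                          # [Bk, d]
            v_blk = v[ks:ks + block_k]                          # [Bk, d]
            scores = q_blk @ k_blk.T                            # [Bq, Bk]

            new_max = torch.maximum(row_max, scores.max(dim=-1).values)
            rescale = torch.exp(row_max - new_max)   # correct earlier terms
            p = torch.exp(scores - new_max[:, None])
            acc = acc * rescale[:, None] + p @ v_blk
            row_sum = row_sum * rescale + p.sum(dim=-1)
            row_max = new_max

        out[qs:qs + block_q] = acc / row_sum[:, None]
    return out

# Sanity check against standard (materialized) attention.
q, k, v = (torch.randn(256, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) * 64 ** -0.5, dim=-1) @ v
assert torch.allclose(tiled_attention(q, k, v), ref, atol=1e-4)
```

On real hardware the two loops correspond to streaming tiles through on-chip buffers (shared memory on GPUs, the on-chip buffers of Ascend NPUs), which appears to be the level that the paper's two-level tiling and shared-memory operand-layout redesign target. A second sketch after the related-papers list below illustrates the compute-communication overlap that the tiling-AllReduce strategy and DISTFLASHATTN-style distributed attention both rely on.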
Related papers
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision [14.426543629408984]
Attention is the bottleneck for large language models and long-context applications.
We develop three main techniques to speed up attention on GPUs.
We demonstrate that our method, FlashAttention-3, achieves speedup on H100 GPUs by 1.5-2.0$\times$ with FP16 reaching up to 740 TFLOPs/s (75% utilization) and with FP8 reaching close to 1.2 PFLOPs/s.
arXiv Detail & Related papers (2024-07-11T15:44:48Z) - Video-Infinity: Distributed Long Video Generation [73.30145218077074]
Diffusion models have recently achieved remarkable results for video generation.
Our method generates videos up to 2,300 frames in approximately 5 minutes, enabling long video generation at a speed 100 times faster than the prior methods.
arXiv Detail & Related papers (2024-06-24T01:56:12Z) - NeRF-XL: Scaling NeRFs with Multiple GPUs [72.75214892939411]
We present NeRF-XL, a principled method for distributing Neural Radiance Fields (NeRFs) across multiple GPUs.
We show improvements in reconstruction quality with larger parameter counts and speed improvements with more GPUs.
We demonstrate the effectiveness of NeRF-XL on a wide variety of datasets, including the largest open-source dataset to date, MatrixCity, containing 258K images covering a 25 km$^2$ city area.
arXiv Detail & Related papers (2024-04-24T21:43:15Z) - FULL-W2V: Fully Exploiting Data Reuse for W2V on GPU-Accelerated Systems [5.572152653851948]
FULL-W2V exploits the opportunities for data reuse in the W2V algorithm to reduce access to low memory levels and improve temporal locality.
Our prototype implementation achieves 2.97X speedup when ported from Nvidia Pascal P100 to Volta V100 cards, and outperforms the state-of-the-art by 5.72X on V100 cards with the same embedding quality.
arXiv Detail & Related papers (2023-12-12T21:22:07Z) - DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training [82.06732962485754]
FlashAttention effectively reduces the quadratic peak memory usage to linear in training transformer-based large language models (LLMs) on a single GPU.
We introduce DISTFLASHATTN, a memory-efficient attention mechanism optimized for long-context LLMs training.
It achieves 1.67x and 1.26 - 1.88x speedup compared to recent Ring Attention and DeepSpeed-Ulysses.
arXiv Detail & Related papers (2023-10-05T03:47:57Z) - FlashAttention-2: Faster Attention with Better Parallelism and Work
Partitioning [11.508362885430133]
We exploit the asymmetric GPU memory hierarchy to bring significant memory saving and runtime speedup.
FlashAttention is still not nearly as fast as optimized matrix-multiply (GEMM) operations, reaching only 25-40% of the theoretical maximum FLOPs/s.
We propose FlashAttention-2, with better work partitioning to address these issues.
arXiv Detail & Related papers (2023-07-17T17:50:36Z) - Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and
Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
arXiv Detail & Related papers (2022-09-20T09:28:26Z) - FlashAttention: Fast and Memory-Efficient Exact Attention with
IO-Awareness [80.3586155104237]
FlashAttention is an IO-aware exact attention algorithm for Transformers.
It reduces the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM.
FlashAttention and block-sparse FlashAttention enable longer context in Transformers.
arXiv Detail & Related papers (2022-05-27T17:53:09Z) - TFApprox: Towards a Fast Emulation of DNN Approximate Hardware
Accelerators on GPU [0.4817429789586127]
Energy efficiency of hardware accelerators of deep neural networks (DNN) can be improved by introducing approximate arithmetic circuits.
A software emulation of the DNN accelerator is usually executed on CPU or GPU.
This emulation is typically two or three orders of magnitude slower than a software DNN implementation running with standard arithmetic on the same CPU or GPU.
arXiv Detail & Related papers (2020-02-21T08:22:56Z) - Efficient Video Semantic Segmentation with Labels Propagation and
Refinement [138.55845680523908]
This paper tackles the problem of real-time semantic segmentation of high definition videos using a hybrid GPU / CPU approach.
We propose an Efficient Video Segmentation (EVS) pipeline that combines: (i) on the CPU, a very fast optical flow method that is used to exploit the temporal aspect of the video and propagate semantic information from one frame to the next.
On the popular Cityscapes dataset with high resolution frames (2048 x 1024), the proposed operating points range from 80 to 1000 Hz on a single GPU and CPU.
arXiv Detail & Related papers (2019-12-26T11:45:15Z)
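Several of the techniques above, notably FastAttention's tiling-AllReduce strategy and DISTFLASHATTN, revolve around overlapping tile-wise attention computation with collective communication so that AllReduce latency is hidden behind compute. The sketch below shows that generic overlap pattern with `torch.distributed`; the tile interface, function name, and launch setup are illustrative assumptions and do not reproduce the implementation of any paper listed here.

```python
import torch
import torch.distributed as dist

def attention_with_overlapped_allreduce(tiles, attend, group=None):
    """Generic compute/communication overlap for tile-wise attention.

    `tiles` is a list of per-tile inputs and `attend` maps one tile to a
    partial output tensor (e.g. one query block whose result must be summed
    across tensor-parallel ranks). As soon as a tile's partial output is
    ready, an asynchronous AllReduce is issued for it so communication for
    tile i overlaps with computation of tile i+1. Illustrative sketch only;
    assumes dist.init_process_group() was already called (e.g. via torchrun).
    """
    handles, outputs = [], []
    for tile in tiles:
        partial = attend(tile)                      # compute this tile
        handles.append(dist.all_reduce(partial, op=dist.ReduceOp.SUM,
                                       group=group, async_op=True))
        outputs.append(partial)                     # reduced in place
    for h in handles:                               # drain in-flight comms
        h.wait()
    return torch.cat(outputs, dim=0)
```

Whether the overlap actually hides latency depends on tile granularity and interconnect bandwidth; the FastAttention abstract credits its tiling-AllReduce strategy with reducing exactly this communication overhead on Ascend NPUs.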
This list is automatically generated from the titles and abstracts of the papers in this site.