FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor
Cores
- URL: http://arxiv.org/abs/2311.05908v1
- Date: Fri, 10 Nov 2023 07:33:35 GMT
- Title: FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor
Cores
- Authors: Daniel Y. Fu, Hermann Kumbong, Eric Nguyen, Christopher Ré
- Abstract summary: Convolution models with long filters have demonstrated state-of-the-art reasoning abilities in many long-sequence tasks.
Fast Fourier Transform (FFT) allows long convolutions to run in $O(N \log N)$ time in sequence length $N$ but has poor hardware utilization.
In this paper, we study how to optimize the FFT convolution.
- Score: 18.016204763652553
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolution models with long filters have demonstrated state-of-the-art
reasoning abilities in many long-sequence tasks but lag behind the most
optimized Transformers in wall-clock time. A major bottleneck is the Fast
Fourier Transform (FFT)--which allows long convolutions to run in $O(N \log N)$
time in sequence length $N$ but has poor hardware utilization. In this paper,
we study how to optimize the FFT convolution. We find two key bottlenecks: the
FFT does not effectively use specialized matrix multiply units, and it incurs
expensive I/O between layers of the memory hierarchy. In response, we propose
FlashFFTConv. FlashFFTConv uses a matrix decomposition that computes the FFT
using matrix multiply units and enables kernel fusion for long sequences,
reducing I/O. We also present two sparse convolution algorithms--1) partial
convolutions and 2) frequency-sparse convolutions--which can be implemented
simply by skipping blocks in the matrix decomposition, enabling further
opportunities for memory and compute savings. FlashFFTConv speeds up exact FFT
convolutions by up to 7.93$\times$ over PyTorch and achieves up to 4.4$\times$
speedup end-to-end. Given the same compute budget, FlashFFTConv allows
Hyena-GPT-s to achieve 2.3 points better perplexity on the PILE and
M2-BERT-base to achieve 3.3 points higher GLUE score--matching models with
twice the parameter count. FlashFFTConv also achieves 96.1% accuracy on
Path-512, a high-resolution vision task where no model had previously achieved
better than 50%. Furthermore, partial convolutions enable longer-sequence
models--yielding the first DNA model that can process the longest human genes
(2.3M base pairs)--and frequency-sparse convolutions speed up pretrained models
while maintaining or improving model quality.
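
The idea of computing an FFT with matrix multiplies can be illustrated with a small sketch. The abstract does not spell out the decomposition FlashFFTConv uses, so the code below assumes the classic two-factor Cooley-Tukey factorization ($N = N_1 N_2$): one small DFT matrix multiply, an elementwise twiddle correction, and a second small DFT matrix multiply. All function names (`dft_matrix`, `matmul`, `dft_two_factor`, `idft_two_factor`, `fft_conv`, `direct_conv`) are hypothetical, and this is pure Python for readability, not the authors' fused GPU kernel.

```python
import cmath

def dft_matrix(n):
    """Dense n x n DFT matrix with entries exp(-2*pi*i*r*c/n)."""
    return [[cmath.exp(-2j * cmath.pi * r * c / n) for c in range(n)]
            for r in range(n)]

def matmul(a, b):
    """Plain matrix multiply; on a GPU this is the step that would map to matrix multiply units."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def dft_two_factor(x, n1, n2):
    """Length n1*n2 DFT as two small matrix multiplies plus a twiddle step."""
    n = n1 * n2
    assert len(x) == n
    # Reshape the input row-major into an n1 x n2 matrix.
    a = [[x[n2 * t1 + t2] for t2 in range(n2)] for t1 in range(n1)]
    # Step 1: length-n1 DFT down every column.
    b = matmul(dft_matrix(n1), a)
    # Step 2: elementwise twiddle factors exp(-2*pi*i*k1*t2/n).
    c = [[b[k1][t2] * cmath.exp(-2j * cmath.pi * k1 * t2 / n)
          for t2 in range(n2)] for k1 in range(n1)]
    # Step 3: length-n2 DFT along every row.
    d = matmul(c, dft_matrix(n2))
    # Output frequency k = k1 + n1*k2 is stored at d[k1][k2].
    return [d[k % n1][k // n1] for k in range(n)]

def idft_two_factor(spectrum, n1, n2):
    """Inverse DFT via the conjugation identity idft(X) = conj(dft(conj(X))) / n."""
    n = n1 * n2
    y = dft_two_factor([v.conjugate() for v in spectrum], n1, n2)
    return [v.conjugate() / n for v in y]

def fft_conv(u, k, n1, n2):
    """Causal convolution of two real length-N sequences, zero-padded to 2N = n1*n2."""
    n = len(u)
    assert len(k) == n and n1 * n2 == 2 * n
    pad = [0.0] * n
    u_f = dft_two_factor(list(u) + pad, n1, n2)
    k_f = dft_two_factor(list(k) + pad, n1, n2)
    # Convolution in time is pointwise multiplication in frequency.
    y = idft_two_factor([p * q for p, q in zip(u_f, k_f)], n1, n2)
    return [v.real for v in y[:n]]

def direct_conv(u, k):
    """Reference O(N^2) causal convolution."""
    return [sum(u[j] * k[i - j] for j in range(i + 1)) for i in range(len(u))]

if __name__ == "__main__":
    u = [1.0, 2.0, 3.0, 4.0]
    k = [0.5, 0.25, 0.0, 0.0]
    print(fft_conv(u, k, 2, 4))
    print(direct_conv(u, k))
```

The two printed lists should agree up to floating-point rounding. Because the transform is laid out as an $N_1 \times N_2$ matrix, dropping part of the computation corresponds to skipping rows or columns of that matrix, which is one way to read the abstract's remark that the sparse variants skip blocks in the decomposition.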
Related papers
- FlashAttention-2: Faster Attention with Better Parallelism and Work
Partitioning [11.508362885430133]
We exploit the asymmetric GPU memory hierarchy to bring significant memory saving and runtime speedup.
FlashAttention is still not nearly as fast as optimized matrix-multiply (GEMM) operations, reaching only 25-40% of the theoretical maximum FLOPs/s.
We propose FlashAttention-2, with better work partitioning to address these issues.
arXiv Detail & Related papers (2023-07-17T17:50:36Z)
- Im2win: An Efficient Convolution Paradigm on GPU [1.9162301033784574]
This paper proposes a convolution paradigm on GPU called im2win, which not only reduces memory footprint but also offers continuous memory accesses.
We compare our implementation with direct convolution, PyTorch's GEMM-based convolution, and six DNN-based convolution implementations on twelve state-of-the-art benchmarks.
arXiv Detail & Related papers (2023-06-25T19:09:56Z)
- Simple Hardware-Efficient Long Convolutions for Sequence Modeling [18.3719016967593]
State space models (SSMs) have high performance on long sequence modeling.
We study whether a simple alternative can match SSMs in performance and efficiency.
We develop FlashButterfly, an IO-aware algorithm to improve the runtime performance of long convolutions.
arXiv Detail & Related papers (2023-02-13T19:19:23Z)
- FInC Flow: Fast and Invertible $k \times k$ Convolutions for Normalizing
Flows [2.156373334386171]
Invertible convolutions have been an essential element for building expressive normalizing flow-based generative models.
We propose a $k \times k$ convolutional layer and Deep Normalizing Flow architecture.
arXiv Detail & Related papers (2023-01-23T04:31:03Z)
- Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z)
- Early Convolutions Help Transformers See Better [63.21712652156238]
Vision transformer (ViT) models exhibit substandard optimizability.
Modern convolutional neural networks are far easier to optimize.
Using a convolutional stem in ViT dramatically increases optimization stability and also improves peak performance.
arXiv Detail & Related papers (2021-06-28T17:59:33Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional
Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
- Decoupled Dynamic Filter Networks [85.38058820176047]
We propose the Decoupled Dynamic Filter (DDF) that can simultaneously tackle both of these shortcomings.
Inspired by recent advances in attention, DDF decouples a depth-wise dynamic filter into spatial and channel dynamic filters.
We observe a significant boost in performance when replacing standard convolution with DDF in classification networks.
arXiv Detail & Related papers (2021-04-29T04:55:33Z)
- XSepConv: Extremely Separated Convolution [60.90871656244126]
We propose a novel extremely separated convolutional block (XSepConv).
It fuses spatially separable convolutions into depthwise convolution to reduce both the computational cost and parameter size of large kernels.
XSepConv is designed to be an efficient alternative to vanilla depthwise convolution with large kernel sizes.
arXiv Detail & Related papers (2020-02-27T11:46:17Z)
- DFTpy: An efficient and object-oriented platform for orbital-free DFT
simulations [55.41644538483948]
In this work, we present DFTpy, an open source software implementing OFDFT written entirely in Python 3.
We showcase the electronic structure of a million-atom system of aluminum metal which was computed on a single CPU.
DFTpy is released under the MIT license.
arXiv Detail & Related papers (2020-02-07T19:07:41Z)