Related papers: FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores

FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores

URL: http://arxiv.org/abs/2311.05908v1
Date: Fri, 10 Nov 2023 07:33:35 GMT
Title: FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores
Authors: Daniel Y. Fu, Hermann Kumbong, Eric Nguyen, Christopher R\'e
Abstract summary: Convolution models with long filters have demonstrated state-of-the-art reasoning abilities in many long-sequence tasks. Fast Fourier Transform (FFT) allows long convolutions to run in $O(N logN)$ time in sequence length $N$ but has poor hardware utilization. In this paper, we study how to optimize the FFT convolution.
Score: 18.016204763652553
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Convolution models with long filters have demonstrated state-of-the-art reasoning abilities in many long-sequence tasks but lag behind the most optimized Transformers in wall-clock time. A major bottleneck is the Fast Fourier Transform (FFT)--which allows long convolutions to run in $O(N logN)$ time in sequence length $N$ but has poor hardware utilization. In this paper, we study how to optimize the FFT convolution. We find two key bottlenecks: the FFT does not effectively use specialized matrix multiply units, and it incurs expensive I/O between layers of the memory hierarchy. In response, we propose FlashFFTConv. FlashFFTConv uses a matrix decomposition that computes the FFT using matrix multiply units and enables kernel fusion for long sequences, reducing I/O. We also present two sparse convolution algorithms--1) partial convolutions and 2) frequency-sparse convolutions--which can be implemented simply by skipping blocks in the matrix decomposition, enabling further opportunities for memory and compute savings. FlashFFTConv speeds up exact FFT convolutions by up to 7.93$\times$ over PyTorch and achieves up to 4.4$\times$ speedup end-to-end. Given the same compute budget, FlashFFTConv allows Hyena-GPT-s to achieve 2.3 points better perplexity on the PILE and M2-BERT-base to achieve 3.3 points higher GLUE score--matching models with twice the parameter count. FlashFFTConv also achieves 96.1% accuracy on Path-512, a high-resolution vision task where no model had previously achieved better than 50%. Furthermore, partial convolutions enable longer-sequence models--yielding the first DNA model that can process the longest human genes (2.3M base pairs)--and frequency-sparse convolutions speed up pretrained models while maintaining or improving model quality.

Related papers

Orthogonal Finetuning Made Scalable [87.49040247077389]
Orthogonal finetuning (OFT) offers highly parameter-efficient adaptation while preventing catastrophic forgetting, but its high runtime and memory demands limit practical deployment.<n>We identify the core computational bottleneck in OFT as its weight-centric implementation, which relies on costly matrix-matrix multiplications with cubic complexity.<n>We propose OFTv2, an input-centric reformulation that instead uses matrix-vector multiplications (i.e., matrix-free computation), reducing the computational cost to quadratic.<n>These modifications allow OFTv2 to achieve up to 10x faster training and 3x lower GPU memory usage without compromising performance.
arXiv Detail & Related papers (2025-06-24T17:59:49Z)
RecConv: Efficient Recursive Convolutions for Multi-Frequency Representations [8.346566205092433]
RecConv is a decomposition strategy that efficiently constructs multi-frequency representations using small- Kernel convolutions. RecNeXt-M3 outperforms RepViT-M1.1 by 1.9 $APbox$ on COCO with similar FLOPs.
arXiv Detail & Related papers (2024-12-27T13:13:52Z)
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning [11.508362885430133]
We exploit the asymmetric GPU memory hierarchy to bring significant memory saving and runtime speedup. FlashAttention is still not nearly as fast as optimized matrix-multiply (GEMM) operations, reaching only 25-40% of the theoretical maximum FLOPs/s. We propose FlashAttention-2, with better work partitioning to address these issues.
arXiv Detail & Related papers (2023-07-17T17:50:36Z)
Im2win: An Efficient Convolution Paradigm on GPU [1.9162301033784574]
This paper proposes a paradigm on convolution-based convolutions called im2win, which only reduces memory footprint but also offers continuous memory accesses. We compare our implementation with the direct convolution, and PyTorch's GEMM-based convolution, and six$$ DNN-based convolution implementations, with twelve state-of-the-art benchmarks.
arXiv Detail & Related papers (2023-06-25T19:09:56Z)
Simple Hardware-Efficient Long Convolutions for Sequence Modeling [18.3719016967593]
State space models (SSMs) have high performance on long sequence modeling. We study whether a simple alternative can match SSMs in performance and efficiency. We develop FlashButterfly, an IO-aware algorithm to improve the runtime performance of long convolutions.
arXiv Detail & Related papers (2023-02-13T19:19:23Z)
FInC Flow: Fast and Invertible $k \times k$ Convolutions for Normalizing Flows [2.156373334386171]
Invertible convolutions have been an essential element for building expressive normalizing flow-based generative models. We propose a $k times k$ convolutional layer and Deep Normalizing Flow architecture.
arXiv Detail & Related papers (2023-01-23T04:31:03Z)
Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks. Existing methods are either theoretically flawed or empirically ineffective for visual recognition. We propose a family of Softmax-Free Transformers (SOFT)
arXiv Detail & Related papers (2022-07-05T03:08:27Z)
Early Convolutions Help Transformers See Better [63.21712652156238]
Vision transformer (ViT) models exhibit substandard optimizability. Modern convolutional neural networks are far easier to optimize. Using a convolutional stem in ViT dramatically increases optimization stability and also improves peak performance.
arXiv Detail & Related papers (2021-06-28T17:59:33Z)
Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE) Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using Fast Fourier Transform (FFT)
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
Decoupled Dynamic Filter Networks [85.38058820176047]
We propose the Decoupled Dynamic Filter (DDF) that can simultaneously tackle both of these shortcomings. Inspired by recent advances in attention, DDF decouples a depth-wise dynamic filter into spatial and channel dynamic filters. We observe a significant boost in performance when replacing standard convolution with DDF in classification networks.
arXiv Detail & Related papers (2021-04-29T04:55:33Z)
XSepConv: Extremely Separated Convolution [60.90871656244126]
We propose a novel extremely separated convolutional block (XSepConv) It fuses spatially separable convolutions into depthwise convolution to reduce both the computational cost and parameter size of large kernels. XSepConv is designed to be an efficient alternative to vanilla depthwise convolution with large kernel sizes.
arXiv Detail & Related papers (2020-02-27T11:46:17Z)
DFTpy: An efficient and object-oriented platform for orbital-free DFT simulations [55.41644538483948]
In this work, we present DFTpy, an open source software implementing OFDFT written entirely in Python 3. We showcase the electronic structure of a million-atom system of aluminum metal which was computed on a single CPU. DFTpy is released under the MIT license.
arXiv Detail & Related papers (2020-02-07T19:07:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.