Simple Hardware-Efficient Long Convolutions for Sequence Modeling
- URL: http://arxiv.org/abs/2302.06646v1
- Date: Mon, 13 Feb 2023 19:19:23 GMT
- Title: Simple Hardware-Efficient Long Convolutions for Sequence Modeling
- Authors: Daniel Y. Fu, Elliot L. Epstein, Eric Nguyen, Armin W. Thomas, Michael
Zhang, Tri Dao, Atri Rudra, Christopher Ré
- Abstract summary: State space models (SSMs) have high performance on long sequence modeling.
We study whether a simple alternative can match SSMs in performance and efficiency.
We develop FlashButterfly, an IO-aware algorithm to improve the runtime performance of long convolutions.
- Score: 18.3719016967593
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State space models (SSMs) have high performance on long sequence modeling but
require sophisticated initialization techniques and specialized implementations
for high quality and runtime performance. We study whether a simple alternative
can match SSMs in performance and efficiency: directly learning long
convolutions over the sequence. We find that a key requirement to achieving
high performance is keeping the convolution kernels smooth. We find that simple
interventions--such as squashing the kernel weights--result in smooth kernels
and recover SSM performance on a range of tasks including the long range arena,
image classification, language modeling, and brain data modeling. Next, we
develop FlashButterfly, an IO-aware algorithm to improve the runtime
performance of long convolutions. FlashButterfly appeals to classic Butterfly
decompositions of the convolution to reduce GPU memory IO and increase FLOP
utilization. FlashButterfly speeds up convolutions by 2.2$\times$, and allows
us to train on Path256, a challenging task with sequence length 64K, where we
set state-of-the-art by 29.1 points while training 7.2$\times$ faster than
prior work. Lastly, we introduce an extension to FlashButterfly that learns the
coefficients of the Butterfly decomposition, increasing expressivity without
increasing runtime. Using this extension, we outperform a Transformer on
WikiText103 by 0.2 PPL with 30% fewer parameters.
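To make the core intervention concrete, the following is a minimal PyTorch sketch of a depthwise long-convolution layer: the kernel is learned directly over the full sequence length, applied via FFT in O(L log L) time, and kept smooth by squashing its weights with a soft-threshold operator. The module name, the exact squashing operator, and the initialization scale are illustrative assumptions, not the paper's reference implementation.
```python
import torch
import torch.nn as nn


class LongConv(nn.Module):
    """Depthwise long convolution with a directly learned, squashed kernel.

    A minimal sketch of the idea described above; the squashing operator and
    hyperparameters are assumptions, not the paper's reference code.
    """

    def __init__(self, d_model: int, seq_len: int, squash_lambda: float = 1e-3):
        super().__init__()
        # One kernel per channel, as long as the sequence itself.
        self.kernel = nn.Parameter(0.002 * torch.randn(d_model, seq_len))
        self.squash_lambda = squash_lambda

    def squashed_kernel(self) -> torch.Tensor:
        # Soft-threshold small weights toward zero, which keeps the kernel smooth.
        k = self.kernel
        return torch.sign(k) * torch.relu(k.abs() - self.squash_lambda)

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # u: (batch, seq_len, d_model)
        L = u.shape[1]
        k = self.squashed_kernel()[:, :L]            # (d_model, L)
        n = 2 * L                                    # zero-pad to avoid circular wrap-around
        U = torch.fft.rfft(u.transpose(1, 2), n=n)   # (batch, d_model, n//2 + 1)
        K = torch.fft.rfft(k, n=n)                   # (d_model, n//2 + 1)
        y = torch.fft.irfft(U * K, n=n)[..., :L]     # (batch, d_model, L)
        return y.transpose(1, 2)                     # (batch, seq_len, d_model)


# Example: shapes are preserved, so the layer drops into a residual block.
u = torch.randn(2, 1024, 64)
y = LongConv(d_model=64, seq_len=1024)(u)
assert y.shape == u.shape
```
Because the convolution is applied in the frequency domain, its cost is O(L log L) regardless of kernel length, which is what makes kernels as long as the sequence practical.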
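FlashButterfly builds on the classic Butterfly decomposition of the FFT: an N-point DFT factors into log2(N) block-diagonal butterfly stages plus a bit-reversal permutation. The sketch below spells that decomposition out in plain PyTorch as a reference point; it is not the paper's fused, IO-aware GPU kernel, and the learned extension mentioned above would replace the fixed twiddle factors with trainable coefficients.
```python
import torch


def bit_reversal_permutation(n: int) -> torch.Tensor:
    """Index permutation that reorders inputs for an in-order radix-2 FFT."""
    bits = n.bit_length() - 1
    rev = [int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)]
    return torch.tensor(rev, dtype=torch.long)


def butterfly_fft(x: torch.Tensor) -> torch.Tensor:
    """FFT computed as log2(n) butterfly stages (illustrative, not IO-optimized)."""
    n = x.shape[-1]
    assert n > 0 and (n & (n - 1)) == 0, "n must be a power of two"
    x = x[..., bit_reversal_permutation(n)]
    m = 1
    while m < n:
        # Twiddle factors for blocks of size 2m; a learned variant would make these trainable.
        theta = -2.0 * torch.pi * torch.arange(m, dtype=torch.float64) / (2 * m)
        w = torch.polar(torch.ones_like(theta), theta).to(x.dtype)
        blocks = x.reshape(*x.shape[:-1], n // (2 * m), 2, m)
        even, odd = blocks[..., 0, :], blocks[..., 1, :] * w
        x = torch.cat([even + odd, even - odd], dim=-1).reshape(*x.shape[:-1], n)
        m *= 2
    return x


# Sanity check against the library FFT.
x = torch.randn(4, 1024, dtype=torch.complex128)
assert torch.allclose(butterfly_fft(x), torch.fft.fft(x), atol=1e-6)
```
Each stage touches the data once with a fixed access pattern, which is why the decomposition lends itself to an IO-aware, blocked GPU implementation.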
Related papers
- Simultaneous Computation and Memory Efficient Zeroth-Order Optimizer for Fine-Tuning Large Language Models [33.911521719528686]
Fine-tuning is powerful for adapting large language models to downstream tasks, but it often results in huge memory usage.
A promising approach is to use Zeroth-Order (ZO) gradient estimates in place of First-Order (FO) gradients.
We introduce a novel layer-wise sparse, computation- and memory-efficient ZO optimizer, named LeZO.
arXiv Detail & Related papers (2024-10-13T12:47:37Z)
- S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs [7.816840847892339]
Speculative decoding (SD) has attracted a significant amount of research attention due to the substantial speedup it can achieve for LLM inference.
We propose Skippy Simultaneous Speculative Decoding (or S3D), a cost-effective self-speculative SD method based on simultaneous multi-token decoding and mid-layer skipping.
Our method has achieved one of the top performance-memory ratios while requiring minimal architecture changes and training data.
arXiv Detail & Related papers (2024-05-30T17:54:35Z)
- LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
The self-attention mechanism's computational cost limits its practicality for long sequences.
We propose a new method called LongVQ to compress the global abstraction as a length-fixed codebook.
LongVQ effectively maintains dynamic global and local patterns, which helps to compensate for the lack of long-range dependency modeling.
arXiv Detail & Related papers (2024-04-17T08:26:34Z)
- LMUFormer: Low Complexity Yet Powerful Spiking Model With Legendre Memory Units [5.830814457423021]
Transformer models have demonstrated high accuracy in numerous applications but have high complexity and lack sequential processing capability.
We show how architectural modifications to a recurrent model can help push its performance toward Transformer models.
We present a spiking version of this architecture, which introduces the benefit of states within the patch embedding and channel mixer modules.
arXiv Detail & Related papers (2024-01-20T01:10:18Z)
- A-SDM: Accelerating Stable Diffusion through Redundancy Removal and Performance Optimization [54.113083217869516]
In this work, we first identify the computationally redundant parts of the network.
We then prune the redundant blocks of the model while maintaining network performance.
Thirdly, we propose a global-regional interactive (GRI) attention to speed up the computationally intensive attention part.
arXiv Detail & Related papers (2023-12-24T15:37:47Z)
- DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training [82.06732962485754]
FlashAttention effectively reduces the quadratic peak memory usage to linear in training transformer-based large language models (LLMs) on a single GPU.
We introduce DISTFLASHATTN, a memory-efficient attention mechanism optimized for long-context LLMs training.
It achieves 1.67x and 1.26-1.88x speedups compared to recent Ring Attention and DeepSpeed-Ulysses, respectively.
arXiv Detail & Related papers (2023-10-05T03:47:57Z)
- TransNormerLLM: A Faster and Better Large Language Model with Improved TransNormer [34.790081960470964]
We present TransNormerLLM, the first linear attention-based Large Language Model (LLM).
We make advanced modifications that include positional embedding, linear attention acceleration, gating mechanisms, tensor normalization, and inference acceleration and stabilization.
We validate our model design through a series of ablations and train models with sizes of 385M, 1B, and 7B on our self-collected corpus.
arXiv Detail & Related papers (2023-07-27T16:45:33Z)
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning [11.508362885430133]
FlashAttention exploits the asymmetric GPU memory hierarchy to bring significant memory savings and runtime speedup.
However, it is still not nearly as fast as optimized matrix-multiply (GEMM) operations, reaching only 25-40% of the theoretical maximum FLOPs/s.
We propose FlashAttention-2, with better work partitioning to address these issues.
arXiv Detail & Related papers (2023-07-17T17:50:36Z)
- Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
arXiv Detail & Related papers (2022-09-20T09:28:26Z)
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness [80.3586155104237]
FlashAttention is an IO-aware exact attention algorithm for Transformers.
It reduces the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM.
FlashAttention and block-sparse FlashAttention enable longer context in Transformers.
arXiv Detail & Related papers (2022-05-27T17:53:09Z)
- Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
arXiv Detail & Related papers (2020-10-14T09:03:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.