TCNCA: Temporal Convolution Network with Chunked Attention for Scalable
Sequence Processing
- URL: http://arxiv.org/abs/2312.05605v1
- Date: Sat, 9 Dec 2023 16:12:25 GMT
- Title: TCNCA: Temporal Convolution Network with Chunked Attention for Scalable
Sequence Processing
- Authors: Aleksandar Terzic, Michael Hersche, Geethan Karunaratne, Luca Benini,
Abu Sebastian, Abbas Rahimi
- Abstract summary: MEGA is a recent transformer-based architecture, which utilizes a linear recurrent operator whose parallel computation, based on the FFT, scales as $O(LlogL)$, with $L$ being the sequence length.
We build upon their approach by replacing the linear recurrence with a special temporal convolutional network which permits larger receptive field size with shallower networks, and reduces the computational complexity to $O(L)$.
We evaluate TCNCA on EnWik8 language modeling, long-range-arena (LRA) sequence classification, as well as a synthetic reasoning benchmark associative recall.
- Score: 52.64837396100988
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: MEGA is a recent transformer-based architecture, which utilizes a linear
recurrent operator whose parallel computation, based on the FFT, scales as
$O(LlogL)$, with $L$ being the sequence length. We build upon their approach by
replacing the linear recurrence with a special temporal convolutional network
which permits larger receptive field size with shallower networks, and reduces
the computational complexity to $O(L)$. The resulting model is called TCNCA, a
Temporal Convolutional Network with Chunked Attention. We evaluate TCNCA on
EnWik8 language modeling, long-range-arena (LRA) sequence classification, as
well as a synthetic reasoning benchmark associative recall. On EnWik8, TCNCA
outperforms MEGA, reaching a lower loss with $1.37\times$/$1.24\times$ faster
forward/backward pass during training. The dilated convolutions used in TCNCA
are consistently and significantly faster operations than the FFT-based
parallelized recurrence in GPUs, making them a scalable candidate for handling
very large sequence lengths: they are up to $7.07\times$/$2.86\times$ faster in
the forward/backward pass for sequences up to 131k. Further on LRA, TCNCA
achieves, on average, $1.28\times$ speed-up during inference with similar
accuracy to what MEGA achieves. On associative recall, we find that even a
simplified version of TCNCA, without excessive multiplicative and additive
interactions, remains superior or competitive to MEGA on a range of sequence
lengths and vocabulary sizes.
Related papers
- Tensor Train Multiplication [0.0]
The computational complexity and memory requirements of the TTM algorithm scale as $chi3$ and $chi2$, respectively.
This represents a significant improvement compared with the conventional approach.
The TTM algorithm paves the way towards GPU accelerated tensor network simulations of computational fluid dynamics problems with large bond dimensions.
arXiv Detail & Related papers (2024-10-10T12:36:49Z) - PRF: Parallel Resonate and Fire Neuron for Long Sequence Learning in Spiking Neural Networks [6.545474731089018]
We address the efficiency and performance challenges of long sequence learning in Spiking Neural Networks (SNNs) simultaneously.
First, we propose a decoupled reset method for parallel spiking neuron training, reducing the typical Leaky Integrate-and-Fire (LIF) model's training time from $O(L2)$ to $O(Llog L)$.
Secondly, to capture long-range dependencies, we propose a Parallel Resonate and Fire (PRF) neuron, which leverages an oscillating membrane potential driven by a resonate mechanism from a differentiable reset function in the complex domain
arXiv Detail & Related papers (2024-10-04T15:51:56Z) - Were RNNs All We Needed? [53.393497486332]
We revisit traditional recurrent neural networks (RNNs) from over a decade ago.
We show that by removing their hidden state dependencies from their input, forget, and update gates, LSTMs and GRUs no longer need to BPTT and can be efficiently trained in parallel.
arXiv Detail & Related papers (2024-10-02T03:06:49Z) - Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training [78.93900796545523]
Mini-Sequence Transformer (MsT) is a methodology for highly efficient and accurate LLM training with extremely long sequences.
MsT partitions input sequences and iteratively processes mini-sequences to reduce intermediate memory usage.
integrated with the huggingface library, MsT successfully extends the maximum context length of Qwen, Mistral, and Gemma-2 by 12-24x.
arXiv Detail & Related papers (2024-07-22T01:52:30Z) - A Training-free Sub-quadratic Cost Transformer Model Serving Framework With Hierarchically Pruned Attention [43.211427581302715]
We propose Hierarchically Pruned Attention (HiP) to increase context length in large language models.
HiP reduces the time complexity of the attention mechanism to $O(T log T)$ and the space complexity to $O(T)$, where $T$ is the sequence length.
We show that HiP significantly reduces both prefill and decoding latencies, as well as memory usage, while maintaining high-quality generation with minimal degradation.
arXiv Detail & Related papers (2024-06-14T08:32:45Z) - HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM
Inference [68.59839755875252]
HiRE comprises of two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one billion parameter model, HiRE applied to both the softmax as well as feedforward layers, achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47times$ on a single TPUv5e device.
arXiv Detail & Related papers (2024-02-14T18:04:36Z) - DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training [82.06732962485754]
FlashAttention effectively reduces the quadratic peak memory usage to linear in training transformer-based large language models (LLMs) on a single GPU.
We introduce DISTFLASHATTN, a memory-efficient attention mechanism optimized for long-context LLMs training.
It achieves 1.67x and 1.26 - 1.88x speedup compared to recent Ring Attention and DeepSpeed-Ulysses.
arXiv Detail & Related papers (2023-10-05T03:47:57Z) - DeepPCR: Parallelizing Sequential Operations in Neural Networks [4.241834259165193]
We introduce DeepPCR, a novel algorithm which parallelizes typically sequential operations in order to speed up inference and training of neural networks.
DeepPCR is based on interpreting a sequence of $L$ steps as the solution of a specific system of equations, which we recover using the Parallel Cyclic Reduction algorithm.
To verify the theoretical lower complexity of the algorithm, and to identify regimes for speedup, we test the effectiveness of DeepPCR in parallelizing the forward and backward pass in multi-layer perceptrons.
arXiv Detail & Related papers (2023-09-28T10:15:30Z) - Retentive Network: A Successor to Transformer for Large Language Models [91.6652200825638]
We propose Retentive Network (RetNet) as a foundation architecture for large language models.
We theoretically derive the connection between recurrence and attention.
Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference.
arXiv Detail & Related papers (2023-07-17T16:40:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.