Related papers: TCNCA: Temporal Convolution Network with Chunked Attention for Scalable Sequence Processing

TCNCA: Temporal Convolution Network with Chunked Attention for Scalable Sequence Processing

URL: http://arxiv.org/abs/2312.05605v1
Date: Sat, 9 Dec 2023 16:12:25 GMT
Title: TCNCA: Temporal Convolution Network with Chunked Attention for Scalable Sequence Processing
Authors: Aleksandar Terzic, Michael Hersche, Geethan Karunaratne, Luca Benini, Abu Sebastian, Abbas Rahimi
Abstract summary: MEGA is a recent transformer-based architecture, which utilizes a linear recurrent operator whose parallel computation, based on the FFT, scales as $O(LlogL)$, with $L$ being the sequence length. We build upon their approach by replacing the linear recurrence with a special temporal convolutional network which permits larger receptive field size with shallower networks, and reduces the computational complexity to $O(L)$. We evaluate TCNCA on EnWik8 language modeling, long-range-arena (LRA) sequence classification, as well as a synthetic reasoning benchmark associative recall.
Score: 52.64837396100988
License: http://creativecommons.org/licenses/by/4.0/
Abstract: MEGA is a recent transformer-based architecture, which utilizes a linear recurrent operator whose parallel computation, based on the FFT, scales as $O(LlogL)$, with $L$ being the sequence length. We build upon their approach by replacing the linear recurrence with a special temporal convolutional network which permits larger receptive field size with shallower networks, and reduces the computational complexity to $O(L)$. The resulting model is called TCNCA, a Temporal Convolutional Network with Chunked Attention. We evaluate TCNCA on EnWik8 language modeling, long-range-arena (LRA) sequence classification, as well as a synthetic reasoning benchmark associative recall. On EnWik8, TCNCA outperforms MEGA, reaching a lower loss with $1.37\times$/$1.24\times$ faster forward/backward pass during training. The dilated convolutions used in TCNCA are consistently and significantly faster operations than the FFT-based parallelized recurrence in GPUs, making them a scalable candidate for handling very large sequence lengths: they are up to $7.07\times$/$2.86\times$ faster in the forward/backward pass for sequences up to 131k. Further on LRA, TCNCA achieves, on average, $1.28\times$ speed-up during inference with similar accuracy to what MEGA achieves. On associative recall, we find that even a simplified version of TCNCA, without excessive multiplicative and additive interactions, remains superior or competitive to MEGA on a range of sequence lengths and vocabulary sizes.

Related papers

Tensor Decomposition Networks for Fast Machine Learning Interatomic Potential Computations [63.945006006152035]
tensor decomposition networks (TDNs) achieve competitive performance with dramatic speedup in computations.<n>We evaluate TDNs on PubChemQCR, a newly curated molecular relaxation dataset containing 105 million DFT-calculated snapshots.
arXiv Detail & Related papers (2025-07-01T18:46:27Z)
Efficient Parallel Training Methods for Spiking Neural Networks with Constant Time Complexity [63.56009745101597]
Spiking Neural Networks (SNNs) often suffer from high time complexity $O(T)$ due to the sequential processing of $T$ spikes.<n>We propose a novel Fixed-point Parallel Training (FPT) method to accelerate SNN training without modifying the network architecture.
arXiv Detail & Related papers (2025-06-10T13:27:27Z)
Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts [5.585952216289788]
Transformer models struggle with long-context inference due to their quadratic time and linear memory complexity.<n>Recurrent Memory (RMTs) offer a solution by reducing the cost linear time and constant memory usage.<n>But their memory update mechanism leads to sequential execution causing a performance bottleneck.<n>We introduce Diagonal, a scheduling scheme that unlocks parallelism across segments in RMTs while preserving exact recurrence.
arXiv Detail & Related papers (2025-06-05T16:43:48Z)
Scaling Up Liquid-Resistance Liquid-Capacitance Networks for Efficient Sequence Modeling [53.925413758281096]
We present LrcSSM, a $textitnonlinear$ recurrent model that processes long sequences as fast as today's linear state-space layers.<n>LrcSSM offers a formal gradient-stability guarantee that other input-varying systems such as Liquid-S4 and Mamba do not provide.<n>We show that on a series of long-range forecasting tasks, LrcSSM outperforms LRU, S5 and Mamba.
arXiv Detail & Related papers (2025-05-27T20:02:59Z)
Tensor Train Multiplication [0.0]
The computational complexity and memory requirements of the TTM algorithm scale as $chi3$ and $chi2$, respectively. This represents a significant improvement compared with the conventional approach. The TTM algorithm paves the way towards GPU accelerated tensor network simulations of computational fluid dynamics problems with large bond dimensions.
arXiv Detail & Related papers (2024-10-10T12:36:49Z)
PRF: Parallel Resonate and Fire Neuron for Long Sequence Learning in Spiking Neural Networks [6.545474731089018]
We address the efficiency and performance challenges of long sequence learning in Spiking Neural Networks (SNNs) simultaneously. First, we propose a decoupled reset method for parallel spiking neuron training, reducing the typical Leaky Integrate-and-Fire (LIF) model's training time from $O(L2)$ to $O(Llog L)$. Secondly, to capture long-range dependencies, we propose a Parallel Resonate and Fire (PRF) neuron, which leverages an oscillating membrane potential driven by a resonate mechanism from a differentiable reset function in the complex domain
arXiv Detail & Related papers (2024-10-04T15:51:56Z)
Were RNNs All We Needed? [53.393497486332]
We revisit traditional recurrent neural networks (RNNs) from over a decade ago. We show that by removing their hidden state dependencies from their input, forget, and update gates, LSTMs and GRUs no longer need to BPTT and can be efficiently trained in parallel.
arXiv Detail & Related papers (2024-10-02T03:06:49Z)
Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training [78.93900796545523]
Mini-Sequence Transformer (MsT) is a methodology for highly efficient and accurate LLM training with extremely long sequences. MsT partitions input sequences and iteratively processes mini-sequences to reduce intermediate memory usage. integrated with the huggingface library, MsT successfully extends the maximum context length of Qwen, Mistral, and Gemma-2 by 12-24x.
arXiv Detail & Related papers (2024-07-22T01:52:30Z)
A Training-free Sub-quadratic Cost Transformer Model Serving Framework With Hierarchically Pruned Attention [43.211427581302715]
We propose Hierarchically Pruned Attention (HiP) to increase context length in large language models. HiP reduces the time complexity of the attention mechanism to $O(T log T)$ and the space complexity to $O(T)$, where $T$ is the sequence length. We show that HiP significantly reduces both prefill and decoding latencies, as well as memory usage, while maintaining high-quality generation with minimal degradation.
arXiv Detail & Related papers (2024-06-14T08:32:45Z)
HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises of two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator. We demonstrate that on a one billion parameter model, HiRE applied to both the softmax as well as feedforward layers, achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47times$ on a single TPUv5e device.
arXiv Detail & Related papers (2024-02-14T18:04:36Z)
DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training [82.06732962485754]
FlashAttention effectively reduces the quadratic peak memory usage to linear in training transformer-based large language models (LLMs) on a single GPU. We introduce DISTFLASHATTN, a memory-efficient attention mechanism optimized for long-context LLMs training. It achieves 1.67x and 1.26 - 1.88x speedup compared to recent Ring Attention and DeepSpeed-Ulysses.
arXiv Detail & Related papers (2023-10-05T03:47:57Z)
DeepPCR: Parallelizing Sequential Operations in Neural Networks [4.241834259165193]
We introduce DeepPCR, a novel algorithm which parallelizes typically sequential operations in order to speed up inference and training of neural networks. DeepPCR is based on interpreting a sequence of $L$ steps as the solution of a specific system of equations, which we recover using the Parallel Cyclic Reduction algorithm. To verify the theoretical lower complexity of the algorithm, and to identify regimes for speedup, we test the effectiveness of DeepPCR in parallelizing the forward and backward pass in multi-layer perceptrons.
arXiv Detail & Related papers (2023-09-28T10:15:30Z)
Retentive Network: A Successor to Transformer for Large Language Models [91.6652200825638]
We propose Retentive Network (RetNet) as a foundation architecture for large language models. We theoretically derive the connection between recurrence and attention. Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference.
arXiv Detail & Related papers (2023-07-17T16:40:01Z)
Scalable Quantum Error Correction for Surface Codes using FPGA [67.74017895815125]
A fault-tolerant quantum computer must decode and correct errors faster than they appear. We report a distributed version of the Union-Find decoder that exploits parallel computing resources for further speedup. The implementation employs a scalable architecture called Helios that organizes parallel computing resources into a hybrid tree-grid structure.
arXiv Detail & Related papers (2023-01-20T04:23:00Z)

This list is automatically generated from the titles and abstracts of the papers in this site.