Latent Attention for Linear Time Transformers
- URL: http://arxiv.org/abs/2402.17512v2
- Date: Mon, 4 Mar 2024 12:21:52 GMT
- Title: Latent Attention for Linear Time Transformers
- Authors: Rares Dolga, Marius Cobzarenco, David Barber
- Abstract summary: The "Latte Transformer" model can be implemented for both bidirectional and unidirectional tasks.
"Latte Transformer" model can be implemented for both bidirectional and unidirectional tasks.
- Score: 8.640180203900583
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The time complexity of the standard attention mechanism in a transformer
scales quadratically with the length of the sequence. We introduce a method to
reduce this to linear scaling with sequence length, based on defining attention via latent
vectors. The method is readily usable as a drop-in replacement for the standard
attention mechanism. Our "Latte Transformer" model can be implemented for both
bidirectional and unidirectional tasks, with the causal version allowing a
recurrent implementation that is memory- and time-efficient during inference for
language generation tasks. Whilst next token prediction scales linearly with
the sequence length for a standard transformer, a Latte Transformer requires
constant time to compute the next token. The empirical performance of our
method is comparable to standard attention, yet allows scaling to context
windows much larger than practical in standard attention.
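
To make the linear- and constant-time claims concrete, the sketch below routes attention through a small set of L latent states and maintains the causal normalisation as running sums. It is a minimal NumPy illustration of the general idea only, not the paper's exact parameterization; `Q_scores`, `K_scores`, and `V` are assumed inputs for the example.

```python
import numpy as np

def causal_latent_attention(Q_scores, K_scores, V):
    """Causal attention routed through L latent states.

    Q_scores: (T, L) token-to-latent logits on the query side
    K_scores: (T, L) token-to-latent logits on the key side
    V:        (T, D) value vectors
    Returns:  (T, D) outputs in O(T * L * D) total time.
    """
    T, L = Q_scores.shape
    D = V.shape[1]
    # p(latent | token): softmax over the L latents for each query.
    q = np.exp(Q_scores - Q_scores.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)

    k = np.exp(K_scores)      # key-side weights (a real implementation
                              # would stabilise this exponential)
    S = np.zeros((L, D))      # running sum of key weight * value per latent
    Z = np.zeros(L)           # running normaliser per latent
    out = np.empty((T, D))
    for t in range(T):        # recurrent form: O(L * D) per new token
        S += k[t][:, None] * V[t]
        Z += k[t]
        out[t] = q[t] @ (S / Z[:, None])  # mix latent summaries by q-weights
    return out
```

Because the loop carries only the (L, D) state S and the length-L normaliser Z, computing the next token costs O(L * D) regardless of how many tokens precede it, matching the constant-time generation property claimed in the abstract.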
Related papers
- Rough Transformers: Lightweight Continuous-Time Sequence Modelling with Path Signatures [46.58170057001437]
We introduce the Rough Transformer, a variation of the Transformer model that operates on continuous-time representations of input sequences.
We find that, on a variety of time-series-related tasks, Rough Transformers consistently outperform their vanilla attention counterparts.
arXiv Detail & Related papers (2024-05-31T14:00:44Z) - Gated Linear Attention Transformers with Hardware-Efficient Training [60.670102007737476]
- Gated Linear Attention Transformers with Hardware-Efficient Training [60.670102007737476]
This work describes a hardware-efficient algorithm for linear attention that trades off memory movement against parallelizability.
We then generalize this algorithm to a more expressive variant of linear attention with data-dependent gates.
When used as a replacement for the standard attention layer in Transformers, the resulting gated linear attention Transformer is found to perform competitively.
arXiv Detail & Related papers (2023-12-11T18:51:59Z) - iTransformer: Inverted Transformers Are Effective for Time Series Forecasting [62.40166958002558]
- iTransformer: Inverted Transformers Are Effective for Time Series Forecasting [62.40166958002558]
We propose iTransformer, which simply applies the attention and feed-forward network on the inverted dimensions.
The iTransformer model achieves state-of-the-art performance on challenging real-world datasets.
arXiv Detail & Related papers (2023-10-10T13:44:09Z) - The Devil in Linear Transformer [42.232886799710215]
- The Devil in Linear Transformer [42.232886799710215]
Linear transformers aim to reduce the quadratic space-time complexity of vanilla transformers.
They usually suffer from degraded performance on various tasks and corpora.
In this paper, we identify two key issues that lead to such performance gaps.
arXiv Detail & Related papers (2022-10-19T07:15:35Z) - Linearizing Transformer with Key-Value Memory Bank [54.83663647680612]
- Linearizing Transformer with Key-Value Memory Bank [54.83663647680612]
We propose MemSizer, an approach that projects the source sequence into a lower-dimensional representation.
MemSizer not only achieves the same linear time complexity but also enjoys efficient recurrent-style autoregressive generation.
We demonstrate that MemSizer provides an improved tradeoff between efficiency and accuracy over the vanilla transformer.
arXiv Detail & Related papers (2022-03-23T18:10:18Z) - Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
- Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity.
We show that most sparse attention patterns used in existing sparse transformers can inspire the design of such a factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z) - Stable, Fast and Accurate: Kernelized Attention with Relative Positional
Encoding [63.539333383965726]
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
arXiv Detail & Related papers (2021-06-23T17:51:26Z) - Informer: Beyond Efficient Transformer for Long Sequence Time-Series
Forecasting [25.417560221400347]
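
The Toeplitz observation is what enables the speed-up: an n x n Toeplitz matrix embeds in a 2n x 2n circulant matrix, whose matrix-vector product is a circular convolution computable with FFTs in O(n log n). The following is the standard embedding trick as a generic sketch, not the paper's full kernelized-attention algorithm.

```python
import numpy as np

def toeplitz_matvec(c, r, x):
    """Multiply an n x n Toeplitz matrix by a vector in O(n log n).

    c: (n,) first column, r: (n,) first row (with r[0] == c[0]).
    The Toeplitz matrix is embedded in a 2n x 2n circulant, whose
    matrix-vector product is a single FFT-based circular convolution.
    """
    n = len(c)
    v = np.concatenate([c, [0.0], r[:0:-1]])   # circulant's first column
    xp = np.concatenate([x, np.zeros(n)])      # zero-padded input
    y = np.fft.ifft(np.fft.fft(v) * np.fft.fft(xp))
    return y[:n].real
```

The result can be checked against a dense product such as scipy.linalg.toeplitz(c, r) @ x.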
- Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting [25.417560221400347]
Long sequence time-series forecasting (LSTF) demands a high prediction capacity.
Recent studies have shown the potential of Transformer to increase the prediction capacity.
We design an efficient transformer-based model for LSTF, named Informer, with three distinctive characteristics.
arXiv Detail & Related papers (2020-12-14T11:43:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.