Latte: Latent Attention for Linear Time Transformers
- URL: http://arxiv.org/abs/2402.17512v4
- Date: Fri, 04 Oct 2024 14:19:27 GMT
- Title: Latte: Latent Attention for Linear Time Transformers
- Authors: Rares Dolga, Lucas Maystre, Marius Cobzarenco, David Barber
- Abstract summary: We propose a probabilistic framework for attention.
Our method can be seamlessly integrated as a drop-in replacement for the standard attention mechanism.
The resulting ``Latte Transformer'' achieves performance comparable to standard attention and other state-of-the-art models.
- Score: 11.524573224123905
- Abstract: The time complexity of the standard attention mechanism in transformers scales quadratically with sequence length. We propose a probabilistic framework for attention, enabling us to derive a novel low-rank linear re-parameterisation of both bidirectional and causal cases, based on defining a latent variable model. Our method can be seamlessly integrated as a drop-in replacement for the standard attention mechanism. Additionally, this framework provides a natural extension for combining local standard attention with our global linear attention. This approach allows us to extend the context length of existing large pre-trained models with only a few additional training steps. The resulting ``Latte Transformer'' achieves performance comparable to standard attention and other state-of-the-art models, while maintaining linear time and memory complexity, along with constant-time next-token prediction during inference.
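As a rough illustration of the latent-variable view described in the abstract, the sketch below implements a causal linear attention in which the T x T attention matrix is factored through a small number of latent states and computed with running sums. It is a minimal NumPy sketch under assumed shapes and parameter names, not the paper's exact parameterisation.

```python
import numpy as np

def causal_latent_attention(X, Wq, Wk, Wv):
    """Causal linear attention factored through L latent states.

    Illustrative sketch only: attention from step t to step s is
    sum_l p(l | q_t) * softmax_over_past_s(k_s[l]), so the full
    T x T attention matrix is never formed.
    """
    T, D = X.shape
    Q = X @ Wq                                  # (T, L) query-to-latent scores
    K = X @ Wk                                  # (T, L) key-to-latent scores
    V = X @ Wv                                  # (T, D) values

    Qp = np.exp(Q - Q.max(axis=-1, keepdims=True))
    Qp /= Qp.sum(axis=-1, keepdims=True)        # p(latent | query_t)
    Kexp = np.exp(K - K.max())                  # unnormalised key weights

    L = Q.shape[1]
    num = np.zeros((L, D))                      # running sum of Kexp_s * V_s per latent
    den = np.zeros(L)                           # running sum of Kexp_s per latent
    out = np.zeros((T, D))
    for t in range(T):                          # O(T) total; O(1) per new token
        num += Kexp[t][:, None] * V[t]
        den += Kexp[t]
        out[t] = Qp[t] @ (num / den[:, None])   # mix per-latent value summaries
    return out

# Toy usage with assumed dimensions
rng = np.random.default_rng(0)
T, D, L = 6, 8, 4
X = rng.standard_normal((T, D))
Y = causal_latent_attention(X,
                            rng.standard_normal((D, L)),
                            rng.standard_normal((D, L)),
                            rng.standard_normal((D, D)))
print(Y.shape)  # (6, 8)
```

Because each step only updates the per-latent running sums, next-token prediction costs the same regardless of how long the context already is, which is the constant-time inference property claimed in the abstract.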
Related papers
- Timer-XL: Long-Context Transformers for Unified Time Series Forecasting [67.83502953961505]
We present Timer-XL, a generative Transformer for unified time series forecasting.
Timer-XL achieves state-of-the-art performance across challenging forecasting benchmarks through a unified approach.
arXiv Detail & Related papers (2024-10-07T07:27:39Z)
- PRformer: Pyramidal Recurrent Transformer for Multivariate Time Series Forecasting [82.03373838627606]
The self-attention mechanism in the Transformer architecture requires positional embeddings to encode temporal order in time series prediction.
We argue that this reliance on positional embeddings restricts the Transformer's ability to effectively represent temporal sequences.
We present a model integrating PRE with a standard Transformer encoder, demonstrating state-of-the-art performance on various real-world datasets.
arXiv Detail & Related papers (2024-08-20T01:56:07Z)
- Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
We propose CHELA, which replaces state space models with short-long convolutions and implements linear attention in a divide-and-conquer manner.
Our experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-06-12T12:12:38Z)
- Rough Transformers: Lightweight Continuous-Time Sequence Modelling with Path Signatures [46.58170057001437]
We introduce the Rough Transformer, a variation of the Transformer model that operates on continuous-time representations of input sequences.
We find that, on a variety of time-series-related tasks, Rough Transformers consistently outperform their vanilla attention counterparts.
arXiv Detail & Related papers (2024-05-31T14:00:44Z)
- Linear Log-Normal Attention with Unbiased Concentration [3.034257650900382]
We study the self-attention mechanism by analyzing the distribution of the attention matrix and its concentration ability.
We propose instruments to measure these quantities and introduce a novel self-attention mechanism, Linear Log-Normal Attention.
Our experimental results on popular natural language benchmarks reveal that our proposed Linear Log-Normal Attention outperforms other linearized attention alternatives.
arXiv Detail & Related papers (2023-11-22T17:30:41Z)
- TACTiS-2: Better, Faster, Simpler Attentional Copulas for Multivariate Time Series [57.4208255711412]
Building on copula theory, we propose a simplified objective for the recently introduced transformer-based attentional copulas (TACTiS).
We show that the resulting model has significantly better training dynamics and achieves state-of-the-art performance across diverse real-world forecasting tasks.
arXiv Detail & Related papers (2023-10-02T16:45:19Z)
- Towards Long-Term Time-Series Forecasting: Feature, Pattern, and Distribution [57.71199089609161]
Long-term time-series forecasting (LTTF) has become a pressing demand in many applications, such as wind power supply planning.
Transformer models have been adopted to deliver high prediction capacity thanks to the self-attention mechanism, despite its high computational cost.
We propose an efficient Transformer-based model, named Conformer, which differentiates itself from existing methods for LTTF in three aspects.
arXiv Detail & Related papers (2023-01-05T13:59:29Z)
- The Devil in Linear Transformer [42.232886799710215]
Linear transformers aim to reduce the quadratic space-time complexity of vanilla transformers.
They usually suffer from degraded performance on various tasks and corpora.
In this paper, we identify two key issues that lead to such performance gaps.
arXiv Detail & Related papers (2022-10-19T07:15:35Z)
- Mega: Moving Average Equipped Gated Attention [150.3124713793503]
Mega is a simple, theoretically grounded, single-head gated attention mechanism equipped with (exponential) moving average.
We show that Mega achieves significant improvements over other sequence models, including variants of Transformers and recent state space models.
arXiv Detail & Related papers (2022-09-21T20:52:17Z)
- Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers [42.93754828584075]
We present a new Transformer architecture, Performer, based on Fast Attention Via Orthogonal Random features (FAVOR).
Our mechanism scales linearly rather than quadratically in the number of tokens in the sequence, is characterized by sub-quadratic space complexity and does not incorporate any sparsity pattern priors.
It provides strong theoretical guarantees: unbiased estimation of the attention matrix and uniform convergence (a minimal random-feature sketch follows this list).
arXiv Detail & Related papers (2020-06-05T17:09:16Z)
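As a companion to the Performer entry above, here is a minimal sketch of the positive random-feature idea behind a FAVOR-style linear attention. It uses plain Gaussian features and assumed names and shapes; the actual FAVOR mechanism additionally draws orthogonal random features and applies further stabilisation.

```python
import numpy as np

def favor_features(X, W):
    """Positive random features approximating the softmax kernel.

    phi(x) = exp(W x - ||x||^2 / 2) / sqrt(m), so that
    E[phi(q) . phi(k)] = exp(q . k) (an unbiased estimate).
    """
    m = W.shape[0]
    proj = X @ W.T                                      # (T, m)
    norm = 0.5 * np.sum(X * X, axis=-1, keepdims=True)  # (T, 1)
    return np.exp(proj - norm) / np.sqrt(m)

def linear_softmax_attention(Q, K, V, m=1024, seed=0):
    """Bidirectional attention whose cost is linear in sequence length."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((m, Q.shape[-1]))
    Qf, Kf = favor_features(Q, W), favor_features(K, W)  # (T, m)
    KV = Kf.T @ V                                        # (m, d_v), computed once
    denom = Qf @ Kf.sum(axis=0)                          # (T,) normalisers
    return (Qf @ KV) / denom[:, None]

# Toy comparison against exact softmax attention
rng = np.random.default_rng(1)
T, d = 8, 4
Q, K, V = (0.3 * rng.standard_normal((T, d)) for _ in range(3))
A = np.exp(Q @ K.T)
exact = (A / A.sum(axis=-1, keepdims=True)) @ V
approx = linear_softmax_attention(Q, K, V, m=4096)
print(np.abs(exact - approx).max())  # shrinks as m grows
```

The value summary Kf.T @ V has a fixed size independent of sequence length, which is what turns the quadratic attention cost into a linear one.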