Linear Attention for Efficient Bidirectional Sequence Modeling
- URL: http://arxiv.org/abs/2502.16249v1
- Date: Sat, 22 Feb 2025 14:52:17 GMT
- Title: Linear Attention for Efficient Bidirectional Sequence Modeling
- Authors: Arshia Afzal, Elias Abad Rocamora, Leyla Naz Candogan, Pol Puigdemont, Francesco Tonin, Yongtao Wu, Mahsa Shoaran, Volkan Cevher
- Abstract summary: This work introduces the LION framework, establishing new theoretical foundations for linear transformers in bidirectional sequence modeling. Using LION, we cast three linear transformers to their bidirectional form: LION-LIT, the bidirectional variant corresponding to (Katharopoulos et al., 2020); LION-D, extending RetNet (Sun et al., 2023); and LION-S, a linear transformer with a stable selective mask inspired by the selectivity of SSMs.
- Score: 39.971678682875904
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Transformers with linear attention enable fast and parallel training. Moreover, they can be formulated as Recurrent Neural Networks (RNNs) for efficient linear-time inference. While extensively evaluated in causal sequence modeling, they have yet to be extended to the bidirectional setting. This work introduces the LION framework, establishing new theoretical foundations for linear transformers in bidirectional sequence modeling. LION constructs a bidirectional RNN equivalent to full Linear Attention, extending the benefits of linear transformers (parallel training and efficient inference) to the bidirectional setting. Using LION, we cast three linear transformers to their bidirectional form: LION-LIT, the bidirectional variant corresponding to (Katharopoulos et al., 2020); LION-D, extending RetNet (Sun et al., 2023); and LION-S, a linear transformer with a stable selective mask inspired by the selectivity of SSMs (Dao & Gu, 2024). Replacing the attention block with LION (-LIT, -D, -S) achieves performance on bidirectional tasks that approaches that of Transformers and State-Space Models (SSMs), while delivering significant improvements in training speed. Our implementation is available at http://github.com/LIONS-EPFL/LION.
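To make the RNN equivalence concrete, below is a minimal NumPy sketch of the general idea: full, unmasked linear attention over a sequence can be recovered from one forward and one backward recurrence, each carrying a d x d state. This is an illustrative reconstruction, not the paper's implementation; the feature map, the epsilon in the normalizer, and the absence of the decay/selective masks used by LION-D and LION-S are simplifying assumptions.

```python
import numpy as np

def elu_plus_one(x):
    # A common positive feature map for linear attention (Katharopoulos et al., 2020).
    return np.where(x > 0, x + 1.0, np.exp(x))

def bidirectional_linear_attention(Q, K, V, eps=1e-6):
    """Full (non-causal) linear attention computed with two RNN passes.

    Q, K, V: arrays of shape (T, d). The forward pass accumulates prefix sums
    of outer(k_i, v_i); the backward pass accumulates suffix sums. Their sum,
    minus the doubly counted current step, equals unmasked linear attention.
    """
    T, d = Q.shape
    Qf, Kf = elu_plus_one(Q), elu_plus_one(K)

    S_f, z_f = np.zeros((d, d)), np.zeros(d)      # forward state and normalizer
    S_b, z_b = np.zeros((d, d)), np.zeros(d)      # backward state and normalizer
    states_f, norms_f = [], []
    states_b, norms_b = [None] * T, [None] * T

    for t in range(T):                            # forward recurrence
        S_f = S_f + np.outer(Kf[t], V[t]); z_f = z_f + Kf[t]
        states_f.append(S_f); norms_f.append(z_f)
    for t in reversed(range(T)):                  # backward recurrence
        S_b = S_b + np.outer(Kf[t], V[t]); z_b = z_b + Kf[t]
        states_b[t], norms_b[t] = S_b, z_b

    Y = np.zeros((T, d))
    for t in range(T):
        S = states_f[t] + states_b[t] - np.outer(Kf[t], V[t])   # remove double count
        z = norms_f[t] + norms_b[t] - Kf[t]
        Y[t] = (Qf[t] @ S) / (Qf[t] @ z + eps)
    return Y
```

The result matches the quadratic-time form `(phi(Q) @ phi(K).T) @ V` with row-wise normalization; the LION variants additionally fold decay (LION-D) or selective (LION-S) masks into these two recurrences.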
Related papers
- Liger: Linearizing Large Language Models to Gated Recurrent Structures [9.665802842933209]
Linearization of large language models (LLMs) transforms pretrained standard models into linear recurrent structures.
Liger is a novel approach for converting pretrained LLMs into gated linear recurrent models without adding extra parameters.
arXiv Detail & Related papers (2025-03-03T13:08:00Z)
- LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid [25.71221522518279]
Linear sequence modeling approaches, such as linear attention, provide advantages like linear-time training and constant-memory inference over sequence lengths.
Existing sequence parallelism (SP) methods are either not optimized for the right-product-first feature of linear attention or use a ring-style communication strategy.
We introduce LASP-2, a new SP method to enhance both communication and computation parallelism when training linear attention transformer models.
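For context, the "right-product-first" property mentioned above is the associativity that makes linear attention linear in sequence length: contracting keys and values into a small d x d state first avoids materializing a T x T score matrix, and that compact state is also what a sequence-parallel scheme can communicate between devices instead of full attention scores. A hedged NumPy illustration (feature maps and normalization omitted):

```python
import numpy as np

T, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, T, d))

# Left-product order: materializes a T x T attention matrix, O(T^2 * d) work.
Y_left = (Q @ K.T) @ V

# Right-product order: contracts K and V into a d x d state first, O(T * d^2) work.
Y_right = Q @ (K.T @ V)

assert np.allclose(Y_left, Y_right)   # identical result, very different cost
```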
arXiv Detail & Related papers (2025-02-11T14:01:39Z)
- LION: Linear Group RNN for 3D Object Detection in Point Clouds [85.97541374148508]
We propose a window-based framework built on LInear grOup RNN for accurate 3D object detection, called LION.
We introduce a 3D spatial feature descriptor and integrate it into the linear group RNN operators to enhance their spatial features.
To further address the challenge in highly sparse point clouds, we propose a 3D voxel generation strategy to densify foreground features.
arXiv Detail & Related papers (2024-07-25T17:50:32Z)
- Parallelizing Linear Transformers with the Delta Rule over Sequence Length [49.88826673324244]
This work describes a hardware-efficient algorithm for training linear transformers with the delta rule. We train a 1.3B model for 100B tokens and find that it outperforms recent linear-time baselines.
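As a reminder of what the delta rule computes, here is a hedged sketch of its sequential (recurrent) form; the cited work's contribution is a hardware-efficient, chunk-parallel way to train this recurrence over sequence length, which is not shown, and details such as key normalization and the parameterization of beta are simplified.

```python
import numpy as np

def delta_rule_attention(Q, K, V, beta):
    """Sequential form of delta-rule linear attention.

    Q, K: (T, d_k), V: (T, d_v), beta: (T,) step sizes in (0, 1].
    The state S maps keys to values; instead of purely accumulating
    outer products (as vanilla linear attention does), each step corrects
    S toward the current value -- the classic delta rule.
    """
    T, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_v, d_k))
    out = np.zeros((T, d_v))
    for t in range(T):
        k, v = K[t], V[t]
        pred = S @ k                                  # what the memory currently returns for k
        S = S + beta[t] * np.outer(v - pred, k)       # move the prediction toward v
        out[t] = S @ Q[t]
    return out
```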
arXiv Detail & Related papers (2024-06-10T17:24:42Z)
- Attention as an RNN [66.5420926480473]
We show that attention can be viewed as a special Recurrent Neural Network (RNN) with the ability to compute its many-to-one RNN output efficiently.
We introduce a new efficient method of computing attention's many-to-many RNN output based on the parallel prefix scan algorithm.
We show that Aarens achieve performance comparable to Transformers on 38 datasets spread across four popular sequential problem settings.
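The many-to-one RNN view can be made concrete with the standard online-softmax recurrence, sketched below under the assumption of a single query attending over streamed keys and values; the paper's parallel prefix-scan construction for the many-to-many case is not shown.

```python
import numpy as np

def attention_as_rnn(q, K, V):
    """Softmax attention for one query, evaluated as a recurrence over tokens.

    Keeps a running maximum (for numerical stability), a running numerator and
    a running denominator, so memory stays O(d) no matter how many key/value
    pairs have been consumed -- the "many-to-one RNN" view of attention.
    """
    m = -np.inf                   # running max of attention scores
    num = np.zeros(V.shape[1])    # running exp-weighted sum of values
    den = 0.0                     # running sum of exp(scores)
    for k, v in zip(K, V):
        s = float(q @ k)
        m_new = max(m, s)
        scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
        num = num * scale + np.exp(s - m_new) * v
        den = den * scale + np.exp(s - m_new)
        m = m_new
    return num / den              # equals softmax(q @ K.T) @ V
```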
arXiv Detail & Related papers (2024-05-22T19:45:01Z)
- Gated Linear Attention Transformers with Hardware-Efficient Training [60.670102007737476]
This work describes a hardware-efficient algorithm for linear attention that trades off memory movement against parallelizability.
We then generalize this algorithm to a more expressive variant of linear attention with data-dependent gates.
When used as a replacement for the standard attention layer in Transformers, the resulting gated linear attention Transformer is found to perform competitively.
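For intuition, here is a hedged sketch of the recurrent form of linear attention with a data-dependent forget gate, the building block this entry refers to; the gate parameterization is simplified, and the hardware-efficient chunk-parallel training algorithm that is the paper's actual contribution is not shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_linear_attention(X, Wq, Wk, Wv, Wg):
    """Sequential form of linear attention with data-dependent gates.

    X: (T, d_model); Wq, Wk, Wg: (d_model, d_k); Wv: (d_model, d_v).
    The gate g_t = sigmoid(X[t] @ Wg) decays the running key-value state
    along the key dimension before the new outer product is added --
    a data-dependent analogue of RetNet's fixed exponential decay.
    """
    T = X.shape[0]
    d_k, d_v = Wk.shape[1], Wv.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.zeros((T, d_v))
    for t in range(T):
        q, k, v = X[t] @ Wq, X[t] @ Wk, X[t] @ Wv
        g = sigmoid(X[t] @ Wg)             # per-channel forget gate in (0, 1)
        S = g[:, None] * S + np.outer(k, v)
        out[t] = q @ S
    return out
```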
arXiv Detail & Related papers (2023-12-11T18:51:59Z)
- RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models up to 14 billion parameters, by far the largest dense RNN ever trained, and find that RWKV performs on par with similarly sized Transformers.
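To illustrate the RNN-style inference side, below is a hedged sketch of an RWKV-4-style WKV mixing operator in its recurrent form; the channel-wise decay w and current-token bonus u are assumed learned parameters, other components (token shift, receptance, channel mixing) are omitted, and the real implementation stabilizes the exponentials with a running maximum.

```python
import numpy as np

def rwkv_wkv_recurrent(K, V, w, u):
    """Recurrent (inference-time) evaluation of an RWKV-4-style WKV operator.

    K, V: (T, d) per-token keys and values; w: (d,) positive channel-wise decay;
    u: (d,) bonus applied only to the current token. Numerically naive on purpose.
    """
    T, d = K.shape
    a = np.zeros(d)            # decayed, exp(k)-weighted sum of past values
    b = np.zeros(d)            # matching sum of weights
    out = np.zeros((T, d))
    for t in range(T):
        e_now = np.exp(u + K[t])
        out[t] = (a + e_now * V[t]) / (b + e_now)
        a = np.exp(-w) * a + np.exp(K[t]) * V[t]
        b = np.exp(-w) * b + np.exp(K[t])
    return out
```

During training the same sums can be unrolled in parallel over the sequence, which is what gives RWKV transformer-style parallelizable training while keeping constant-memory recurrent inference.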
arXiv Detail & Related papers (2023-05-22T13:57:41Z)