ReGLA: Refining Gated Linear Attention
- URL: http://arxiv.org/abs/2502.01578v2
- Date: Wed, 05 Feb 2025 21:00:16 GMT
- Title: ReGLA: Refining Gated Linear Attention
- Authors: Peng Lu, Ivan Kobyzev, Mehdi Rezagholizadeh, Boxing Chen, Philippe Langlais
- Abstract summary: Linear attention has been designed to reduce the quadratic space-time complexity that is inherent in standard transformers.
We developed a feature mapping function to address several crucial issues that previous proposals overlooked.
We also explored the saturation phenomenon of the gating mechanism and augmented it with a refining module.
- Score: 42.97193398172823
- Abstract: Recent Large Language Models (LLMs) have set themselves apart with exceptional performance on complex language modelling tasks. However, these models are also known for their significant computational and storage requirements, primarily due to the quadratic complexity of softmax attention. To mitigate this issue, linear attention has been designed to reduce the quadratic space-time complexity inherent in standard transformers. In this work, we embarked on a comprehensive exploration of three key components that substantially impact the performance of the Gated Linear Attention module: feature maps, normalization, and the gating mechanism. We developed a feature mapping function to address several crucial issues that previous proposals overlooked. We then offered further rationale for integrating normalization layers to stabilize the training process. Moreover, we explored the saturation phenomenon of the gating mechanism and augmented it with a refining module. Extensive experiments show that our architecture outperforms previous Gated Linear Attention mechanisms across a wide range of tasks, including training from scratch and post-linearization with continual pre-training.
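The abstract names three components of a gated linear attention layer: a feature map, output normalization, and a forget gate. As a rough illustration only, below is a minimal sketch of one recurrent step of such a layer. The feature map choice (`elu + 1`), the per-feature sigmoid gate, the affine gate "refinement", the RMS-style normalization, and all names and shapes are illustrative assumptions; they are not the paper's actual formulation.

```python
# Minimal sketch of one gated linear attention step (assumptions, not ReGLA's
# exact method): non-negative feature map, diagonal forget gate, RMS-style
# output normalization, and a toy gate refinement against saturation.
import torch
import torch.nn.functional as F


def phi(x: torch.Tensor) -> torch.Tensor:
    """Illustrative non-negative feature map (ELU + 1 is a common choice)."""
    return F.elu(x) + 1.0


def gla_step(state, k_t, v_t, q_t, gate_logit_t, eps: float = 1e-6):
    """One recurrent step of gated linear attention.

    state:        (d_k, d_v) running outer-product memory
    k_t, q_t:     (d_k,) key / query for this timestep
    v_t:          (d_v,) value for this timestep
    gate_logit_t: (d_k,) pre-sigmoid forget-gate logits
    """
    # A plain sigmoid saturates for large |logits|, killing gradients; this
    # hypothetical "refinement" simply bounds the gate away from 0 and 1.
    g_t = torch.sigmoid(gate_logit_t)
    g_t = 0.9 * g_t + 0.05

    # Gated state update: decay old memory per key feature, add new association.
    state = g_t.unsqueeze(-1) * state + torch.outer(phi(k_t), v_t)

    # Read out with the feature-mapped query, then normalize: the unnormalized
    # state can grow over time, so normalizing the output stabilizes training.
    out = phi(q_t) @ state                                # (d_v,)
    out = out / (out.norm() / out.numel() ** 0.5 + eps)   # RMS-style normalization
    return state, out


if __name__ == "__main__":
    d_k, d_v = 16, 32
    state = torch.zeros(d_k, d_v)
    for _ in range(5):  # unroll a few steps over random tokens
        k, q, g = torch.randn(d_k), torch.randn(d_k), torch.randn(d_k)
        v = torch.randn(d_v)
        state, out = gla_step(state, k, v, q, g)
    print(out.shape)  # torch.Size([32])
```

Because the state is carried forward step by step, memory and time stay linear in sequence length, which is the complexity advantage over softmax attention that the abstract describes.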
Related papers
- HyLiFormer: Hyperbolic Linear Attention for Skeleton-based Human Action Recognition [20.45747733568704]
We propose HyLiFormer, a novel hyperbolic linear attention Transformer tailored for skeleton-based action recognition.
Our approach incorporates a Hyperbolic Transformation with Curvatures (HTC) module to map skeleton data into hyperbolic space and a Hyperbolic Linear Attention (HLA) module for efficient long-range dependency modeling.
arXiv Detail & Related papers (2025-02-09T12:08:03Z) - CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up [64.38715211969516]
We introduce a convolution-like local attention strategy termed CLEAR, which limits feature interactions to a local window around each query token (a toy sketch of this windowing idea appears after the list below).
Experiments indicate that, by fine-tuning the attention layer on merely 10K self-generated samples for 10K iterations, we can effectively transfer knowledge from a pre-trained DiT to a student model with linear complexity.
arXiv Detail & Related papers (2024-12-20T17:57:09Z) - RecurFormer: Not All Transformer Heads Need Self-Attention [14.331807060659902]
Transformer-based large language models (LLMs) excel in modeling complex language patterns but face significant computational costs during inference.
We propose RecurFormer, a novel architecture that replaces certain attention heads with linear recurrent neural networks.
arXiv Detail & Related papers (2024-10-10T15:24:12Z) - Localized Gaussians as Self-Attention Weights for Point Clouds Correspondence [92.07601770031236]
We investigate semantically meaningful patterns in the attention heads of an encoder-only Transformer architecture.
We find that fixing the attention weights not only accelerates the training process but also enhances the stability of the optimization.
arXiv Detail & Related papers (2024-09-20T07:41:47Z) - Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
We propose CHELA, which replaces state space models with short-long convolutions and implements linear attention in a divide-and-conquer manner.
Our experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-06-12T12:12:38Z) - Refined Mechanism Design for Approximately Structured Priors via Active Regression [50.71772232237571]
We consider the problem of a revenue-maximizing seller with a large number of items for sale to $n$ strategic bidders.
It is well-known that optimal and even approximately-optimal mechanisms for this setting are notoriously difficult to characterize or compute.
arXiv Detail & Related papers (2023-10-11T20:34:17Z) - FLatten Transformer: Vision Transformer using Focused Linear Attention [80.61335173752146]
Linear attention offers a much more efficient alternative with its linear complexity.
Current linear attention approaches either suffer from significant performance degradation or introduce additional computation overhead.
We propose a novel Focused Linear Attention module to achieve both high efficiency and expressiveness.
arXiv Detail & Related papers (2023-08-01T10:37:12Z) - Infusing Lattice Symmetry Priors in Attention Mechanisms for Sample-Efficient Abstract Geometric Reasoning [45.4605460163454]
The Abstraction and Reasoning Corpus (ARC) has been postulated as an important step towards general AI.
We argue that solving these tasks requires extreme generalization that can only be achieved by proper accounting for core knowledge priors.
We introduce LatFormer, a model that incorporates lattice priors in attention masks.
arXiv Detail & Related papers (2023-06-05T18:32:53Z)
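As referenced in the CLEAR entry above, here is a toy sketch of attention restricted to a local window around each query token. The window size, the symmetric (non-causal) masking, and the function name are assumptions for illustration, not the paper's implementation.

```python
# Toy sketch of conv-like local attention: each query attends only to keys
# within `window` positions of itself. Details are illustrative assumptions.
import torch


def local_window_attention(q, k, v, window: int = 8):
    """Softmax attention restricted to a local window.

    q, k, v: (seq_len, d) tensors; returns (seq_len, d).
    """
    seq_len = q.size(0)
    scores = q @ k.T / q.size(-1) ** 0.5          # (seq_len, seq_len)

    # Mask out pairs farther apart than `window` positions.
    pos = torch.arange(seq_len)
    mask = (pos[:, None] - pos[None, :]).abs() > window
    scores = scores.masked_fill(mask, float("-inf"))

    return torch.softmax(scores, dim=-1) @ v
```

Note that this sketch materializes the full score matrix for clarity; an efficient implementation would compute only the in-window scores, so that cost grows linearly with sequence length.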