ReGLA: Refining Gated Linear Attention
- URL: http://arxiv.org/abs/2502.01578v2
- Date: Wed, 05 Feb 2025 21:00:16 GMT
- Title: ReGLA: Refining Gated Linear Attention
- Authors: Peng Lu, Ivan Kobyzev, Mehdi Rezagholizadeh, Boxing Chen, Philippe Langlais
- Abstract summary: Linear attention has been designed to reduce the quadratic space-time complexity that is inherent in standard transformers. We developed a feature mapping function to address some crucial issues that previous suggestions overlooked. We also explored the saturation phenomenon of the gating mechanism and augmented it with a refining module.
- Score: 42.97193398172823
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in Large Language Models (LLMs) have set themselves apart with their exceptional performance in complex language modelling tasks. However, these models are also known for their significant computational and storage requirements, primarily due to the quadratic complexity of softmax attention. To mitigate this issue, linear attention has been designed to reduce the quadratic space-time complexity that is inherent in standard transformers. In this work, we embarked on a comprehensive exploration of three key components that substantially impact the performance of the Gated Linear Attention module: feature maps, normalization, and the gating mechanism. We developed a feature mapping function to address some crucial issues that previous proposals overlooked. We then offered further rationale for integrating normalization layers to stabilize the training process. Moreover, we explored the saturation phenomenon of the gating mechanism and augmented it with a refining module. We conducted extensive experiments and showed that our architecture outperforms previous Gated Linear Attention mechanisms across a range of settings, including training from scratch and post-linearization with continual pre-training.
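To make these three components concrete, here is a minimal PyTorch sketch of one gated-linear-attention layer combining a positivity-preserving feature map, an output normalization, and a sigmoid forget gate adjusted by a small refining module. The module names, the elu(x)+1 feature map, and the first-order gate refinement are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedLinearAttention(nn.Module):
    """Recurrent-form gated linear attention (illustrative sketch)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.gate_proj = nn.Linear(d_model, d_model)  # data-dependent forget gate
        self.refine = nn.Linear(d_model, d_model)     # hypothetical refining module
        self.norm = nn.LayerNorm(d_model)             # stabilizing output normalization

    def feature_map(self, x):
        # Placeholder positivity-preserving map; the paper proposes its own.
        return F.elu(x) + 1.0

    def forward(self, x):  # x: (batch, seq_len, d_model)
        q = self.feature_map(self.q_proj(x))
        k = self.feature_map(self.k_proj(x))
        v = self.v_proj(x)
        g = torch.sigmoid(self.gate_proj(x))
        # First-order refinement of the gate's pre-activation:
        # sigmoid(z + delta) ~ g + g * (1 - g) * delta, a bounded,
        # data-dependent correction that keeps the gate in (0, 1).
        g = g + g * (1.0 - g) * torch.tanh(self.refine(x))

        b, n, d = x.shape
        state = x.new_zeros(b, d, d)  # running key-value memory
        outs = []
        for t in range(n):
            # Decay the memory with the refined gate, then write the new pair.
            state = g[:, t].unsqueeze(-1) * state \
                + k[:, t].unsqueeze(-1) * v[:, t].unsqueeze(1)
            outs.append(torch.einsum('bd,bde->be', q[:, t], state))
        return self.norm(torch.stack(outs, dim=1))
```

A chunked or parallel scan would replace the Python loop in practice; the sequential form is kept here only to expose the gated update.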
Related papers
- Offline Model-Based Optimization: Comprehensive Review [61.91350077539443]
Offline optimization is a fundamental challenge in science and engineering, where the goal is to optimize black-box functions using only offline datasets.
Recent advances in model-based optimization have harnessed the generalization capabilities of deep neural networks to develop offline-specific surrogate and generative models.
Despite its growing impact in accelerating scientific discovery, the field lacks a comprehensive review.
arXiv Detail & Related papers (2025-03-21T16:35:02Z)
- In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention [52.159541540613915]
We study how multi-head softmax attention models are trained to perform in-context learning on linear data.
Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution.
arXiv Detail & Related papers (2025-03-17T02:00:49Z)
- HyLiFormer: Hyperbolic Linear Attention for Skeleton-based Human Action Recognition [20.45747733568704]
We propose HyLiFormer, a novel hyperbolic linear attention Transformer tailored for skeleton-based action recognition.
Our approach incorporates a Hyperbolic Transformation with Curvatures (HTC) module to map skeleton data into hyperbolic space and a Hyperbolic Linear Attention (HLA) module for efficient long-range dependency modeling.
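As a rough illustration of the HTC step (not the paper's implementation), the snippet below maps Euclidean skeleton features into the Poincare ball via the exponential map at the origin; the curvature value and the function name are assumptions.

```python
import torch

def expmap0(v: torch.Tensor, c: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """Exponential map at the origin of the Poincare ball with curvature -c.

    Carries Euclidean vectors into hyperbolic space, where a linear
    attention such as HLA can then operate. Illustrative sketch only.
    """
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)
```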
arXiv Detail & Related papers (2025-02-09T12:08:03Z)
- Test-time regression: a unifying framework for designing sequence models with associative memory [24.915262407519876]
We introduce a unifying framework to understand and derive sequence models.
We formalize associative recall as a two-step process, memorization and retrieval, casting it as a regression problem.
Our work bridges sequence modeling with classic regression methods, paving the way for developing more powerful and theoretically principled architectures.
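One way to see the regression framing, sketched here under the paper's general idea rather than its exact formulation: a single gradient step on a least-squares memorization loss over key-value pairs, taken from a zero initialization, recovers the unnormalized linear-attention readout at retrieval time.

```python
import torch

def memorize_then_retrieve(K, V, q, lr: float = 1.0):
    """Test-time regression view of attention (illustrative sketch).

    Memorization: fit a linear map W with W k_i ~ v_i by one gradient
    step on 0.5 * sum_i ||W k_i - v_i||^2, starting from W = 0, which
    gives W = lr * sum_i v_i k_i^T (the linear-attention state).
    Retrieval: evaluate the fitted map at the query.
    """
    W = lr * torch.einsum('nd,ne->de', V, K)  # (d_v, d_k) associative memory
    return W @ q                              # readout at the query
```

Swapping in richer regressors (weighted, recursive, or nonparametric) then yields other members of the design space the paper describes.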
arXiv Detail & Related papers (2025-01-21T18:32:31Z)
- CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up [64.38715211969516]
We introduce a convolution-like local attention strategy termed CLEAR, which limits feature interactions to a local window around each query token. Experiments indicate that, by fine-tuning the attention layer on merely 10K self-generated samples for 10K iterations, we can effectively transfer knowledge from a pre-trained DiT to a student model with linear complexity.
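A minimal sketch of this locality constraint, assuming a 1D token sequence with a banded mask; CLEAR operates on image tokens, so the window shape and size here are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def local_window_attention(q, k, v, window: int = 8):
    """Attention in which each query attends only to keys within a local
    window around its own position (conv-like locality; sketch only)."""
    n = q.shape[-2]
    pos = torch.arange(n, device=q.device)
    mask = (pos[:, None] - pos[None, :]).abs() <= window  # banded locality
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float('-inf'))
    return F.softmax(scores, dim=-1) @ v
```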
arXiv Detail & Related papers (2024-12-20T17:57:09Z)
- ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models [3.7802450241986945]
LayerNorm is a critical component in modern large language models (LLMs) for stabilizing training and ensuring smooth optimization.
This work explores desirable activation functions in normalization-free decoder-only LLMs.
ReLU significantly outperforms GELU in LayerNorm-free models, leading to an 8.2% perplexity improvement.
arXiv Detail & Related papers (2024-10-12T20:26:01Z)
- RecurFormer: Not All Transformer Heads Need Self-Attention [14.331807060659902]
Transformer-based large language models (LLMs) excel in modeling complex language patterns but face significant computational costs during inference.
We propose RecurFormer, a novel architecture that replaces certain attention heads with linear recurrent neural networks.
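A hedged sketch of that substitution: one head's softmax attention swapped for an elementwise gated linear recurrence. The specific cell (leaky integration with a learned per-channel decay) is a stand-in assumption, not necessarily the recurrent cell the paper adopts.

```python
import torch
import torch.nn as nn

class LinearRecurrentHead(nn.Module):
    """Drop-in replacement for an attention head using a linear
    recurrence over the sequence (illustrative stand-in)."""

    def __init__(self, d_head: int):
        super().__init__()
        self.in_proj = nn.Linear(d_head, d_head)
        self.decay_proj = nn.Linear(d_head, d_head)

    def forward(self, x):  # x: (batch, seq_len, d_head)
        u = self.in_proj(x)
        a = torch.sigmoid(self.decay_proj(x))  # per-channel decay in (0, 1)
        h = torch.zeros_like(x[:, 0])
        outs = []
        for t in range(x.shape[1]):
            h = a[:, t] * h + (1.0 - a[:, t]) * u[:, t]  # leaky integration
            outs.append(h)
        return torch.stack(outs, dim=1)
```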
arXiv Detail & Related papers (2024-10-10T15:24:12Z)
- Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
We propose CHELA, which replaces state space models with short-long convolutions and implements linear attention in a divide-and-conquer manner.
Our experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-06-12T12:12:38Z)
- Self-Supervised Learning with Generative Adversarial Networks for Electron Microscopy [0.0]
We show how self-supervised pretraining facilitates efficient fine-tuning for a spectrum of downstream tasks.
We demonstrate the versatility of self-supervised pretraining across various downstream tasks in the context of electron microscopy.
arXiv Detail & Related papers (2024-02-28T12:25:01Z)
- FLatten Transformer: Vision Transformer using Focused Linear Attention [80.61335173752146]
Linear attention offers a much more efficient alternative with its linear complexity.
Current linear attention approaches either suffer from significant performance degradation or introduce additional computation overhead.
We propose a novel Focused Linear Attention module to achieve both high efficiency and expressiveness.
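To ground both claims, here is a hedged sketch of linear attention with a sharpening ("focused") feature map in the spirit of the title: ReLU features raised to a power and rescaled to preserve their norm, followed by the usual kernel trick of aggregating K^T V first so the cost grows linearly in sequence length. The map is an approximation of the paper's module, and all names are illustrative.

```python
import torch

def focused_map(x, p: float = 3.0, eps: float = 1e-6):
    """Sharpen feature directions: elementwise power after ReLU, then
    rescale so the vector norm is preserved (approximate sketch)."""
    x = torch.relu(x)
    xp = x ** p
    scale = x.norm(dim=-1, keepdim=True) / xp.norm(dim=-1, keepdim=True).clamp_min(eps)
    return xp * scale

def linear_attention(q, k, v, eps: float = 1e-6):
    """O(N) attention via the kernel trick: aggregate K^T V first."""
    q, k = focused_map(q), focused_map(k)
    kv = torch.einsum('nd,ne->de', k, v)                     # (d_k, d_v)
    z = torch.einsum('nd,d->n', q, k.sum(0)).clamp_min(eps)  # normalizer
    return torch.einsum('nd,de->ne', q, kv) / z[:, None]
```

Sharper feature directions concentrate each query's weight on fewer keys, which is one route to recovering expressiveness lost in the linear form.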
arXiv Detail & Related papers (2023-08-01T10:37:12Z)