Gated Delta Networks: Improving Mamba2 with Delta Rule
- URL: http://arxiv.org/abs/2412.06464v1
- Date: Mon, 09 Dec 2024 13:09:04 GMT
- Title: Gated Delta Networks: Improving Mamba2 with Delta Rule
- Authors: Songlin Yang, Jan Kautz, Ali Hatamizadeh
- Abstract summary: Gated DeltaNet consistently surpasses existing models like Mamba2 and DeltaNet across multiple benchmarks. We develop hybrid architectures that combine Gated DeltaNet layers with sliding window attention or Mamba2 layers, achieving both improved training efficiency and superior task performance.
- Score: 64.58149707073915
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Linear Transformers have gained attention as efficient alternatives to standard Transformers, but their performance in retrieval and long-context tasks has been limited. To address these limitations, recent work has explored two distinct mechanisms: gating for adaptive memory control and the delta update rule for precise memory modifications. We observe that these mechanisms are complementary: gating enables rapid memory erasure while the delta rule facilitates targeted updates. Building on this insight, we introduce the gated delta rule and develop a parallel training algorithm optimized for modern hardware. Our proposed architecture, Gated DeltaNet, consistently surpasses existing models like Mamba2 and DeltaNet across multiple benchmarks, including language modeling, common-sense reasoning, in-context retrieval, length extrapolation, and long-context understanding. We further enhance performance by developing hybrid architectures that combine Gated DeltaNet layers with sliding window attention or Mamba2 layers, achieving both improved training efficiency and superior task performance.
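The abstract describes the gated delta rule as the combination of a per-token gate (rapid memory erasure) with a delta-rule write (targeted replacement of the value stored under a key). The sketch below is a minimal sequential reference in PyTorch, not the paper's hardware-optimized parallel training algorithm; the exact state-update form, tensor names, and shapes are assumptions made for illustration.

```python
import torch

def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent step of a gated delta-rule memory (single head).

    S:     (d_v, d_k) associative memory state
    q, k:  (d_k,) query / key (k assumed unit-norm)
    v:     (d_v,) value
    alpha: scalar gate in (0, 1); small alpha erases the memory quickly
    beta:  scalar writing strength in (0, 1) for the delta update
    """
    v_old = S @ k                                   # value currently stored under k
    S = alpha * (S - beta * torch.outer(v_old, k))  # gated decay + targeted erase under k
    S = S + beta * torch.outer(v, k)                # write the new value under k
    o = S @ q                                       # read-out for this step
    return S, o

def gated_delta_rule(q, k, v, alpha, beta):
    """Sequential reference over a length-T sequence (no chunkwise parallelism)."""
    T, d_k = k.shape
    d_v = v.shape[-1]
    S = torch.zeros(d_v, d_k)
    outs = []
    for t in range(T):
        S, o = gated_delta_step(S, q[t], k[t], v[t], alpha[t], beta[t])
        outs.append(o)
    return torch.stack(outs)

# Tiny usage example with random inputs (shapes are illustrative only).
T, d_k, d_v = 8, 4, 4
q = torch.randn(T, d_k)
k = torch.nn.functional.normalize(torch.randn(T, d_k), dim=-1)
v = torch.randn(T, d_v)
alpha = torch.rand(T)   # per-token gates
beta = torch.rand(T)    # per-token writing strengths
y = gated_delta_rule(q, k, v, alpha, beta)
print(y.shape)          # torch.Size([8, 4])
```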
Related papers
- TransMamba: Flexibly Switching between Transformer and Mamba [43.20757187382281]
This paper proposes TransMamba, a framework that unifies Transformer and Mamba.
We show that TransMamba achieves superior training efficiency and performance compared to baselines.
arXiv Detail & Related papers (2025-03-31T13:26:24Z)
- Sliding Window Attention Training for Efficient Large Language Models [55.56483740523027]
We introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training.
This paper first attributes the inefficiency of Transformers to the attention sink phenomenon resulting from the high variance of the softmax operation.
Experiments demonstrate that SWAT achieves state-of-the-art performance compared with leading linear recurrent architectures on eight benchmarks.
arXiv Detail & Related papers (2025-02-26T05:31:44Z)
- DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products [63.66021758150632]
Linear Recurrent Neural Networks (linear RNNs) have emerged as competitive alternatives to Transformers for sequence modeling.
Existing architectures face a fundamental trade-off between expressivity and efficiency, dictated by the structure of their state-transition matrices.
We introduce DeltaProduct, which takes multiple ($n_h$) steps per token and achieves superior state-tracking and language modeling capabilities; a minimal sketch of the Householder-product construction appears after this list.
arXiv Detail & Related papers (2025-02-14T16:59:05Z)
- Birdie: Advancing State Space Models with Reward-Driven Objectives and Curricula [23.071384759427072]
State space models (SSMs) offer advantages over Transformers but struggle with tasks requiring long-range in-context retrieval, such as text copying, associative recall, and question answering over long contexts.
We propose a novel training procedure, Birdie, that significantly enhances the in-context retrieval capabilities of SSMs without altering their architecture.
arXiv Detail & Related papers (2024-11-01T21:01:13Z)
- Gated Slot Attention for Efficient Linear-Time Sequence Modeling [59.019501274074564]
Gated Slot Attention (GSA) enhances Attention with Bounded-memory-Control (ABC) by incorporating a gating mechanism inspired by Gated Linear Attention (GLA).
arXiv Detail & Related papers (2024-09-11T09:49:50Z)
- SIGMA: Selective Gated Mamba for Sequential Recommendation [56.85338055215429]
Mamba, a recent advancement, has exhibited exceptional performance in time series prediction.
We introduce Selective Gated Mamba (SIGMA), a new framework for sequential recommendation.
Our results indicate that SIGMA outperforms current models on five real-world datasets.
arXiv Detail & Related papers (2024-08-21T09:12:59Z)
- DiM-Gesture: Co-Speech Gesture Generation with Adaptive Layer Normalization Mamba-2 framework [2.187990941788468]
DiM-Gesture is a generative model crafted to create highly personalized 3D full-body gestures solely from raw speech audio.
The model integrates a Mamba-based fuzzy feature extractor with a non-autoregressive Adaptive Layer Normalization (AdaLN) Mamba-2 diffusion architecture.
arXiv Detail & Related papers (2024-08-01T08:22:47Z)
- Parallelizing Linear Transformers with the Delta Rule over Sequence Length [49.88826673324244]
This work describes a hardware-efficient algorithm for training linear transformers with the delta rule.
We train a 1.3B model for 100B tokens and find that it outperforms recent linear-time baselines.
arXiv Detail & Related papers (2024-06-10T17:24:42Z)
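The DeltaProduct entry above (Householder products, $n_h$ steps per token) suggests building each token's state-transition matrix as a product of rank-1 "generalized Householder" factors $(I - \beta_i k_i k_i^\top)$. The sketch below illustrates that construction under that assumption; the function name, shapes, and the $\beta$ range are illustrative only and not taken from the paper.

```python
import torch

def householder_product_transition(ks, betas):
    """Build a state-transition matrix as a product of n_h generalized
    Householder factors (I - beta_i * k_i k_i^T).

    ks:    (n_h, d) keys, assumed unit-norm
    betas: (n_h,)   per-step strengths, assumed in (0, 2)
    """
    d = ks.shape[-1]
    A = torch.eye(d)
    for k_i, b_i in zip(ks, betas):
        A = (torch.eye(d) - b_i * torch.outer(k_i, k_i)) @ A
    return A

# Each token applies n_h rank-1 factors instead of a single delta-rule step,
# giving the transition matrix more expressive structure than one factor alone.
n_h, d = 3, 5
ks = torch.nn.functional.normalize(torch.randn(n_h, d), dim=-1)
betas = torch.rand(n_h) * 2   # values up to 2 allow reflection-like factors
A = householder_product_transition(ks, betas)
print(A.shape)                # torch.Size([5, 5])
```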