Gated Delta Networks: Improving Mamba2 with Delta Rule
- URL: http://arxiv.org/abs/2412.06464v1
- Date: Mon, 09 Dec 2024 13:09:04 GMT
- Title: Gated Delta Networks: Improving Mamba2 with Delta Rule
- Authors: Songlin Yang, Jan Kautz, Ali Hatamizadeh
- Abstract summary: Gated DeltaNet consistently surpasses existing models like Mamba2 and DeltaNet across multiple benchmarks.
We develop hybrid architectures that combine Gated DeltaNet layers with sliding window attention or Mamba2 layers, achieving both improved training efficiency and superior task performance.
- Score: 64.58149707073915
- Abstract: Linear Transformers have gained attention as efficient alternatives to standard Transformers, but their performance in retrieval and long-context tasks has been limited. To address these limitations, recent work has explored two distinct mechanisms: gating for adaptive memory control and the delta update rule for precise memory modifications. We observe that these mechanisms are complementary: gating enables rapid memory erasure while the delta rule facilitates targeted updates. Building on this insight, we introduce the gated delta rule and develop a parallel training algorithm optimized for modern hardware. Our proposed architecture, Gated DeltaNet, consistently surpasses existing models like Mamba2 and DeltaNet across multiple benchmarks, including language modeling, common-sense reasoning, in-context retrieval, length extrapolation, and long-context understanding. We further enhance performance by developing hybrid architectures that combine Gated DeltaNet layers with sliding window attention or Mamba2 layers, achieving both improved training efficiency and superior task performance.
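To make the "gating + delta rule" combination described in the abstract concrete, below is a minimal sketch of a gated delta update in its sequential, recurrent form. The state layout, scalar gate parameterization, and key normalization are illustrative assumptions inferred from the abstract; the paper itself trains with a hardware-optimized parallel algorithm rather than a step-by-step loop like this.

```python
# Minimal sketch of a gated delta-rule recurrence (illustrative only).
# Assumptions: scalar gates alpha_t (forget) and beta_t (write strength)
# per time step, L2-normalized keys, and a (d_k, d_v) associative state.
import torch
import torch.nn.functional as F


def gated_delta_rule(q, k, v, alpha, beta):
    """q, k: (T, d_k); v: (T, d_v); alpha, beta: (T,) with values in (0, 1)."""
    T, d_k = k.shape
    d_v = v.shape[-1]
    S = torch.zeros(d_k, d_v)              # associative memory state
    k = F.normalize(k, dim=-1)             # assumed key normalization
    outputs = []
    for t in range(T):
        k_t, v_t, q_t = k[t], v[t], q[t]
        v_old = S.T @ k_t                  # value currently stored under k_t
        # The gate decays the old memory; the delta term overwrites only the
        # slot addressed by k_t with a beta-weighted step toward v_t.
        S = alpha[t] * (S - beta[t] * torch.outer(k_t, v_old)) \
            + beta[t] * torch.outer(k_t, v_t)
        outputs.append(S.T @ q_t)          # read out with the query
    return torch.stack(outputs)
```

In this sketch, an alpha near zero erases the memory quickly, while beta controls how strongly the value stored under a given key is replaced, reflecting the complementarity between gating and the delta rule that the abstract highlights.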
Related papers
- DeltaProduct: Increasing the Expressivity of DeltaNet Through Products of Householders [63.66021758150632]
Linear Recurrent Neural Networks (linear RNNs) have emerged as competitive alternatives to Transformers for sequence modeling.
Existing architectures face a fundamental trade-off between expressivity and efficiency, dictated by the structure of their state-transition matrices.
arXiv Detail & Related papers (2025-02-14T16:59:05Z)
- Birdie: Advancing State Space Models with Reward-Driven Objectives and Curricula [23.071384759427072]
State space models (SSMs) offer advantages over Transformers but struggle with tasks requiring long-range in-context retrieval, such as text copying, associative recall, and question answering over long contexts.
We propose a novel training procedure, Birdie, that significantly enhances the in-context retrieval capabilities of SSMs without altering their architecture.
arXiv Detail & Related papers (2024-11-01T21:01:13Z)
- Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
In-Context Learning (ICL) and Parameter-Efficient Fine-Tuning (PEFT) are currently two mainstream methods for adapting LLMs to downstream tasks.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z)
- Gated Slot Attention for Efficient Linear-Time Sequence Modeling [59.019501274074564]
Gated Slot Attention (GSA) enhances Attention with Bounded-memory Control (ABC).
GSA incorporates a gating mechanism inspired by Gated Linear Attention (GLA).
arXiv Detail & Related papers (2024-09-11T09:49:50Z)
- SIGMA: Selective Gated Mamba for Sequential Recommendation [56.85338055215429]
Mamba, a recent advancement, has exhibited exceptional performance in time series prediction.
We introduce a new framework named Selective Gated Mamba (SIGMA) for Sequential Recommendation.
Our results indicate that SIGMA outperforms current models on five real-world datasets.
arXiv Detail & Related papers (2024-08-21T09:12:59Z)
- DiM-Gesture: Co-Speech Gesture Generation with Adaptive Layer Normalization Mamba-2 framework [2.187990941788468]
DiM-Gesture is a generative model crafted to create highly personalized 3D full-body gestures solely from raw speech audio.
The model integrates a Mamba-based fuzzy feature extractor with a non-autoregressive Adaptive Layer Normalization (AdaLN) Mamba-2 diffusion architecture.
arXiv Detail & Related papers (2024-08-01T08:22:47Z)
- Parallelizing Linear Transformers with the Delta Rule over Sequence Length [49.88826673324244]
This work describes a hardware-efficient algorithm for training linear transformers with the delta rule.
We train a 1.3B model for 100B tokens and find that it outperforms recent linear-time baselines.
arXiv Detail & Related papers (2024-06-10T17:24:42Z)
- Efficient Sparsely Activated Transformers [0.34410212782758054]
Transformer-based neural networks have achieved state-of-the-art task performance in a number of machine learning domains.
Recent work has explored the integration of dynamic behavior into these networks in the form of mixture-of-expert layers.
We introduce a novel system named PLANER that takes an existing Transformer-based network and a user-defined latency target.
arXiv Detail & Related papers (2022-08-31T00:44:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.