Mimetic Initialization Helps State Space Models Learn to Recall
- URL: http://arxiv.org/abs/2410.11135v1
- Date: Mon, 14 Oct 2024 23:17:46 GMT
- Title: Mimetic Initialization Helps State Space Models Learn to Recall
- Authors: Asher Trockman, Hrayr Harutyunyan, J. Zico Kolter, Sanjiv Kumar, Srinadh Bhojanapalli
- Abstract summary: Recent work has shown that state space models such as Mamba are significantly worse than Transformers on recall-based tasks.
We investigate whether their poor copying and recall performance could be due in part to training difficulties rather than fundamental capacity constraints.
- Score: 81.43140985343358
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work has shown that state space models such as Mamba are significantly worse than Transformers on recall-based tasks due to the fact that their state size is constant with respect to their input sequence length. But in practice, state space models have fairly large state sizes, and we conjecture that they should be able to perform much better at these tasks than previously reported. We investigate whether their poor copying and recall performance could be due in part to training difficulties rather than fundamental capacity constraints. Based on observations of their "attention" maps, we propose a structured initialization technique that allows state space layers to more readily mimic attention. Across a variety of architecture settings, our initialization makes it substantially easier for Mamba to learn to copy and do associative recall from scratch.
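As an illustration of the kind of "attention" map the abstract refers to, here is a minimal sketch (not the paper's actual method or parameterization): a causal layer with query/key projections materializes a token-by-token interaction matrix, and an identity-like initialization of those projections makes the matrix concentrate on the diagonal, i.e. more readily mimic attention. The matrix `X` and projections `Wq`, `Wk` are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 8                      # sequence length, model dimension
X = rng.normal(size=(T, d))      # token embeddings

def attention_map(Wq, Wk):
    """Causal 'attention' matrix A[t, s] = q_t . k_s for s <= t."""
    Q, K = X @ Wq, X @ Wk
    return np.tril(Q @ K.T)      # lower-triangular = causal mask

# Random init: the interaction matrix has no particular structure.
A_rand = attention_map(rng.normal(size=(d, d)), rng.normal(size=(d, d)))

# Mimetic-style init (illustrative only): Wq = Wk = I makes
# A[t, s] = x_t . x_s, which peaks on the diagonal (self-similarity),
# giving the layer an attention-like starting point.
A_mim = attention_map(np.eye(d), np.eye(d))

diag_mass = np.abs(np.diag(A_mim)).sum() / np.abs(A_mim).sum()
print(f"fraction of mass on the diagonal: {diag_mass:.2f}")
```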
Related papers
- A Separable Self-attention Inspired by the State Space Model for Computer Vision [9.958579689420253]
Mamba is an efficient State Space Model with linear computational complexity.
Recent studies have shown that there is a rich theoretical connection between state space models and attention variants.
We propose a novel separable self-attention method, introducing for the first time some of Mamba's excellent design concepts into separable self-attention.
arXiv Detail & Related papers (2025-01-03T15:23:36Z) - Taipan: Efficient and Expressive State Space Language Models with Selective Attention [100.16383527459429]
Long-context language modeling is a significant challenge in Natural Language Processing (NLP).
Recent State Space Models (SSMs) such as Mamba offer alternatives with constant memory usage, but they underperform in tasks requiring extensive in-context retrieval.
We introduce Taipan, a novel hybrid architecture that combines Mamba-2 with Selective Attention Layers (SALs).
Our experiments demonstrate Taipan's superior performance across various scales and tasks, offering a promising solution for efficient long-context language modeling.
arXiv Detail & Related papers (2024-10-24T09:25:37Z) - Revealing and Mitigating the Local Pattern Shortcuts of Mamba [25.19835905377437]
We introduce a global selection module into the Mamba model to address this issue.
With the introduction of only 4M extra parameters, our approach enables the Mamba model (130M) to achieve a significant improvement on tasks with distributed information.
arXiv Detail & Related papers (2024-10-21T06:42:11Z) - Stuffed Mamba: Oversized States Lead to the Inability to Forget [69.36377985746878]
We show that Mamba-based models struggle to effectively forget earlier tokens even with built-in forgetting mechanisms.
We show that the minimum training length required for the model to learn forgetting scales linearly with the state size, and the maximum context length for accurate retrieval of a 5-digit passkey scales exponentially with the state size.
Our work suggests that future RNN designs must account for the interplay between state size, training length, and forgetting mechanisms to achieve robust performance in long-context tasks.
arXiv Detail & Related papers (2024-10-09T17:54:28Z) - OccMamba: Semantic Occupancy Prediction with State Space Models [16.646162677831985]
We present the first Mamba-based network for semantic occupancy prediction, termed OccMamba.
We present a simple yet effective 3D-to-1D reordering operation, i.e., height-prioritized 2D Hilbert expansion.
OccMamba achieves state-of-the-art performance on three prevalent occupancy prediction benchmarks.
arXiv Detail & Related papers (2024-08-19T10:07:00Z) - B'MOJO: Hybrid State Space Realizations of Foundation Models with Eidetic and Fading Memory [91.81390121042192]
We develop a class of models called B'MOJO to seamlessly combine eidetic and fading memory within a composable module.
B'MOJO's ability to modulate eidetic and fading memory results in better inference on longer sequences tested up to 32K tokens.
arXiv Detail & Related papers (2024-07-08T18:41:01Z) - Mamba YOLO: A Simple Baseline for Object Detection with State Space Model [10.44725284994877]
YOLO series has set a new benchmark for real-time object detectors.
Transformer-based structures have emerged as the most powerful solution.
However, the quadratic complexity of the self-attention mechanism increases the computational burden.
We introduce a simple yet effective baseline approach called Mamba YOLO.
arXiv Detail & Related papers (2024-06-09T15:56:19Z) - Simple linear attention language models balance the recall-throughput tradeoff [40.08746299497935]
We propose BASED, a simple architecture combining linear and sliding window attention.
We train language models up to 1.3b parameters and show that BASED matches the strongest sub-quadratic models in perplexity and outperforms them on real-world recall-intensive tasks by 6.22 accuracy points.
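A minimal sketch of the combination this abstract describes, under stated assumptions: BASED's actual design uses a Taylor-approximation feature map, whereas the `elu(x) + 1` feature map, window size, and random inputs below are illustrative stand-ins. The idea is to pair exact softmax attention over a small local window with cheap linear attention that carries a global summary.

```python
import numpy as np

def causal_linear_attention(Q, K, V):
    """O(T d^2) causal linear attention with feature map phi(x) = elu(x) + 1."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1 > 0
    Q, K = phi(Q), phi(K)
    T, d = Q.shape
    out = np.zeros_like(V)
    S = np.zeros((d, V.shape[1]))    # running sum of outer(k_s, v_s)
    z = np.zeros(d)                  # running sum of k_s (normalizer)
    for t in range(T):
        S += np.outer(K[t], V[t])
        z += K[t]
        out[t] = (Q[t] @ S) / (Q[t] @ z + 1e-6)
    return out

def sliding_window_attention(Q, K, V, w=2):
    """Exact softmax attention restricted to the last w tokens."""
    T, d = Q.shape
    out = np.zeros_like(V)
    for t in range(T):
        lo = max(0, t - w + 1)
        scores = Q[t] @ K[lo:t + 1].T / np.sqrt(d)
        p = np.exp(scores - scores.max())
        p /= p.sum()
        out[t] = p @ V[lo:t + 1]
    return out

rng = np.random.default_rng(1)
T, d = 8, 4
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
# Hybrid: precise local mixing (window) plus a cheap global summary (linear).
y = sliding_window_attention(Q, K, V) + causal_linear_attention(Q, K, V)
print(y.shape)
```

The sliding window keeps per-token cost constant while the linear-attention state gives every token access to the whole prefix, which is the recall-throughput tradeoff the title refers to.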
arXiv Detail & Related papers (2024-02-28T19:28:27Z) - PointMamba: A Simple State Space Model for Point Cloud Analysis [65.59944745840866]
We propose PointMamba, transferring the success of Mamba, a recent representative state space model (SSM), from NLP to point cloud analysis tasks.
Unlike traditional Transformers, PointMamba employs a linear-complexity algorithm, providing global modeling capacity while significantly reducing computational costs.
arXiv Detail & Related papers (2024-02-16T14:56:13Z) - Repeat After Me: Transformers are Better than State Space Models at Copying [53.47717661441142]
We show that while generalized state space models are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks that require copying from the input context.
arXiv Detail & Related papers (2024-02-01T21:44:11Z) - Mamba: Linear-Time Sequence Modeling with Selective State Spaces [31.985243136674146]
Foundation models are almost universally based on the Transformer architecture and its core attention module.
We identify that a key weakness of such models is their inability to perform content-based reasoning.
We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba).
As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics.
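The selectivity mechanism can be sketched in a few lines, with the caveat that this is not Mamba's actual parameterization (which discretizes a continuous-time system with an input-dependent step size); it only shows the shape of the idea, that the recurrence coefficients are functions of the current token rather than fixed. The weights `W_a`, `W_b`, `W_c` are hypothetical.

```python
import numpy as np

def selective_ssm(x, W_a, W_b, W_c):
    """Minimal diagonal selective-SSM sketch (illustrative, not Mamba's exact form).

    h_t = a_t * h_{t-1} + b_t        (elementwise per state channel)
    y_t = c . h_t
    where a_t and b_t depend on the current input x_t ("selectivity").
    """
    T = len(x)
    n = W_a.shape[0]
    h = np.zeros(n)
    y = np.zeros(T)
    for t in range(T):
        a = 1.0 / (1.0 + np.exp(-(W_a * x[t])))  # input-dependent decay in (0, 1)
        b = W_b * x[t]                           # input-dependent write strength
        h = a * h + b                            # O(n) state update per token
        y[t] = W_c @ h                           # readout
    return y

rng = np.random.default_rng(2)
n = 4
y = selective_ssm(rng.normal(size=10), rng.normal(size=n),
                  rng.normal(size=n), rng.normal(size=n))
print(y.shape)
```

Because `a_t` and `b_t` depend on `x_t`, the model can choose to retain or overwrite state content token by token, which is what enables content-based reasoning with a constant-size state.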
arXiv Detail & Related papers (2023-12-01T18:01:34Z) - CNN with large memory layers [2.368995563245609]
This work is centred around the recently proposed product key memory structure, implemented for a number of computer vision applications.
The memory structure can be regarded as a simple computation primitive suitable to be augmented to nearly all neural network architectures.
arXiv Detail & Related papers (2021-01-27T20:58:20Z) - Sparse Graphical Memory for Robust Planning [93.39298821537197]
We introduce Sparse Graphical Memory (SGM), a new data structure that stores states and feasible transitions in a sparse memory.
SGM aggregates states according to a novel two-way consistency objective, adapting classic state aggregation criteria to goal-conditioned RL.
We show that SGM significantly outperforms current state of the art methods on long horizon, sparse-reward visual navigation tasks.
arXiv Detail & Related papers (2020-03-13T17:59:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.