Hungry Hungry Hippos: Towards Language Modeling with State Space Models
- URL: http://arxiv.org/abs/2212.14052v3
- Date: Sat, 29 Apr 2023 03:18:40 GMT
- Title: Hungry Hungry Hippos: Towards Language Modeling with State Space Models
- Authors: Daniel Y. Fu, Tri Dao, Khaled K. Saab, Armin W. Thomas, Atri Rudra,
Christopher Ré
- Abstract summary: State space models (SSMs) have demonstrated state-of-the-art sequence modeling performance in some modalities, but underperform attention in language modeling.
We make progress on understanding the expressivity gap between SSMs and attention in language modeling, and on reducing the hardware barrier between SSMs and attention.
- Score: 17.412372994222114
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State space models (SSMs) have demonstrated state-of-the-art sequence
modeling performance in some modalities, but underperform attention in language
modeling. Moreover, despite scaling nearly linearly in sequence length instead
of quadratically, SSMs are still slower than Transformers due to poor hardware
utilization. In this paper, we make progress on understanding the expressivity
gap between SSMs and attention in language modeling, and on reducing the
hardware barrier between SSMs and attention. First, we use synthetic language
modeling tasks to understand the gap between SSMs and attention. We find that
existing SSMs struggle with two capabilities: recalling earlier tokens in the
sequence and comparing tokens across the sequence. To understand the impact on
language modeling, we propose a new SSM layer, H3, that is explicitly designed
for these abilities. H3 matches attention on the synthetic languages and comes
within 0.4 PPL of Transformers on OpenWebText. Furthermore, a hybrid
125M-parameter H3-attention model that retains two attention layers
surprisingly outperforms Transformers on OpenWebText by 1.0 PPL. Next, to
improve the efficiency of training SSMs on modern hardware, we propose
FlashConv. FlashConv uses a fused block FFT algorithm to improve efficiency on
sequences up to 8K, and introduces a novel state passing algorithm that
exploits the recurrent properties of SSMs to scale to longer sequences.
FlashConv yields 2$\times$ speedup on the Long Range Arena benchmark and allows
hybrid language models to generate text 2.4$\times$ faster than Transformers.
Using FlashConv, we scale hybrid H3-attention language models up to 2.7B
parameters on the Pile and find promising initial results, achieving lower
perplexity than Transformers and outperforming Transformers in zero- and
few-shot learning on a majority of tasks in the SuperGLUE benchmark.
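In the paper, the H3 layer stacks two SSMs (a shift SSM and a diagonal SSM) with multiplicative interactions between query, key, and value projections of the input; because an SSM with fixed parameters is equivalent to a long convolution (with kernel k_t = C A^t B), the layer can be computed with FFT-based convolutions, the operation FlashConv accelerates. The following is a minimal NumPy sketch of that structure, not the paper's implementation: the function and parameter names are illustrative, and the convolution kernels are random stand-ins for the parameterized SSM kernels.

```python
import numpy as np

def fft_conv(u, k):
    """Causal 1-D convolution of a length-L signal with a length-L kernel in
    O(L log L) via the FFT; zero-padding to 2L avoids circular wrap-around."""
    L = u.shape[-1]
    n = 2 * L
    return np.fft.irfft(np.fft.rfft(u, n=n) * np.fft.rfft(k, n=n), n=n)[..., :L]

def h3_layer_sketch(x, W_q, W_k, W_v, k_shift, k_diag, W_o):
    """Illustrative H3-style layer on an (L, d) input: a shift SSM over K,
    a multiplicative interaction with V, a diagonal SSM over the product,
    and a final multiplicative gate by Q."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v        # (L, d) projections
    k_s = fft_conv(k.T, k_shift).T             # shift SSM applied as a convolution
    kv = fft_conv((k_s * v).T, k_diag).T       # diagonal SSM over K*V
    return (q * kv) @ W_o                      # gate by Q, then output projection

# toy usage with random stand-in weights and kernels
rng = np.random.default_rng(0)
L, d = 1024, 64
x = rng.standard_normal((L, d))
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
k_shift, k_diag = rng.standard_normal(L) * 0.05, rng.standard_normal(L) * 0.05
y = h3_layer_sketch(x, Wq, Wk, Wv, k_shift, k_diag, Wo)   # shape (1024, 64)
```

The state-passing idea can be pictured as follows: because an SSM is a linear recurrence, a long sequence can be split into chunks where only the small state vector crosses chunk boundaries, so each chunk can be handled by a fast kernel (a fused block-FFT convolution in the paper) plus a correction for the incoming state. The toy sketch below carries the state across chunks but, for clarity, uses the plain recurrence inside each chunk rather than the paper's fused kernel.

```python
import numpy as np

def ssm_chunk(A, B, C, D, u_chunk, x0):
    """Run the recurrence x_t = A x_{t-1} + B u_t, y_t = C x_t + D u_t over one
    chunk of a scalar input stream, starting from state x0."""
    x, ys = x0, []
    for u_t in u_chunk:
        x = A @ x + B * u_t
        ys.append(C @ x + D * u_t)
    return np.array(ys), x

def ssm_state_passing(A, B, C, D, u, chunk=1024):
    """Process a long input chunk by chunk; only the state vector is passed
    between chunks, so the per-chunk computation is free to use any fast kernel."""
    x, outs = np.zeros(A.shape[0]), []
    for s in range(0, len(u), chunk):
        y, x = ssm_chunk(A, B, C, D, u[s:s + chunk], x)
        outs.append(y)
    return np.concatenate(outs)
```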
Related papers
- FTMoMamba: Motion Generation with Frequency and Text State Space Models [53.60865359814126]
We propose a novel diffusion-based FTMoMamba framework equipped with a Frequency State Space Model and a Text State Space Model.
To learn fine-grained representation, FreqSSM decomposes sequences into low-frequency and high-frequency components.
To ensure the consistency between text and motion, TextSSM encodes text features at the sentence level.
arXiv Detail & Related papers (2024-11-26T15:48:12Z)
- Flash STU: Fast Spectral Transform Units [19.889367504937177]
Flash STU is a hybrid model that interleaves spectral state space model layers with sliding window attention.
We evaluate the Flash STU and its variants on diverse sequence prediction tasks, including linear dynamical systems, robotics control, and language modeling.
arXiv Detail & Related papers (2024-09-16T17:22:34Z)
- Longhorn: State Space Models are Amortized Online Learners [51.10124201221601]
State-space models (SSMs) offer linear decoding efficiency while maintaining parallelism during training.
In this work, we explore SSM design through the lens of online learning, conceptualizing SSMs as meta-modules for specific online learning problems.
We introduce a novel deep SSM architecture, Longhorn, whose update resembles the closed-form solution for solving the online associative recall problem.
arXiv Detail & Related papers (2024-07-19T11:12:08Z)
- Tandem Transformers for Inference Efficient LLMs [49.75726447408795]
We introduce a novel architecture, Tandem transformers, to address these issues.
This architecture uniquely combines a small autoregressive model and a large model operating in block mode.
On the PaLM2 pretraining dataset, a tandem of PaLM2-Bison and PaLM2-Gecko demonstrates a 3.3% improvement in next-token prediction accuracy.
arXiv Detail & Related papers (2024-02-13T18:24:08Z)
- Repeat After Me: Transformers are Better than State Space Models at Copying [53.47717661441142]
We show that while generalized state space models are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks that require copying from the input context.
arXiv Detail & Related papers (2024-02-01T21:44:11Z)
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces [31.985243136674146]
Foundation models are almost universally based on the Transformer architecture and its core attention module.
We identify that a key weakness of such models is their inability to perform content-based reasoning.
We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba).
As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics.
arXiv Detail & Related papers (2023-12-01T18:01:34Z)
- Block-State Transformers [41.57016890030355]
State space models (SSMs) have shown impressive results on tasks that require modeling long-range dependencies.
We propose a hybrid layer named Block-State Transformer (BST) that internally combines an SSM sublayer for long-range contextualization with a Block Transformer sublayer for short-term representation of sequences.
We show that our model outperforms similar Transformer-based architectures on language modeling perplexity and generalizes to longer sequences.
arXiv Detail & Related papers (2023-06-15T22:48:08Z)
- Multi-Head State Space Model for Speech Recognition [44.04124537862432]
State space models (SSMs) have recently shown promising results on small-scale sequence and language modelling tasks.
In this paper, we propose a multi-head state space (MH-SSM) architecture equipped with special gating mechanisms.
As a drop-in replacement for multi-head attention in transformer encoders, this new model significantly outperforms the transformer transducer on the LibriSpeech speech recognition corpus.
arXiv Detail & Related papers (2023-05-21T16:28:57Z)
- Efficient Long Sequence Modeling via State Space Augmented Transformer [92.74707853711374]
We propose SPADE, short for State sPace AugmenteD TransformEr.
We augment the bottom layer of SPADE with an SSM and employ efficient local attention methods for the other layers.
Experimental results on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-12-15T20:51:27Z)
- Long-Short Transformer: Efficient Transformers for Language and Vision [97.2850205384295]
Long-Short Transformer (Transformer-LS) is an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks.
It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations.
Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification.
arXiv Detail & Related papers (2021-07-05T18:00:14Z)