Long Range Language Modeling via Gated State Spaces
- URL: http://arxiv.org/abs/2206.13947v2
- Date: Wed, 29 Jun 2022 18:47:29 GMT
- Title: Long Range Language Modeling via Gated State Spaces
- Authors: Harsh Mehta, Ankit Gupta, Ashok Cutkosky, Behnam Neyshabur
- Abstract summary: We focus on autoregressive sequence modeling over English books, Github source code and ArXiv mathematics articles.
We propose a new layer named Gated State Space (GSS) and show that it trains significantly faster than the diagonal version of S4.
- Score: 67.64091993846269
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State space models have been shown to be effective at modeling long range
dependencies, especially on sequence classification tasks. In this work we focus
on autoregressive sequence modeling over English books, Github source code and
ArXiv mathematics articles. Based on recent developments around the
effectiveness of gated activation functions, we propose a new layer named Gated
State Space (GSS) and show that it trains significantly faster than the
diagonal version of S4 (i.e. DSS) on TPUs, is fairly competitive with several
well-tuned Transformer-based baselines and exhibits zero-shot generalization to
longer inputs while being straightforward to implement. Finally, we show that
leveraging self-attention to model local dependencies improves the performance
of GSS even further.
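To make the mechanism concrete, below is a minimal sketch (in JAX) of a GSS-style block: a diagonal state space convolution applied to a contracted branch, gated elementwise by a GELU branch, with a residual connection. The parameter names, dimensions, the single shared kernel, and the omitted layer normalization are simplifying assumptions for illustration, not the authors' exact implementation.

```python
import jax
import jax.numpy as jnp


def dss_kernel(log_dt, lam_re, lam_im, w, seq_len):
    """Convolution kernel of a diagonal state space model (DSS-style)."""
    lam = -jnp.exp(lam_re) + 1j * lam_im      # diagonal state matrix with negative real part
    dt = jnp.exp(log_dt)                      # positive step size
    pos = jnp.arange(seq_len)
    # k[l] = sum_n w_n * exp(lam_n * dt * l)
    k = jnp.einsum("n,nl->l", w, jnp.exp(lam[:, None] * dt * pos[None, :]))
    return k.real


def causal_conv(u, k):
    """Causal convolution of a length-L signal with a length-L kernel via FFT."""
    L = u.shape[0]
    n = 2 * L
    y = jnp.fft.irfft(jnp.fft.rfft(u, n=n) * jnp.fft.rfft(k, n=n), n=n)
    return y[:L]


def gss_block(params, x):
    """x: [L, d_model] -> [L, d_model]; gated DSS branch with a residual connection."""
    u = jax.nn.gelu(x @ params["W_gate"])     # [L, d_ff] gating branch
    v = jax.nn.gelu(x @ params["W_in"])       # [L, d_ssm] contracted branch
    k = dss_kernel(params["log_dt"], params["lam_re"], params["lam_im"],
                   params["w"], x.shape[0])   # one kernel shared across channels (a simplification)
    y = jax.vmap(causal_conv, in_axes=(1, None), out_axes=1)(v, k)   # [L, d_ssm]
    y = y @ params["W_up"]                    # [L, d_ff]
    return (u * y) @ params["W_out"] + x      # gate, project back to d_model, residual
```

Because the kernel is recomputed for whatever sequence length is requested, the same parameters can in principle be applied to inputs longer than those seen during training, which is the sense in which such layers admit zero-shot generalization to longer inputs.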
Related papers
- DyG-Mamba: Continuous State Space Modeling on Dynamic Graphs [59.434893231950205]
Dynamic graph learning aims to uncover evolutionary laws in real-world systems.
We propose DyG-Mamba, a new continuous state space model for dynamic graph learning.
We show that DyG-Mamba achieves state-of-the-art performance on most datasets.
arXiv Detail & Related papers (2024-08-13T15:21:46Z)
- LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
The self-attention mechanism's computational cost limits its practicality for long sequences.
We propose a new method called LongVQ to compress the global abstraction into a fixed-length codebook.
LongVQ effectively maintains dynamic global and local patterns, which helps compensate for the lack of long-range dependency modeling.
arXiv Detail & Related papers (2024-04-17T08:26:34Z)
- Theoretical Foundations of Deep Selective State-Space Models [13.971499161967083]
Deep SSMs demonstrate outstanding performance across a diverse set of domains.
Recent developments show that if the linear recurrence powering SSMs allows for multiplicative interactions between inputs and hidden states, the resulting models become substantially more expressive.
We show that when random linear recurrences are equipped with simple input-controlled transitions, the hidden state is provably a low-dimensional projection of a powerful mathematical object.
arXiv Detail & Related papers (2024-02-29T11:20:16Z)
- Sparse Modular Activation for Efficient Sequence Modeling [94.11125833685583]
Recent models combining Linear State Space Models with self-attention mechanisms have demonstrated impressive results across a range of sequence modeling tasks.
Current approaches apply attention modules statically and uniformly to all elements in the input sequences, leading to sub-optimal quality-efficiency trade-offs.
We introduce Sparse Modular Activation (SMA), a general mechanism enabling neural networks to sparsely activate sub-modules for sequence elements in a differentiable manner.
arXiv Detail & Related papers (2023-06-19T23:10:02Z)
- Structured State Space Models for In-Context Reinforcement Learning [30.189834820419446]
Structured state space sequence (S4) models have recently achieved state-of-the-art performance on long-range sequence modeling tasks.
We propose a modification to a variant of S4 that enables us to initialise and reset the hidden state in parallel.
We show that our modified architecture runs faster than Transformers with respect to sequence length and performs better than RNNs on a simple memory-based task.
arXiv Detail & Related papers (2023-03-07T15:32:18Z)
- Deep Latent State Space Models for Time-Series Generation [68.45746489575032]
We propose LS4, a generative model for sequences with latent variables evolving according to a state space ODE.
Inspired by recent deep state space models (S4), we achieve speedups by leveraging a convolutional representation of LS4.
We show that LS4 significantly outperforms previous continuous-time generative models in terms of marginal distribution, classification, and prediction scores on real-world datasets.
arXiv Detail & Related papers (2022-12-24T15:17:42Z)
- Efficient Long Sequence Modeling via State Space Augmented Transformer [92.74707853711374]
We propose SPADE, short for State sPace AugmenteD TransformEr.
We place an SSM in the bottom layer of SPADE and employ efficient local attention methods in the other layers (a minimal local-attention sketch appears after this list).
Experimental results on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-12-15T20:51:27Z)
- Diagonal State Spaces are as Effective as Structured State Spaces [3.8276199743296906]
We show that our Diagonal State Space (DSS) model matches the performance of S4 on Long Range Arena tasks and on speech classification on the Speech Commands dataset, while being conceptually simpler and straightforward to implement.
In this work, we show that one can match the performance of S4 even without the low-rank correction, and thus with the state matrices assumed to be diagonal.
arXiv Detail & Related papers (2022-03-27T16:30:33Z)
- Efficiently Modeling Long Sequences with Structured State Spaces [15.456254157293836]
We propose a new sequence model based on a new parameterization for the fundamental state space model.
S4 achieves strong empirical results across a diverse range of established benchmarks, including (i) 91% accuracy on sequential CIFAR-10 with no data augmentation or auxiliary losses, on par with a larger 2-D ResNet.
arXiv Detail & Related papers (2021-10-31T03:32:18Z)
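The abstract's observation that self-attention over local dependencies further improves GSS, and the SPADE entry above, both interleave a state space branch with attention restricted to nearby tokens. Below is a minimal sketch of that idea in the same hypothetical JAX setting as the GSS block after the abstract: single-head self-attention confined to fixed-size chunks, added on top of the gss_block output. The chunk size, single head, and additive combination are illustrative assumptions, not the configuration of either paper.

```python
import jax
import jax.numpy as jnp


def local_attention(params, x, chunk=128):
    """Causal single-head self-attention restricted to non-overlapping chunks."""
    L, d = x.shape
    pad = (-L) % chunk                                            # pad so L divides into chunks
    xp = jnp.pad(x, ((0, pad), (0, 0))).reshape(-1, chunk, d)     # [num_chunks, chunk, d]
    q, k, v = xp @ params["W_q"], xp @ params["W_k"], xp @ params["W_v"]
    scores = jnp.einsum("cqd,ckd->cqk", q, k) / jnp.sqrt(d)
    mask = jnp.tril(jnp.ones((chunk, chunk), dtype=bool))         # causal within a chunk
    scores = jnp.where(mask, scores, -jnp.inf)
    y = jnp.einsum("cqk,ckd->cqd", jax.nn.softmax(scores, axis=-1), v)
    return y.reshape(-1, d)[:L] @ params["W_o"]                   # drop padding, project out


def hybrid_block(params, x):
    """State space branch for long-range context plus chunked attention for local context."""
    x = gss_block(params["gss"], x)                               # gss_block from the sketch above
    return x + local_attention(params["attn"], x)
```

Restricting attention to fixed-size chunks keeps its cost linear in sequence length, so the long-range burden stays on the state space branch.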
This list is automatically generated from the titles and abstracts of the papers in this site.