Block-State Transformers
- URL: http://arxiv.org/abs/2306.09539v4
- Date: Mon, 30 Oct 2023 15:44:04 GMT
- Title: Block-State Transformers
- Authors: Mahan Fathi and Jonathan Pilault and Orhan Firat and Christopher Pal
and Pierre-Luc Bacon and Ross Goroshin
- Abstract summary: State space models (SSMs) have shown impressive results on tasks that require modeling long-range dependencies.
We propose a hybrid layer named Block-State Transformer (BST) that internally combines an SSM sublayer for long-range contextualization with a Block Transformer sublayer for short-term representation of sequences.
We show that our model outperforms similar Transformer-based architectures on language modeling perplexity and generalizes to longer sequences.
- Score: 41.57016890030355
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: State space models (SSMs) have shown impressive results on tasks that require
modeling long-range dependencies and efficiently scale to long sequences owing
to their subquadratic runtime complexity. Originally designed for continuous
signals, SSMs have shown superior performance on a plethora of tasks in vision
and audio; however, SSMs still lag Transformer performance in language modeling
tasks. In this work, we propose a hybrid layer named Block-State Transformer
(BST) that internally combines an SSM sublayer for long-range
contextualization, and a Block Transformer sublayer for short-term
representation of sequences. We study three different, and completely
parallelizable, variants that integrate SSMs and block-wise attention. We show
that our model outperforms similar Transformer-based architectures on language
modeling perplexity and generalizes to longer sequences. In addition, the
Block-State Transformer demonstrates a more than tenfold increase in speed at the
layer level compared to the Block-Recurrent Transformer when model
parallelization is employed.
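To make the hybrid layer concrete, below is a minimal PyTorch sketch of a BST-style layer. It is an illustration under simplifying assumptions, not the authors' implementation: a naive sequential diagonal recurrence stands in for the structured SSM (the paper computes SSM outputs in parallel, e.g. via FFT-based convolution), only one way of feeding SSM context states to the block attention is shown, no causal mask is applied, and all names and sizes (`DiagonalSSM`, `BSTLayer`, `block_len`, ...) are hypothetical.

```python
# Minimal, illustrative BST-style layer: an SSM sublayer provides long-range
# context states, and a block-local attention sublayer provides short-term
# representation. Hypothetical sketch, not the authors' code.
import torch
import torch.nn as nn


class DiagonalSSM(nn.Module):
    """Per-channel linear recurrence h_t = a * h_{t-1} + b * x_t, run
    sequentially here for clarity; real SSM layers use an FFT/convolution
    or a parallel scan to stay subquadratic and fully parallel."""

    def __init__(self, d_model: int):
        super().__init__()
        self.log_a = nn.Parameter(torch.full((d_model,), -0.5))  # decay ~0.61 at init
        self.b = nn.Parameter(torch.ones(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d)
        a = torch.exp(self.log_a)
        h = torch.zeros_like(x[:, 0])
        states = []
        for t in range(x.shape[1]):
            h = a * h + self.b * x[:, t]
            states.append(h)
        return torch.stack(states, dim=1)                 # (batch, seq, d)


class BSTLayer(nn.Module):
    """SSM sublayer for long-range contextualization plus a block-local
    attention sublayer; no causal mask is applied in this sketch."""

    def __init__(self, d_model: int = 64, n_heads: int = 4, block_len: int = 16):
        super().__init__()
        self.block_len = block_len
        self.ssm = DiagonalSSM(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # seq must be a multiple of block_len
        bsz, seq, d = x.shape
        w, n_blocks = self.block_len, seq // self.block_len
        context = self.ssm(x).reshape(bsz, n_blocks, w, d)
        # Each block receives the SSM states of the *previous* block as a
        # summary of everything before it; the first block gets zeros.
        prev = torch.cat([torch.zeros_like(context[:, :1]), context[:, :-1]], dim=1)
        xb = x.reshape(bsz * n_blocks, w, d)               # blocks processed independently
        kv = torch.cat([prev.reshape(bsz * n_blocks, w, d), xb], dim=1)
        out, _ = self.attn(self.norm(xb), kv, kv)          # block attends to context + itself
        return (xb + out).reshape(bsz, seq, d)


if __name__ == "__main__":
    layer = BSTLayer()
    y = layer(torch.randn(2, 64, 64))                      # batch=2, seq=64, d_model=64
    print(y.shape)                                         # torch.Size([2, 64, 64])
```

The structural point the sketch tries to preserve is that the SSM runs once over the whole sequence while each block's attention touches only its own tokens plus a fixed number of context states, so all blocks can be processed in parallel; the paper's three variants differ in how those context states are produced and attended to.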
Related papers
- Enhanced Structured State Space Models via Grouped FIR Filtering and Attention Sink Mechanisms [0.6718184400443239]
We propose an advanced architecture that mitigates challenges by decomposing A-multiplications into multiple groups.
Inspired by the "attention sink" phenomenon identified in streaming language models, we incorporate a similar mechanism to enhance the stability and performance of our model.
arXiv Detail & Related papers (2024-08-01T02:49:58Z)
- Hierarchical Separable Video Transformer for Snapshot Compressive Imaging [46.23615648331571]
Hierarchical Separable Video Transformer (HiSViT) is a reconstruction architecture without temporal aggregation.
HiSViT is built from multiple groups of Cross-Scale Separable Multi-head Self-Attention (CSS-MSA) and Gated Self-Modulated Feed-Forward Network (GSM-FFN).
Our method outperforms previous methods by $>0.5$ dB with comparable or fewer parameters and complexity.
arXiv Detail & Related papers (2024-07-16T17:35:59Z)
- Parallelizing Autoregressive Generation with Variational State Space Models [6.29143368345159]
We propose a variational autoencoder (VAE) where both the encoder and decoder are SSMs.
Since sampling the latent variables and decoding them with the SSM can be parallelized, both training and generation can be conducted in parallel.
The decoder recurrence allows generation to be resumed without reprocessing the whole sequence.
arXiv Detail & Related papers (2024-07-11T11:41:29Z)
- MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations.
Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality.
No previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter count-dominated tasks such as language modeling.
arXiv Detail & Related papers (2024-05-25T03:24:32Z)
- Repeat After Me: Transformers are Better than State Space Models at Copying [53.47717661441142]
We show that while generalized state space models are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks that require copying from the input context.
arXiv Detail & Related papers (2024-02-01T21:44:11Z)
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces [31.985243136674146]
Foundation models are almost universally based on the Transformer architecture and its core attention module.
We identify that a key weakness of such models is their inability to perform content-based reasoning.
We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba).
As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics.
arXiv Detail & Related papers (2023-12-01T18:01:34Z)
- Convolutional State Space Models for Long-Range Spatiotemporal Modeling [65.0993000439043]
ConvS5 is an efficient variant for long-range spatiotemporal modeling.
It significantly outperforms Transformers and ConvLSTM on a long-horizon Moving-MNIST experiment while training 3X faster than ConvLSTM and generating samples 400X faster than Transformers.
arXiv Detail & Related papers (2023-10-30T16:11:06Z)
- Blockwise Parallel Transformer for Large Context Models [70.97386897478238]
Blockwise Parallel Transformer (BPT) leverages blockwise computation of self-attention and feedforward network fusion to minimize memory costs.
By processing longer input sequences while maintaining memory efficiency, BPT enables training sequences 32 times longer than vanilla Transformers and up to 4 times longer than previous memory-efficient methods.
arXiv Detail & Related papers (2023-05-30T19:25:51Z)
- Block-Recurrent Transformers [49.07682696216708]
We introduce the Block-Recurrent Transformer, which applies a transformer layer in a recurrent fashion along a sequence.
Our recurrent cell operates on blocks of tokens rather than single tokens, and leverages parallel computation within a block in order to make efficient use of accelerator hardware (a minimal sketch of this block-recurrent pattern appears after this list).
arXiv Detail & Related papers (2022-03-11T23:44:33Z)
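As referenced above, here is a small, hypothetical PyTorch sketch of the block-recurrent pattern that the Block-Recurrent Transformer entry describes: a transformer-style cell is applied once per block, carrying a small recurrent state between blocks. Names, sizes, and the state update are simplified illustrations, not the paper's implementation (which, for example, uses LSTM-style gating), and no causal mask is applied.

```python
# Illustrative block-recurrent cell: parallel attention within each block,
# sequential state carry across blocks. Hypothetical sketch only.
import torch
import torch.nn as nn


class BlockRecurrentCell(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4, n_state: int = 8):
        super().__init__()
        self.state0 = nn.Parameter(torch.zeros(n_state, d_model))   # learned initial state
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.state_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor, block_len: int = 16) -> torch.Tensor:
        bsz, seq, d = x.shape
        state = self.state0.unsqueeze(0).expand(bsz, -1, -1)         # (batch, n_state, d)
        outputs = []
        for start in range(0, seq, block_len):                       # sequential across blocks
            blk = x[:, start:start + block_len]
            kv = torch.cat([state, blk], dim=1)                      # tokens see state + own block
            out, _ = self.self_attn(blk, kv, kv)
            outputs.append(blk + out)
            state, _ = self.state_attn(state, blk, blk)              # update carried state
        return torch.cat(outputs, dim=1)


if __name__ == "__main__":
    cell = BlockRecurrentCell()
    print(cell(torch.randn(2, 64, 64)).shape)                        # torch.Size([2, 64, 64])
```

The loop makes the contrast with the Block-State Transformer visible: computation is parallel within a block but strictly sequential across blocks, which is the layer-level bottleneck that the BST's SSM sublayer is designed to remove.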
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.