Staircase Attention for Recurrent Processing of Sequences
- URL: http://arxiv.org/abs/2106.04279v1
- Date: Tue, 8 Jun 2021 12:19:31 GMT
- Title: Staircase Attention for Recurrent Processing of Sequences
- Authors: Da Ju, Stephen Roller, Sainbayar Sukhbaatar, Jason Weston
- Abstract summary: Staircase attention operates across the sequence (in time), recurrently processing the input by adding another step of processing.
Due to this recurrence, it is shown to solve tasks that involve tracking, which conventional Transformers cannot.
It is also shown to provide improved modeling power for the same model size (number of parameters) compared to self-attentive Transformers on large language modeling and dialogue tasks, yielding significant perplexity gains.
- Score: 34.53670631387504
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention mechanisms have become a standard tool for sequence modeling tasks,
in particular by stacking self-attention layers over the entire input sequence
as in the Transformer architecture. In this work we introduce a novel attention
procedure called staircase attention that, unlike self-attention, operates
across the sequence (in time), recurrently processing the input by adding
another step of processing. A step in the staircase comprises backward
tokens (encoding the sequence so far seen) and forward tokens (ingesting a new
part of the sequence), or an extreme Ladder version with a forward step of zero
that simply repeats the Transformer on each step of the ladder, sharing the
weights. We thus describe a family of such models that can trade off
performance and compute, by either increasing the amount of recurrence through
time, the amount of sequential processing via recurrence in depth, or both.
Due to this recurrence, staircase attention is shown to be able to solve tasks
that involve tracking, which conventional Transformers cannot. Further, it is
shown to provide improved modeling power for the same size model (number of
parameters) compared to self-attentive Transformers on large language modeling
and dialogue tasks, yielding significant perplexity gains.
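The recurrence described in the abstract can be made concrete with a short sketch. The following minimal PyTorch code is an illustration, not the authors' implementation: the chunk size, hidden size, and use of the stock nn.TransformerEncoder are assumptions. One weight-shared Transformer is applied step by step; each step reads the previous step's output (backward tokens) together with the next chunk of input (forward tokens), and its output is carried forward to the next step.

```python
# Minimal staircase-style recurrence sketch (illustrative hyperparameters).
import torch
import torch.nn as nn


class StaircaseSketch(nn.Module):
    def __init__(self, d_model=64, n_heads=4, n_layers=2, chunk=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # One weight-shared Transformer applied at every step of the staircase.
        self.core = nn.TransformerEncoder(layer, n_layers)
        self.chunk = chunk

    def forward(self, x):  # x: (batch, seq_len, d_model), already embedded
        backward = x.new_zeros(x.size(0), 0, x.size(2))  # state carried through time
        outputs = []
        for start in range(0, x.size(1), self.chunk):
            forward = x[:, start:start + self.chunk]     # forward tokens: new input
            step_in = torch.cat([backward, forward], dim=1)
            step_out = self.core(step_in)
            # The positions just processed become the backward tokens of the
            # next step, so the sequence is processed recurrently in time.
            backward = step_out[:, -forward.size(1):]
            outputs.append(step_out[:, -forward.size(1):])
        return torch.cat(outputs, dim=1)


tokens = torch.randn(2, 32, 64)
print(StaircaseSketch()(tokens).shape)  # torch.Size([2, 32, 64])
```

In the Ladder limit described in the abstract (a forward step of zero), the same shared Transformer would simply be reapplied to its own output, trading recurrence in time for additional sequential processing in depth.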
Related papers
- Looking Beyond The Top-1: Transformers Determine Top Tokens In Order [13.032106683136394]
We analyze the computation performed by Transformers in the layers after the top-1 prediction has become fixed.
We find that these saturation events happen in order of the corresponding tokens' ranking.
We propose an underlying mechanism of task transition for this sequential saturation.
arXiv Detail & Related papers (2024-10-26T16:00:38Z)
- Harnessing Attention Mechanisms: Efficient Sequence Reduction using Attention-based Autoencoders [14.25761027376296]
We introduce a novel attention-based method that allows for the direct manipulation of sequence lengths.
We show that the autoencoder retains all the significant information when reducing the original sequence to half its original size.
arXiv Detail & Related papers (2023-10-23T11:57:44Z)
- Ring Attention with Blockwise Transformers for Near-Infinite Context [88.61687950039662]
We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices.
Our approach enables training and inference of sequences that are up to device count times longer than those achievable by prior memory-efficient Transformers.
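The blockwise computation underlying this approach can be illustrated with a single-process sketch: exact softmax attention is accumulated one key/value block at a time with a running max and normalizer, so the full attention matrix is never materialized. The block size and shapes are illustrative assumptions, and the ring of devices that rotates key/value blocks is omitted.

```python
import torch


def blockwise_attention(q, k, v, block=128):
    # Exact softmax(QK^T / sqrt(d)) V, accumulated one key/value block at a
    # time with a running max and normalizer (no full attention matrix).
    scale = q.size(-1) ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((q.size(0), 1), float("-inf"))
    row_sum = torch.zeros(q.size(0), 1)
    for start in range(0, k.size(0), block):
        kb, vb = k[start:start + block], v[start:start + block]
        scores = (q @ kb.T) * scale                              # (n, block)
        new_max = torch.maximum(row_max, scores.max(-1, keepdim=True).values)
        rescale = torch.exp(row_max - new_max)                   # fix old partials
        p = torch.exp(scores - new_max)
        row_sum = row_sum * rescale + p.sum(-1, keepdim=True)
        out = out * rescale + p @ vb
        row_max = new_max
    return out / row_sum


q, k, v = (torch.randn(512, 64) for _ in range(3))
reference = torch.softmax((q @ k.T) / 64 ** 0.5, dim=-1) @ v
print(torch.allclose(blockwise_attention(q, k, v), reference, atol=1e-4))  # True
```

Because each block's contribution is folded in with a rescaling step, blocks can be processed in any order and held on different hosts, which is what allows key/value blocks to be passed around a ring of devices.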
arXiv Detail & Related papers (2023-10-03T08:44:50Z)
- Chunk, Align, Select: A Simple Long-sequence Processing Method for Transformers [24.109312575970456]
We propose a simple framework to enable off-the-shelf pre-trained transformers to process much longer sequences.
Our method divides each long-sequence input into a batch of chunks, then aligns the inter-chunk information during the encoding steps.
We learn an effective hidden selection policy, which regards the decoders of transformers as environments.
arXiv Detail & Related papers (2023-08-25T05:52:05Z)
- SVIP: Sequence VerIfication for Procedures in Videos [68.07865790764237]
We propose a novel sequence verification task that aims to distinguish positive video pairs performing the same action sequence from negative ones with step-level transformations.
Such a challenging task resides in an open-set setting without prior action detection or segmentation.
We collect a scripted video dataset enumerating all kinds of step-level transformations in chemical experiments.
arXiv Detail & Related papers (2021-12-13T07:03:36Z)
- Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers [42.93754828584075]
We present a new Transformer architecture, Performer, based on Fast Attention Via Orthogonal Random features (FAVOR)
Our mechanism scales linearly rather than quadratically in the number of tokens in the sequence, is characterized by sub-quadratic space complexity and does not incorporate any sparsity pattern priors.
It provides strong theoretical guarantees: unbiased estimation of the attention matrix and uniform convergence.
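A simplified sketch of the random-feature idea follows, using plain i.i.d. Gaussian features rather than the paper's orthogonal construction, with scaling reduced to the essentials: the softmax kernel is approximated by positive feature maps, so attention can be computed as phi(Q) (phi(K)^T V), which is linear in sequence length.

```python
import torch


def feature_map(x, w):
    # Positive random features: phi(x)_j = exp(w_j . x - ||x||^2 / 2) / sqrt(m),
    # so that E[phi(x) . phi(y)] = exp(x . y)  (the softmax kernel).
    return torch.exp(x @ w.T - (x ** 2).sum(-1, keepdim=True) / 2) / w.size(0) ** 0.5


def linear_attention(q, k, v, n_features=256):
    d = q.size(-1)
    w = torch.randn(n_features, d)                 # i.i.d. Gaussian projections
    q_f = feature_map(q / d ** 0.25, w)            # (n, m)
    k_f = feature_map(k / d ** 0.25, w)            # (n, m)
    kv = k_f.T @ v                                 # (m, d): cost linear in n
    normalizer = q_f @ k_f.sum(0, keepdim=True).T  # (n, 1)
    return (q_f @ kv) / normalizer


q, k, v = (0.1 * torch.randn(1024, 64) for _ in range(3))
exact = torch.softmax((q @ k.T) / 64 ** 0.5, dim=-1) @ v
print((linear_attention(q, k, v) - exact).abs().mean())  # approximation error
```

The estimate tightens as the number of random features grows; the paper's orthogonal features are designed to reduce the variance of this estimator further.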
arXiv Detail & Related papers (2020-06-05T17:09:16Z)
- Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing [112.2208052057002]
We propose Funnel-Transformer which gradually compresses the sequence of hidden states to a shorter one.
With comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks.
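The compression idea can be sketched in a few lines: after each block of layers, the hidden-state sequence is pooled to half its length, so later blocks operate on progressively shorter sequences. The pool size, depth, and plain mean pooling here are illustrative assumptions; the actual Funnel-Transformer also adds a decoder to recover token-level outputs.

```python
import torch
import torch.nn as nn

d_model = 64
blocks = nn.ModuleList(
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
        num_layers=2,
    )
    for _ in range(3)
)


def funnel_like(x):  # x: (batch, seq_len, d_model)
    for i, block in enumerate(blocks):
        x = block(x)
        if i < len(blocks) - 1:
            # Strided mean pooling halves the hidden-state sequence between blocks.
            x = nn.functional.avg_pool1d(x.transpose(1, 2), 2, stride=2).transpose(1, 2)
    return x


print(funnel_like(torch.randn(2, 64, d_model)).shape)  # torch.Size([2, 16, 64])
```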
arXiv Detail & Related papers (2020-06-05T05:16:23Z)
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
- Addressing Some Limitations of Transformers with Feedback Memory [51.94640029417114]
Transformers have been successfully applied to sequential, auto-regressive tasks despite being feedforward networks.
We propose the Feedback Transformer architecture that exposes all previous representations to all future representations.
We demonstrate on a variety of benchmarks in language modeling, machine translation, and reinforcement learning that the increased representation capacity can create small, shallow models with much stronger performance than comparable Transformers.
arXiv Detail & Related papers (2020-02-21T16:37:57Z)
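The feedback idea can be sketched as follows: the input is processed one position at a time, the per-step outputs of all layers are merged into a single memory vector, and every layer at future steps attends to that shared memory, so all previous representations are exposed to all future ones. The mean merge and the use of stock decoder layers are assumptions for illustration, not the exact Feedback Transformer parameterization.

```python
import torch
import torch.nn as nn


class FeedbackSketch(nn.Module):
    def __init__(self, d_model=64, n_heads=4, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )

    def forward(self, x):  # x: (batch, seq_len, d_model), processed step by step
        memory, outputs = [], []
        for t in range(x.size(1)):
            h = x[:, t:t + 1]                              # current position only
            mem = torch.cat(memory, dim=1) if memory else h
            step_states = []
            for layer in self.layers:
                # Every layer cross-attends to the shared memory of past steps.
                h = layer(h, mem)
                step_states.append(h)
            # Merge all layers' outputs for this step into one memory vector,
            # exposing them to every layer at every future step.
            memory.append(torch.stack(step_states).mean(dim=0))
            outputs.append(h)
        return torch.cat(outputs, dim=1)


print(FeedbackSketch()(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```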