Staircase Attention for Recurrent Processing of Sequences
- URL: http://arxiv.org/abs/2106.04279v1
- Date: Tue, 8 Jun 2021 12:19:31 GMT
- Title: Staircase Attention for Recurrent Processing of Sequences
- Authors: Da Ju, Stephen Roller, Sainbayar Sukhbaatar, Jason Weston
- Abstract summary: Staircase attention operates across the sequence (in time), recurrently processing the input by adding another step of processing.
Due to this recurrence, it is shown to solve tasks that involve tracking, which conventional Transformers cannot.
It is also shown to provide improved modeling power for the same model size (number of parameters) compared to self-attentive Transformers on large language modeling and dialogue tasks, yielding significant perplexity gains.
- Score: 34.53670631387504
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention mechanisms have become a standard tool for sequence modeling tasks,
in particular by stacking self-attention layers over the entire input sequence
as in the Transformer architecture. In this work we introduce a novel attention
procedure called staircase attention that, unlike self-attention, operates
across the sequence (in time), recurrently processing the input by adding
another step of processing. A step in the staircase comprises backward
tokens (encoding the sequence so far seen) and forward tokens (ingesting a new
part of the sequence), or an extreme Ladder version with a forward step of zero
that simply repeats the Transformer on each step of the ladder, sharing the
weights. We thus describe a family of such models that can trade off
performance and compute, by either increasing the amount of recurrence through
time, the amount of sequential processing via recurrence in depth, or both.
Due to this recurrence, staircase attention is shown to be able to solve tasks
that involve tracking, which conventional Transformers cannot. Further, it is
shown to provide improved modeling power for the same size model (number of
parameters) compared to self-attentive Transformers on large language modeling
and dialogue tasks, yielding significant perplexity gains.
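The recurrence described in the abstract can be made concrete with a short sketch. The following minimal PyTorch code is an illustration, not the authors' implementation: the chunk size, hidden size, and use of the stock nn.TransformerEncoder are assumptions. One weight-shared Transformer is applied step by step; each step reads the previous step's output (backward tokens) together with the next chunk of input (forward tokens), and its output is carried forward to the next step.

```python
# Minimal staircase-style recurrence sketch (illustrative hyperparameters).
import torch
import torch.nn as nn


class StaircaseSketch(nn.Module):
    def __init__(self, d_model=64, n_heads=4, n_layers=2, chunk=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # One weight-shared Transformer applied at every step of the staircase.
        self.core = nn.TransformerEncoder(layer, n_layers)
        self.chunk = chunk

    def forward(self, x):  # x: (batch, seq_len, d_model), already embedded
        backward = x.new_zeros(x.size(0), 0, x.size(2))  # state carried through time
        outputs = []
        for start in range(0, x.size(1), self.chunk):
            forward = x[:, start:start + self.chunk]     # forward tokens: new input
            step_in = torch.cat([backward, forward], dim=1)
            step_out = self.core(step_in)
            # The positions just processed become the backward tokens of the
            # next step, so the sequence is processed recurrently in time.
            backward = step_out[:, -forward.size(1):]
            outputs.append(step_out[:, -forward.size(1):])
        return torch.cat(outputs, dim=1)


tokens = torch.randn(2, 32, 64)
print(StaircaseSketch()(tokens).shape)  # torch.Size([2, 32, 64])
```

In the Ladder limit described in the abstract (a forward step of zero), the same shared Transformer would simply be reapplied to its own output, trading recurrence in time for additional sequential processing in depth.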
Related papers
- Looking Beyond The Top-1: Transformers Determine Top Tokens In Order [13.032106683136394]
We analyze the computation performed by Transformers in the layers after the top-1 prediction has become fixed.
We find that these saturation events happen in order of the corresponding tokens' ranking.
We propose an underlying mechanism of task transition for this sequential saturation.
arXiv Detail & Related papers (2024-10-26T16:00:38Z)
- Harnessing Attention Mechanisms: Efficient Sequence Reduction using Attention-based Autoencoders [14.25761027376296]
We introduce a novel attention-based method that allows for the direct manipulation of sequence lengths.
We show that the autoencoder retains all the significant information when reducing the original sequence to half its original size.
arXiv Detail & Related papers (2023-10-23T11:57:44Z)
- Ring Attention with Blockwise Transformers for Near-Infinite Context [88.61687950039662]
We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices.
Our approach enables training and inference of sequences that are up to device count times longer than those achievable by prior memory-efficient Transformers.
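The blockwise computation underlying this approach can be illustrated with a single-process sketch: exact softmax attention is accumulated one key/value block at a time with a running max and normalizer, so the full attention matrix is never materialized. The block size and shapes are illustrative assumptions, and the ring of devices that rotates key/value blocks is omitted.

```python
import torch


def blockwise_attention(q, k, v, block=128):
    # Exact softmax(QK^T / sqrt(d)) V, accumulated one key/value block at a
    # time with a running max and normalizer (no full attention matrix).
    scale = q.size(-1) ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((q.size(0), 1), float("-inf"))
    row_sum = torch.zeros(q.size(0), 1)
    for start in range(0, k.size(0), block):
        kb, vb = k[start:start + block], v[start:start + block]
        scores = (q @ kb.T) * scale                              # (n, block)
        new_max = torch.maximum(row_max, scores.max(-1, keepdim=True).values)
        rescale = torch.exp(row_max - new_max)                   # fix old partials
        p = torch.exp(scores - new_max)
        row_sum = row_sum * rescale + p.sum(-1, keepdim=True)
        out = out * rescale + p @ vb
        row_max = new_max
    return out / row_sum


q, k, v = (torch.randn(512, 64) for _ in range(3))
reference = torch.softmax((q @ k.T) / 64 ** 0.5, dim=-1) @ v
print(torch.allclose(blockwise_attention(q, k, v), reference, atol=1e-4))  # True
```

Because each block's contribution is folded in with a rescaling step, blocks can be processed in any order and held on different hosts, which is what allows key/value blocks to be passed around a ring of devices.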
arXiv Detail & Related papers (2023-10-03T08:44:50Z)
- Chunk, Align, Select: A Simple Long-sequence Processing Method for Transformers [24.109312575970456]
We propose a simple framework to enable off-the-shelf pre-trained transformers to process much longer sequences.
Our method divides each long-sequence input into a batch of chunks, then aligns the inter-chunk information during the encoding steps.
We learn an effective hidden selection policy, which regards the decoders of transformers as environments.
arXiv Detail & Related papers (2023-08-25T05:52:05Z)
- SVIP: Sequence VerIfication for Procedures in Videos [68.07865790764237]
We propose a novel sequence verification task that aims to distinguish positive video pairs performing the same action sequence from negative ones with step-level transformations.
Such a challenging task resides in an open-set setting without prior action detection or segmentation.
We collect a scripted video dataset enumerating all kinds of step-level transformations in chemical experiments.
arXiv Detail & Related papers (2021-12-13T07:03:36Z)
- Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers [42.93754828584075]
We present a new Transformer architecture, Performer, based on Fast Attention Via Orthogonal Random features (FAVOR)
Our mechanism scales linearly rather than quadratically in the number of tokens in the sequence, is characterized by sub-quadratic space complexity and does not incorporate any sparsity pattern priors.
It provides strong theoretical guarantees: unbiased estimation of the attention matrix and uniform convergence.
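A simplified sketch of the random-feature idea follows, using plain i.i.d. Gaussian features rather than the paper's orthogonal construction, with scaling reduced to the essentials: the softmax kernel is approximated by positive feature maps, so attention can be computed as phi(Q) (phi(K)^T V), which is linear in sequence length.

```python
import torch


def feature_map(x, w):
    # Positive random features: phi(x)_j = exp(w_j . x - ||x||^2 / 2) / sqrt(m),
    # so that E[phi(x) . phi(y)] = exp(x . y)  (the softmax kernel).
    return torch.exp(x @ w.T - (x ** 2).sum(-1, keepdim=True) / 2) / w.size(0) ** 0.5


def linear_attention(q, k, v, n_features=256):
    d = q.size(-1)
    w = torch.randn(n_features, d)                 # i.i.d. Gaussian projections
    q_f = feature_map(q / d ** 0.25, w)            # (n, m)
    k_f = feature_map(k / d ** 0.25, w)            # (n, m)
    kv = k_f.T @ v                                 # (m, d): cost linear in n
    normalizer = q_f @ k_f.sum(0, keepdim=True).T  # (n, 1)
    return (q_f @ kv) / normalizer


q, k, v = (0.1 * torch.randn(1024, 64) for _ in range(3))
exact = torch.softmax((q @ k.T) / 64 ** 0.5, dim=-1) @ v
print((linear_attention(q, k, v) - exact).abs().mean())  # approximation error
```

The estimate tightens as the number of random features grows; the paper's orthogonal features are designed to reduce the variance of this estimator further.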
arXiv Detail & Related papers (2020-06-05T17:09:16Z)
- Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing [112.2208052057002]
We propose Funnel-Transformer which gradually compresses the sequence of hidden states to a shorter one.
With comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks.
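The compression idea can be sketched in a few lines: after each block of layers, the hidden-state sequence is pooled to half its length, so later blocks operate on progressively shorter sequences. The pool size, depth, and plain mean pooling here are illustrative assumptions; the actual Funnel-Transformer also adds a decoder to recover token-level outputs.

```python
import torch
import torch.nn as nn

d_model = 64
blocks = nn.ModuleList(
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
        num_layers=2,
    )
    for _ in range(3)
)


def funnel_like(x):  # x: (batch, seq_len, d_model)
    for i, block in enumerate(blocks):
        x = block(x)
        if i < len(blocks) - 1:
            # Strided mean pooling halves the hidden-state sequence between blocks.
            x = nn.functional.avg_pool1d(x.transpose(1, 2), 2, stride=2).transpose(1, 2)
    return x


print(funnel_like(torch.randn(2, 64, d_model)).shape)  # torch.Size([2, 16, 64])
```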
arXiv Detail & Related papers (2020-06-05T05:16:23Z)
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
- Addressing Some Limitations of Transformers with Feedback Memory [51.94640029417114]
Transformers have been successfully applied to sequential, auto-regressive tasks despite being feedforward networks.
We propose the Feedback Transformer architecture that exposes all previous representations to all future representations.
We demonstrate on a variety of benchmarks in language modeling, machine translation, and reinforcement learning that the increased representation capacity can create small, shallow models with much stronger performance than comparable Transformers.
arXiv Detail & Related papers (2020-02-21T16:37:57Z)
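The feedback idea can be sketched as follows: the input is processed one position at a time, the per-step outputs of all layers are merged into a single memory vector, and every layer at future steps attends to that shared memory, so all previous representations are exposed to all future ones. The mean merge and the use of stock decoder layers are assumptions for illustration, not the exact Feedback Transformer parameterization.

```python
import torch
import torch.nn as nn


class FeedbackSketch(nn.Module):
    def __init__(self, d_model=64, n_heads=4, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )

    def forward(self, x):  # x: (batch, seq_len, d_model), processed step by step
        memory, outputs = [], []
        for t in range(x.size(1)):
            h = x[:, t:t + 1]                              # current position only
            mem = torch.cat(memory, dim=1) if memory else h
            step_states = []
            for layer in self.layers:
                # Every layer cross-attends to the shared memory of past steps.
                h = layer(h, mem)
                step_states.append(h)
            # Merge all layers' outputs for this step into one memory vector,
            # exposing them to every layer at every future step.
            memory.append(torch.stack(step_states).mean(dim=0))
            outputs.append(h)
        return torch.cat(outputs, dim=1)


print(FeedbackSketch()(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```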