Continual Transformers: Redundancy-Free Attention for Online Inference
- URL: http://arxiv.org/abs/2201.06268v1
- Date: Mon, 17 Jan 2022 08:20:09 GMT
- Title: Continual Transformers: Redundancy-Free Attention for Online Inference
- Authors: Lukas Hedegaard and Arian Bakhtiarnia and Alexandros Iosifidis
- Abstract summary: We propose novel formulations of the Scaled Dot-Product Attention, which enable Transformers to perform efficient online token-by-token inference in a continual input stream.
Our modification is purely to the order of computations, while the produced outputs and learned weights are identical to those of the original Multi-Head Attention.
- Score: 86.3361797111839
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers are attention-based sequence transduction models, which have
found widespread success in Natural Language Processing and Computer Vision
applications. Yet, Transformers in their current form are inherently limited to
operating on whole token sequences rather than on one token at a time.
Consequently, their use during online inference entails considerable redundancy
due to the overlap in successive token sequences. In this work, we propose
novel formulations of the Scaled Dot-Product Attention, which enable
Transformers to perform efficient online token-by-token inference in a
continual input stream. Importantly, our modification is purely to the order of
computations, while the produced outputs and learned weights are identical to
those of the original Multi-Head Attention. To validate our approach, we
conduct experiments on visual, audio, and audio-visual classification and
detection tasks, i.e. Online Action Detection on THUMOS14 and TVSeries and
Online Audio Classification on GTZAN, with remarkable results. Our continual
one-block transformers reduce the floating point operations by 63.5x and 51.5x
in the Online Action Detection and Audio Classification experiments,
respectively, at similar predictive performance.
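The key computational idea is that, for a sliding window over a continual stream, the key and value projections of previously seen tokens can be cached and reused, so each new time step only requires projecting the newest token and computing a single attention row. Below is a minimal NumPy sketch of that caching idea for the single-output case; it is an illustration under simplified assumptions (one head, a plain list as the cache, no retroactive output updates), not the authors' implementation, and the class and parameter names are invented for the example.

```python
import numpy as np

class ContinualSingleOutputAttention:
    """Sketch of single-head scaled dot-product attention over a sliding
    window, where the key/value projections of earlier tokens are cached so
    that each step only projects the newest token and computes one attention
    row (illustrative only; names and structure are not from the paper)."""

    def __init__(self, w_q, w_k, w_v, window_size):
        self.w_q, self.w_k, self.w_v = w_q, w_k, w_v   # (d_model, d_k) projection matrices
        self.window = window_size
        self.k_cache, self.v_cache = [], []            # cached key/value rows

    def step(self, x_new):
        # Project only the newest token; earlier key/value rows are reused as-is.
        self.k_cache.append(x_new @ self.w_k)
        self.v_cache.append(x_new @ self.w_v)
        if len(self.k_cache) > self.window:            # drop the token leaving the window
            self.k_cache.pop(0)
            self.v_cache.pop(0)

        q = x_new @ self.w_q                           # query for the newest token only
        K = np.stack(self.k_cache)                     # (n, d_k)
        V = np.stack(self.v_cache)                     # (n, d_k)
        scores = K @ q / np.sqrt(K.shape[-1])          # one attention row, shape (n,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V                             # attention output for the new token


# Usage: feed one token at a time from a continual stream.
d_model, d_k = 16, 8
rng = np.random.default_rng(0)
attn = ContinualSingleOutputAttention(
    rng.normal(size=(d_model, d_k)),
    rng.normal(size=(d_model, d_k)),
    rng.normal(size=(d_model, d_k)),
    window_size=32,
)
for _ in range(5):
    out = attn.step(rng.normal(size=d_model))          # one (d_k,) output per incoming token
```

Per step, the sketch computes one attention row over the n cached tokens instead of a full n-by-n attention map for the whole window, which illustrates the kind of redundancy the paper's reformulation removes.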
Related papers
- Mitigating Over-smoothing in Transformers via Regularized Nonlocal Functionals [31.328766460487355]
We show that self-attention layers in transformers minimize a functional which promotes smoothness, thereby causing token uniformity.
We propose a novel regularizer that penalizes the norm of the difference between the smooth output tokens from self-attention and the input tokens to preserve the fidelity of the tokens.
We empirically demonstrate the advantages of the resulting model, NeuTRENO, over baseline transformers and state-of-the-art methods in reducing the over-smoothing of token representations. A minimal sketch of such a fidelity penalty is given below.
arXiv Detail & Related papers (2023-12-01T17:52:47Z)
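To make the regularizer concrete, here is a hedged sketch of a fidelity penalty in the spirit described above: it penalizes the squared norm of the difference between the self-attention output tokens and the input tokens, and is added to the task loss during training. The coefficient, the choice of squared Frobenius norm, and the variable names are assumptions for illustration, not the exact NeuTRENO formulation.

```python
import numpy as np

def fidelity_penalty(attn_out, attn_in, weight=0.1):
    """Illustrative regularizer: penalize the squared norm of the difference
    between self-attention outputs and inputs to discourage token
    representations from collapsing toward uniformity. `weight` is a
    hypothetical loss coefficient, not a value from the paper."""
    return weight * np.sum((attn_out - attn_in) ** 2)

# Usage: 128 tokens with 64-dimensional embeddings; the "output" here is a
# stand-in for a smoothing self-attention layer.
rng = np.random.default_rng(0)
x = rng.normal(size=(128, 64))
y = 0.9 * x + 0.1 * x.mean(axis=0)   # toy smoothed output
task_loss = 0.0                      # placeholder for the downstream task loss
total_loss = task_loss + fidelity_penalty(y, x)
```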
- Ring Attention with Blockwise Transformers for Near-Infinite Context [88.61687950039662]
We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices.
Our approach enables training and inference on sequences up to device-count times longer than those achievable by prior memory-efficient Transformers. A simplified blockwise-attention sketch is given below.
arXiv Detail & Related papers (2023-10-03T08:44:50Z)
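The central ingredient is computing attention block by block so the full sequence never has to be held at once. The sketch below shows single-query blockwise attention with a numerically stable running softmax on a single host; the ring of devices, the blockwise feedforward, and the communication schedule that give Ring Attention its scaling are omitted, and all names are illustrative.

```python
import numpy as np

def blockwise_attention(q, k_blocks, v_blocks):
    """Single-query attention accumulated one key/value block at a time with a
    running (numerically stable) softmax, so only one block is in memory at a
    time. Ring Attention additionally shards the blocks across devices and
    rotates them in a ring; that part is not shown here."""
    d = q.shape[-1]
    m = -np.inf                  # running max of attention scores
    num = np.zeros(d)            # running weighted sum of values
    den = 0.0                    # running softmax normalizer
    for K, V in zip(k_blocks, v_blocks):
        s = K @ q / np.sqrt(d)               # scores for this block, shape (block,)
        m_new = max(m, float(s.max()))
        scale = np.exp(m - m_new)            # rescale the previous accumulators
        p = np.exp(s - m_new)
        num = num * scale + p @ V
        den = den * scale + p.sum()
        m = m_new
    return num / den

# Usage: 4 blocks of 256 keys/values, 64-dimensional head; the result equals
# full attention over all 1024 keys computed in one shot.
rng = np.random.default_rng(0)
q = rng.normal(size=64)
k_blocks = [rng.normal(size=(256, 64)) for _ in range(4)]
v_blocks = [rng.normal(size=(256, 64)) for _ in range(4)]
out = blockwise_attention(q, k_blocks, v_blocks)
```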
- COMEDIAN: Self-Supervised Learning and Knowledge Distillation for Action Spotting using Transformers [1.894259749028573]
We present COMEDIAN, a novel pipeline to initialize transformers for action spotting.
Our results highlight several advantages of our pretraining pipeline, including improved performance and faster convergence compared to non-pretrained models.
arXiv Detail & Related papers (2023-09-03T20:50:53Z)
- Chunk, Align, Select: A Simple Long-sequence Processing Method for Transformers [24.109312575970456]
We propose a simple framework to enable off-the-shelf pre-trained transformers to process much longer sequences.
Our method divides each long-sequence input into a batch of chunks, then aligns the inter-chunk information during the encoding steps.
We learn an effective hidden selection policy, which regards the decoders of transformers as environments. A minimal sketch of the chunking step is given below.
arXiv Detail & Related papers (2023-08-25T05:52:05Z)
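As a rough illustration of the first step only, the sketch below splits a long token sequence into overlapping fixed-size chunks that an off-the-shelf encoder could process as a batch. The chunk size and overlap are invented defaults, and the inter-chunk alignment and the learned hidden-selection policy described above are not reproduced.

```python
from typing import List

def chunk_sequence(tokens: List[int], chunk_size: int = 512, overlap: int = 64) -> List[List[int]]:
    """Split a long token sequence into overlapping fixed-size chunks so that a
    pre-trained encoder with a limited input length can process them as a
    batch. chunk_size and overlap are illustrative defaults."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    stride = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, max(len(tokens) - overlap, 1), stride)]

# Usage: a 2000-token input becomes 5 chunks of at most 512 tokens each.
chunks = chunk_sequence(list(range(2000)))
print([len(c) for c in chunks])   # [512, 512, 512, 512, 208]
```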
- CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z)
- Transformers in Action: Weakly Supervised Action Segmentation [81.18941007536468]
We show how to apply transformers to improve action alignment accuracy over the equivalent RNN-based models.
We also propose a supplemental transcript embedding approach to select transcripts more quickly at inference time.
We evaluate our proposed methods across the benchmark datasets to better understand the applicability of transformers.
arXiv Detail & Related papers (2022-01-14T21:15:58Z)
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.