Continual Transformers: Redundancy-Free Attention for Online Inference
- URL: http://arxiv.org/abs/2201.06268v1
- Date: Mon, 17 Jan 2022 08:20:09 GMT
- Title: Continual Transformers: Redundancy-Free Attention for Online Inference
- Authors: Lukas Hedegaard and Arian Bakhtiarnia and Alexandros Iosifidis
- Abstract summary: We propose novel formulations of the Scaled Dot-Product Attention, which enable Transformers to perform efficient online token-by-token inference in a continual input stream.
Our modification is purely to the order of computations, while the produced outputs and learned weights are identical to those of the original Multi-Head Attention.
- Score: 86.3361797111839
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers are attention-based sequence transduction models, which have
found widespread success in Natural Language Processing and Computer Vision
applications. Yet, Transformers in their current form are inherently limited to
operating on whole token sequences rather than on one token at a time.
Consequently, their use during online inference entails considerable redundancy
due to the overlap in successive token sequences. In this work, we propose
novel formulations of the Scaled Dot-Product Attention, which enable
Transformers to perform efficient online token-by-token inference in a
continual input stream. Importantly, our modification is purely to the order of
computations, while the produced outputs and learned weights are identical to
those of the original Multi-Head Attention. To validate our approach, we
conduct experiments on visual, audio, and audio-visual classification and
detection tasks, i.e. Online Action Detection on THUMOS14 and TVSeries and
Online Audio Classification on GTZAN, with remarkable results. Our continual
one-block transformers reduce the floating point operations by 63.5x and
51.5x, respectively, in the Online Action Detection and Audio Classification
experiments, at similar predictive performance.
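To make the reordering concrete, below is a minimal sketch of continual single-output scaled dot-product attention over a fixed window of cached keys and values. The window length n, the single-head NumPy formulation, and all names are assumptions for illustration; the paper's retroactive variant and the authors' actual caching scheme are not reproduced here.

```python
# Minimal sketch: single-output scaled dot-product attention over a sliding
# window of cached keys/values, so each new token costs O(n*d) instead of
# recomputing full O(n^2*d) attention over the whole window.
import numpy as np

class ContinualSingleOutputAttention:
    def __init__(self, n, d):
        self.n, self.d = n, d            # window length, embedding dimension
        self.keys = np.zeros((0, d))     # FIFO cache of the last n keys
        self.values = np.zeros((0, d))   # FIFO cache of the last n values

    def step(self, q, k, v):
        """Consume one token's (query, key, value) and return the attention
        output for that token over the current window."""
        # Push the new key/value; evict the oldest once the window is full.
        self.keys = np.vstack([self.keys, k])[-self.n:]
        self.values = np.vstack([self.values, v])[-self.n:]
        # Standard scaled dot-product attention, computed for the newest
        # query only.
        scores = self.keys @ q / np.sqrt(self.d)   # (window,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ self.values               # (d,)
```

Stepping through a stream token by token, each call returns the same output for the newest position as recomputing full attention over the window would, which is exactly the redundancy the paper's reordering removes.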
Related papers
- Continual Low-Rank Scaled Dot-product Attention [67.11704350478475]
We introduce a new formulation of the Scaled Dot-product Attention based on the Nyström approximation that is suitable for Continual Inference (a rough sketch of the underlying Nyström approximation appears after this list).
In experiments on Online Audio Classification and Online Action Detection tasks, the proposed Continual Scaled Dot-product Attention can lower the number of operations by up to three orders of magnitude.
arXiv Detail & Related papers (2024-12-04T11:05:01Z) - Ring Attention with Blockwise Transformers for Near-Infinite Context [88.61687950039662]
We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices.
Our approach enables training and inference of sequences that are up to device count times longer than those achievable by prior memory-efficient Transformers (a single-process sketch of the blockwise attention computation appears after this list).
arXiv Detail & Related papers (2023-10-03T08:44:50Z) - COMEDIAN: Self-Supervised Learning and Knowledge Distillation for Action
Spotting using Transformers [1.894259749028573]
We present COMEDIAN, a novel pipeline to initialize transformers for action spotting.
Our results highlight several advantages of our pretraining pipeline, including improved performance and faster convergence compared to non-pretrained models.
arXiv Detail & Related papers (2023-09-03T20:50:53Z) - Chunk, Align, Select: A Simple Long-sequence Processing Method for Transformers [24.109312575970456]
We propose a simple framework to enable off-the-shelf pre-trained transformers to process much longer sequences.
Our method divides each long-sequence input into a batch of chunks, then aligns the inter-chunk information during the encoding steps.
We learn an effective hidden selection policy, which regards the decoders of transformers as environments.
arXiv Detail & Related papers (2023-08-25T05:52:05Z) - Transformers in Action: Weakly Supervised Action Segmentation [81.18941007536468]
We show how to apply transformers to improve action alignment accuracy over the equivalent RNN-based models.
We also propose a supplemental transcript embedding approach to select transcripts more quickly at inference time.
We evaluate our proposed methods across the benchmark datasets to better understand the applicability of transformers.
arXiv Detail & Related papers (2022-01-14T21:15:58Z) - Transformers Solve the Limited Receptive Field for Monocular Depth
Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper that applies transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z) - Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
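As referenced in the Continual Low-Rank Scaled Dot-product Attention entry above, the following is a rough sketch of the plain (non-continual) Nyström approximation of attention. The mean-pooled landmarks, the num_landmarks parameter, and the assumption that the sequence length divides evenly into segments are illustrative choices, not details taken from that paper.

```python
# Rough sketch of Nystrom-approximated scaled dot-product attention:
# softmax(QK^T) is approximated via m landmark queries/keys, here taken as
# segment means, reducing the cost from O(n^2) to roughly O(n*m).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nystrom_attention(Q, K, V, num_landmarks=8):
    n, d = Q.shape                      # assumes n % num_landmarks == 0
    scale = np.sqrt(d)
    Q_l = Q.reshape(num_landmarks, n // num_landmarks, d).mean(axis=1)
    K_l = K.reshape(num_landmarks, n // num_landmarks, d).mean(axis=1)
    F = softmax(Q @ K_l.T / scale)      # (n, m)
    A = softmax(Q_l @ K_l.T / scale)    # (m, m)
    B = softmax(Q_l @ K.T / scale)      # (m, n)
    # Nystrom approximation: softmax(QK^T/sqrt(d)) ~= F @ pinv(A) @ B
    return F @ np.linalg.pinv(A) @ (B @ V)   # (n, d)
```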
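Similarly, the Ring Attention entry rests on blockwise attention with an online softmax. The sketch below is a single-process simplification that visits keys and values one block at a time without materializing the full attention matrix; the ring communication across devices and the blockwise feedforward are not modeled, and the block size is an arbitrary illustrative choice.

```python
# Single-process sketch of blockwise attention with an online softmax:
# keys/values are processed one block at a time, keeping only a running
# row-wise maximum and denominator, so no n-by-n score matrix is stored.
import numpy as np

def blockwise_attention(Q, K, V, block=128):
    n, d = Q.shape
    scale = np.sqrt(d)
    out = np.zeros_like(Q)
    m = np.full(n, -np.inf)      # running row-wise max of the scores
    l = np.zeros(n)              # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Q @ Kb.T / scale                     # (n, block) scores
        m_new = np.maximum(m, s.max(axis=1))
        corr = np.exp(m - m_new)                 # rescale earlier partial sums
        p = np.exp(s - m_new[:, None])
        l = l * corr + p.sum(axis=1)
        out = out * corr[:, None] + p @ Vb
        m = m_new
    return out / l[:, None]                      # equals softmax(QK^T/scale) @ V
```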