Token Recycling for Efficient Sequential Inference with Vision
Transformers
- URL: http://arxiv.org/abs/2311.15335v1
- Date: Sun, 26 Nov 2023 15:39:57 GMT
- Title: Token Recycling for Efficient Sequential Inference with Vision
Transformers
- Authors: Jan Olszewski, Dawid Rymarczyk, Piotr Wójcik, Mateusz Pach,
and Bartosz Zieliński
- Abstract summary: Vision Transformers (ViTs) surpass Convolutional Neural Networks in processing incomplete inputs because they do not require the imputation of missing values.
However, ViTs are computationally inefficient because they perform a full forward pass each time a piece of new sequential information arrives.
We introduce the TOken REcycling (TORE) modification for ViT inference, which can be used with any architecture.
- Score: 3.9906557901972897
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers (ViTs) surpass Convolutional Neural Networks in
processing incomplete inputs because they do not require the imputation of
missing values. Therefore, ViTs are well suited for sequential decision-making,
e.g. in the Active Visual Exploration problem. However, they are
computationally inefficient because they perform a full forward pass each time
a piece of new sequential information arrives.
To reduce this computational inefficiency, we introduce the TOken REcycling
(TORE) modification for ViT inference, which can be used with any
architecture. TORE divides the ViT into two parts: an iterator and an aggregator. The
iterator processes each piece of sequential information separately into midway tokens,
which are cached. The aggregator then processes the midway tokens jointly to obtain the
prediction. This way, we can reuse the results of the computations performed by the
iterator.
In addition to efficient sequential inference, we propose a complementary
training policy, which significantly reduces the computational burden
associated with sequential decision-making while achieving state-of-the-art
accuracy.
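Below is a minimal sketch of the iterator/aggregator split described in the abstract, written in PyTorch. The wrapper class, the split depth, and the caching interface are illustrative assumptions rather than the authors' implementation, which also has to handle details such as the class token, positional embeddings, and the complementary training policy.

```python
# Illustrative sketch of a TORE-style iterator/aggregator split (not the
# authors' code): the first k transformer blocks act as the iterator and are
# run once per new glimpse, their outputs ("midway tokens") are cached, and
# the remaining blocks act as the aggregator, re-run over all cached tokens.
import torch
import torch.nn as nn


class ToreStyleWrapper(nn.Module):
    def __init__(self, blocks: nn.ModuleList, head: nn.Module, split_depth: int):
        super().__init__()
        self.iterator = blocks[:split_depth]      # run once per glimpse
        self.aggregator = blocks[split_depth:]    # run on all cached tokens
        self.head = head
        self.cache = []                           # cached midway tokens

    @torch.no_grad()
    def add_glimpse(self, tokens: torch.Tensor) -> None:
        # tokens: (batch, n_tokens, dim) embedding of a single new glimpse
        for blk in self.iterator:
            tokens = blk(tokens)
        self.cache.append(tokens)                 # reused on later steps

    @torch.no_grad()
    def predict(self) -> torch.Tensor:
        # The aggregator jointly attends over every cached midway token.
        joint = torch.cat(self.cache, dim=1)
        for blk in self.aggregator:
            joint = blk(joint)
        return self.head(joint.mean(dim=1))       # pooled prediction


# Toy usage with standard encoder layers standing in for ViT blocks.
dim, depth, split = 64, 4, 2
blocks = nn.ModuleList(
    nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True) for _ in range(depth)
)
model = ToreStyleWrapper(blocks, head=nn.Linear(dim, 10), split_depth=split)
for _ in range(3):                                # three sequential glimpses
    model.add_glimpse(torch.randn(1, 16, dim))    # only the iterator runs here
print(model.predict().shape)                      # torch.Size([1, 10])
```

With this split, each new glimpse pays only for the iterator's forward pass; previously computed midway tokens are reused, and only the aggregator is recomputed at prediction time.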
Related papers
- Ring Attention with Blockwise Transformers for Near-Infinite Context [88.61687950039662]
We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices.
Our approach enables training and inference of sequences that are up to device count times longer than those achievable by prior memory-efficient Transformers.
arXiv Detail & Related papers (2023-10-03T08:44:50Z)
- InvKA: Gait Recognition via Invertible Koopman Autoencoder [15.718065380333718]
Most gait recognition methods suffer from poor interpretability and high computational cost.
To improve interpretability, we investigate gait features in the embedding space based on Koopman operator theory.
To reduce the computational cost of our algorithm, we use a reversible autoencoder to reduce the model size and eliminate convolutional layers.
arXiv Detail & Related papers (2023-09-26T08:53:54Z)
- Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers [27.029600581635957]
We describe a method for identifying and re-processing only those tokens that have changed significantly over time (a sketch of this token-gating idea appears after the related-papers list).
We evaluate our method on large-scale datasets for video object detection (ImageNet VID) and action recognition (EPIC-Kitchens 100).
arXiv Detail & Related papers (2023-08-25T17:10:12Z)
- Segmented Recurrent Transformer: An Efficient Sequence-to-Sequence Model [10.473819332984005]
We propose a segmented recurrent transformer (SRformer) that combines segmented (local) attention with recurrent attention.
The proposed model achieves 6-22% higher ROUGE-1 scores than a segmented transformer and outperforms other recurrent transformer approaches.
arXiv Detail & Related papers (2023-05-24T03:47:22Z)
- Skip-Attention: Improving Vision Transformers by Paying Less Attention [55.47058516775423]
Vision Transformers (ViTs) use expensive self-attention operations in every layer.
We propose SkipAt, a method to reuse self-attention from preceding layers to approximate attention at one or more subsequent layers.
We show the effectiveness of our method in image classification and self-supervised learning on ImageNet-1K, semantic segmentation on ADE20K, image denoising on SIDD, and video denoising on DAVIS.
arXiv Detail & Related papers (2023-01-05T18:59:52Z)
- Temporal Latent Bottleneck: Synthesis of Fast and Slow Processing Mechanisms in Sequence Learning [85.95599675484341]
Recurrent neural networks have a strong inductive bias towards learning temporally compressed representations.
Transformers have little inductive bias towards learning temporally compressed representations.
arXiv Detail & Related papers (2022-05-30T00:12:33Z)
- Token Pooling in Vision Transformers [37.11990688046186]
In vision transformers, self-attention is not the major bottleneck, e.g., more than 80% of the computation is spent on fully-connected layers.
We propose a novel token downsampling method, called Token Pooling, efficiently exploiting redundancies in the images and intermediate token representations.
Our experiments show that Token Pooling significantly improves the cost-accuracy trade-off over state-of-the-art downsampling methods.
arXiv Detail & Related papers (2021-10-08T02:22:50Z)
- TEACHTEXT: CrossModal Generalized Distillation for Text-Video Retrieval [103.85002875155551]
We propose a novel generalized distillation method, TeachText, for exploiting large-scale language pretraining.
We extend our method to video side modalities and show that we can effectively reduce the number of used modalities at test time.
Our approach advances the state of the art on several video retrieval benchmarks by a significant margin and adds no computational overhead at test time.
arXiv Detail & Related papers (2021-04-16T17:55:28Z)
- Decoupling Representation Learning from Reinforcement Learning [89.82834016009461]
We introduce an unsupervised learning task called Augmented Temporal Contrast (ATC).
ATC trains a convolutional encoder to associate pairs of observations separated by a short time difference.
In online RL experiments, we show that training the encoder exclusively using ATC matches or outperforms end-to-end RL.
arXiv Detail & Related papers (2020-09-14T19:11:13Z)
- Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing [112.2208052057002]
We propose Funnel-Transformer which gradually compresses the sequence of hidden states to a shorter one.
With comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks.
arXiv Detail & Related papers (2020-06-05T05:16:23Z)
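Several of the papers above reuse computation across time in a spirit similar to TORE. As a purely illustrative example, the following hypothetical sketch shows the token-gating idea mentioned in the Eventful Transformers entry (re-processing only tokens whose embeddings changed noticeably since the previous step). The threshold, function name, and toy block are assumptions for illustration, not taken from that paper.

```python
# Hypothetical illustration of temporal token gating (in the spirit of the
# Eventful Transformers entry above): only tokens whose embeddings changed
# more than a threshold since the previous frame are re-processed; the rest
# keep their cached outputs.
import torch


def gated_update(block, prev_tokens, new_tokens, cached_out, threshold=0.1):
    """Re-run `block` only on tokens that changed; keep cached outputs elsewhere.

    prev_tokens, new_tokens, cached_out: (batch, n_tokens, dim)
    """
    delta = (new_tokens - prev_tokens).norm(dim=-1)   # (batch, n_tokens)
    changed = delta > threshold                       # boolean gate per token
    out = cached_out.clone()
    # NOTE: a real implementation would gather only the changed tokens before
    # running the block; here the whole block is run and masked for simplicity.
    fresh = block(new_tokens)
    out[changed] = fresh[changed]
    return out, changed


# Toy usage with a per-token MLP standing in for a transformer sub-block.
torch.manual_seed(0)
block = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.GELU())
prev = torch.randn(1, 4, 8)
new = prev.clone()
new[0, 2] += 1.0                                      # only token 2 changes
cached = block(prev)
out, changed = gated_update(block, prev, new, cached)
print(changed)                                        # tensor([[False, False,  True, False]])
```

A practical implementation would gather and process only the changed tokens so that the skipped computation is actually saved rather than merely masked out.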