Token Recycling for Efficient Sequential Inference with Vision
  Transformers
- URL: http://arxiv.org/abs/2311.15335v1
- Date: Sun, 26 Nov 2023 15:39:57 GMT
- Title: Token Recycling for Efficient Sequential Inference with Vision
  Transformers
- Authors: Jan Olszewski and Dawid Rymarczyk and Piotr Wójcik and Mateusz Pach
  and Bartosz Zieliński
- Abstract summary: Vision Transformers (ViTs) surpass Convolutional Neural Networks in processing incomplete inputs because they do not require the imputation of missing values.
ViTs are computationally inefficient because they perform a full forward pass each time a piece of new sequential information arrives.
We introduce the TOken REcycling (TORE) modification for the ViT inference, which can be used with any architecture.
- Score: 3.9906557901972897
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract:   Vision Transformers (ViTs) surpass Convolutional Neural Networks in
processing incomplete inputs because they do not require the imputation of
missing values. Therefore, ViTs are well suited for sequential decision-making,
e.g. in the Active Visual Exploration problem. However, they are
computationally inefficient because they perform a full forward pass each time
a piece of new sequential information arrives.
  To reduce this computational inefficiency, we introduce the TOken REcycling
(TORE) modification for the ViT inference, which can be used with any
architecture. TORE divides ViT into two parts, an iterator and an aggregator.
The iterator processes sequential information separately into midway tokens,
which are cached. The aggregator processes midway tokens jointly to obtain the
prediction. This way, we can reuse the results of computations made by
the iterator.
  In addition to efficient sequential inference, we propose a complementary
training policy, which significantly reduces the computational burden
associated with sequential decision-making while achieving state-of-the-art
accuracy.
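
The iterator/aggregator split described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the class `RecyclingInference`, the functions `toy_iterator`, `toy_aggregator`, and `naive_predictions`, and the arithmetic they perform are all hypothetical stand-ins chosen only to show the caching pattern. In a real ViT the iterator and aggregator would be groups of transformer blocks, and a full forward pass would process all tokens jointly, so its outputs would generally differ from the recycled ones. Here the baseline simply re-runs the same toy iterator on every glimpse at every step, which makes the saved work easy to count.

```python
from typing import Callable, List, Sequence, Tuple

Token = List[float]
Glimpse = Sequence[Token]


class RecyclingInference:
    """Toy cache of midway tokens (hypothetical, for illustration only)."""

    def __init__(
        self,
        iterator: Callable[[Glimpse], List[Token]],
        aggregator: Callable[[List[Token]], float],
    ) -> None:
        self.iterator = iterator
        self.aggregator = aggregator
        self.midway_cache: List[Token] = []  # midway tokens from earlier steps
        self.iterator_calls = 0

    def step(self, glimpse: Glimpse) -> float:
        # Only the newly arrived glimpse goes through the iterator.
        self.midway_cache.extend(self.iterator(glimpse))
        self.iterator_calls += 1
        # The aggregator sees all cached midway tokens jointly.
        return self.aggregator(self.midway_cache)


def toy_iterator(glimpse: Glimpse) -> List[Token]:
    # Stand-in for the early transformer blocks: doubles every value.
    return [[2.0 * x for x in token] for token in glimpse]


def toy_aggregator(tokens: List[Token]) -> float:
    # Stand-in for the late blocks plus head: mean over all values.
    flat = [x for token in tokens for x in token]
    return sum(flat) / len(flat)


def naive_predictions(
    glimpses: Sequence[Glimpse],
    iterator: Callable[[Glimpse], List[Token]],
    aggregator: Callable[[List[Token]], float],
) -> Tuple[List[float], int]:
    # Baseline without a cache: every step reprocesses all glimpses so far.
    calls = 0
    preds: List[float] = []
    for t in range(1, len(glimpses) + 1):
        tokens: List[Token] = []
        for glimpse in glimpses[:t]:
            tokens.extend(iterator(glimpse))
            calls += 1
        preds.append(aggregator(tokens))
    return preds, calls


# Three glimpses arriving one at a time, one two-dimensional token each.
glimpses = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]]]

recycler = RecyclingInference(toy_iterator, toy_aggregator)
recycled_preds = [recycler.step(g) for g in glimpses]
naive_preds, naive_calls = naive_predictions(glimpses, toy_iterator, toy_aggregator)

print(recycled_preds, recycler.iterator_calls)  # [3.0, 5.0, 7.0] 3
print(naive_preds, naive_calls)                 # [3.0, 5.0, 7.0] 6
```

In this toy setting, T sequential steps cost T iterator calls with the cache and T(T+1)/2 without it, while the aggregator still runs once per step in both cases.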
 
      
Related papers
- LookWhere? Efficient Visual Recognition by Learning Where to Look and What to See from Self-Supervision [10.461453853510964]
 Vision transformers are ever larger, more accurate, and more expensive to compute.
We turn to adaptive computation to cope with this cost by learning to predict where to compute.
Our LookWhere method divides computation between a low-resolution selector and a high-resolution extractor without ever processing the full high-resolution input.
 arXiv  Detail & Related papers  (2025-05-23T15:56:35Z)
- Image Coding for Machines via Feature-Preserving Rate-Distortion Optimization [27.97760974010369]
 We show an approach to reduce the effect of compression on a task loss using the distance between features as a distortion metric.
We simplify the RDO formulation to make the distortion term computable using block-based encoders.
We show up to 10% bit-rate savings for the same computer vision accuracy compared to RDO based on SSE.
 arXiv  Detail & Related papers  (2025-04-03T02:11:26Z)
- FullTransNet: Full Transformer with Local-Global Attention for Video Summarization [16.134118247239527]
 We propose a transformer-like architecture named FullTransNet for video summarization.
It uses a full transformer with an encoder-decoder structure as an alternative architecture for video summarization.
Our model achieves F-scores of 54.4% and 63.9%, respectively, while maintaining relatively low computational and memory requirements.
 arXiv  Detail & Related papers  (2025-01-01T16:07:27Z)
- Ring Attention with Blockwise Transformers for Near-Infinite Context [88.61687950039662]
 We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices.
Our approach enables training and inference of sequences that are up to device count times longer than those achievable by prior memory-efficient Transformers.
 arXiv  Detail & Related papers  (2023-10-03T08:44:50Z)
- InvKA: Gait Recognition via Invertible Koopman Autoencoder [15.718065380333718]
 Most gait recognition methods suffer from poor interpretability and high computational cost.
To improve interpretability, we investigate gait features in the embedding space based on Koopman operator theory.
To reduce the computational cost of our algorithm, we use a reversible autoencoder to reduce the model size and eliminate convolutional layers.
 arXiv  Detail & Related papers  (2023-09-26T08:53:54Z)
- Eventful Transformers: Leveraging Temporal Redundancy in Vision
  Transformers [27.029600581635957]
 We describe a method for identifying and re-processing only those tokens that have changed significantly over time.
We evaluate our method on large-scale datasets for video object detection (ImageNet VID) and action recognition (EPIC-Kitchens 100).
 arXiv  Detail & Related papers  (2023-08-25T17:10:12Z)
- Segmented Recurrent Transformer: An Efficient Sequence-to-Sequence Model [10.473819332984005]
 We propose a segmented recurrent transformer (SRformer) that combines segmented (local) attention with recurrent attention.
The proposed model achieves 6-22% higher ROUGE1 scores than a segmented transformer and outperforms other recurrent transformer approaches.
 arXiv  Detail & Related papers  (2023-05-24T03:47:22Z)
- Skip-Attention: Improving Vision Transformers by Paying Less Attention [55.47058516775423]
 Vision transformers (ViTs) use computationally expensive self-attention operations in every layer.
We propose SkipAt, a method to reuse self-attention from preceding layers to approximate attention at one or more subsequent layers.
We show the effectiveness of our method in image classification and self-supervised learning on ImageNet-1K, semantic segmentation on ADE20K, image denoising on SIDD, and video denoising on DAVIS.
 arXiv  Detail & Related papers  (2023-01-05T18:59:52Z)
- Temporal Latent Bottleneck: Synthesis of Fast and Slow Processing
  Mechanisms in Sequence Learning [85.95599675484341]
 Recurrent neural networks have a strong inductive bias towards learning temporally compressed representations.
Transformers have little inductive bias towards learning temporally compressed representations.
 arXiv  Detail & Related papers  (2022-05-30T00:12:33Z)
- Token Pooling in Vision Transformers [37.11990688046186]
 In vision transformers, self-attention is not the major bottleneck, e.g., more than 80% of the computation is spent on fully-connected layers.
We propose a novel token downsampling method, called Token Pooling, efficiently exploiting redundancies in the images and intermediate token representations.
Our experiments show that Token Pooling significantly improves the cost-accuracy trade-off over the state-of-the-art downsampling.
 arXiv  Detail & Related papers  (2021-10-08T02:22:50Z)
- TEACHTEXT: CrossModal Generalized Distillation for Text-Video Retrieval [103.85002875155551]
 We propose a novel generalized distillation method, TeachText, for exploiting large-scale language pretraining.
We extend our method to video side modalities and show that we can effectively reduce the number of used modalities at test time.
Our approach advances the state of the art on several video retrieval benchmarks by a significant margin and adds no computational overhead at test time.
 arXiv  Detail & Related papers  (2021-04-16T17:55:28Z)
- Decoupling Representation Learning from Reinforcement Learning [89.82834016009461]
 We introduce an unsupervised learning task called Augmented Temporal Contrast (ATC).
ATC trains a convolutional encoder to associate pairs of observations separated by a short time difference.
In online RL experiments, we show that training the encoder exclusively using ATC matches or outperforms end-to-end RL.
 arXiv  Detail & Related papers  (2020-09-14T19:11:13Z)
- Funnel-Transformer: Filtering out Sequential Redundancy for Efficient
  Language Processing [112.2208052057002]
 We propose Funnel-Transformer which gradually compresses the sequence of hidden states to a shorter one.
With comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks.
 arXiv  Detail & Related papers  (2020-06-05T05:16:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     