Token Recycling for Efficient Sequential Inference with Vision
Transformers
- URL: http://arxiv.org/abs/2311.15335v1
- Date: Sun, 26 Nov 2023 15:39:57 GMT
- Title: Token Recycling for Efficient Sequential Inference with Vision
Transformers
- Authors: Jan Olszewski and Dawid Rymarczyk and Piotr Wójcik and Mateusz Pach
and Bartosz Zieliński
- Abstract summary: Vision Transformers (ViTs) outperform Convolutional Neural Networks in processing incomplete inputs because they do not require the imputation of missing values.
However, ViTs are computationally inefficient because they perform a full forward pass each time a piece of new sequential information arrives.
We introduce the TOken REcycling (TORE) modification for ViT inference, which can be used with any architecture.
- Score: 3.9906557901972897
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers (ViTs) outperform Convolutional Neural Networks in
processing incomplete inputs because they do not require the imputation of
missing values. Therefore, ViTs are well suited for sequential decision-making,
e.g. in the Active Visual Exploration problem. However, they are
computationally inefficient because they perform a full forward pass each time
a piece of new sequential information arrives.
To reduce this computational inefficiency, we introduce the TOken REcycling
(TORE) modification for ViT inference, which can be used with any
architecture. TORE divides the ViT into two parts: an iterator and an
aggregator. The iterator processes each piece of sequential information
separately into midway tokens, which are cached. The aggregator then processes
the midway tokens jointly to obtain the prediction. This way, we can reuse the
results of computations made by the iterator.
In addition to efficient sequential inference, we propose a complementary
training policy, which significantly reduces the computational burden
associated with sequential decision-making while achieving state-of-the-art
accuracy.
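
As a rough illustration of the iterator/aggregator split, the sketch below caches midway tokens per observation and reuses them at prediction time. It assumes a timm-style ViT exposing patch_embed, blocks, norm and head; the class name ToreViT, the split index k, the observe/predict methods and the mean pooling before the head are illustrative assumptions rather than the authors' implementation (positional embeddings and the class token are omitted for brevity).

    import torch
    import torch.nn as nn

    class ToreViT(nn.Module):
        """Minimal sketch of TORE-style inference; not the authors' code."""

        def __init__(self, vit: nn.Module, k: int):
            super().__init__()
            self.patch_embed = vit.patch_embed   # shared patch embedding
            self.iterator = vit.blocks[:k]       # run once per new observation
            self.aggregator = vit.blocks[k:]     # run jointly over the cache
            self.norm = vit.norm
            self.head = vit.head
            self.cache = []                      # cached midway tokens

        @torch.no_grad()
        def observe(self, glimpse: torch.Tensor) -> None:
            # Process one new piece of sequential input with the iterator only
            # and cache the resulting midway tokens for later reuse.
            tokens = self.patch_embed(glimpse)
            for blk in self.iterator:
                tokens = blk(tokens)
            self.cache.append(tokens)

        @torch.no_grad()
        def predict(self) -> torch.Tensor:
            # Aggregate all cached midway tokens jointly to get a prediction;
            # iterator computation for earlier observations is never repeated.
            tokens = torch.cat(self.cache, dim=1)
            for blk in self.aggregator:
                tokens = blk(tokens)
            return self.head(self.norm(tokens).mean(dim=1))

With this split, each new observation costs one pass through the iterator blocks plus one aggregator pass over the cache, rather than a full forward pass through the entire network for all tokens seen so far.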
Related papers
- ENACT: Entropy-based Clustering of Attention Input for Improving the Computational Performance of Object Detection Transformers [0.0]
Transformers demonstrate competitive precision on vision-based object detection.
We propose to cluster the transformer input on the basis of its entropy.
Clustering reduces the size of data given as input to the transformer and therefore reduces training time and GPU memory usage.
arXiv Detail & Related papers (2024-09-11T18:03:59Z)
- Sharing Key Semantics in Transformer Makes Efficient Image Restoration [148.22790334216117]
The self-attention mechanism, a cornerstone of Vision Transformers (ViTs), tends to encompass all global cues, even those from semantically unrelated objects or regions.
In this paper, we propose to boost image restoration performance by sharing key semantics via a Transformer for IR (i.e., SemanIR).
arXiv Detail & Related papers (2024-05-30T12:45:34Z)
- GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation [30.343504537684755]
Vision Transformers (ViTs) have revolutionized the field of computer vision, yet their deployment on resource-constrained devices remains challenging.
To expedite ViTs, token pruning and token merging approaches have been developed, which aim to reduce the number of tokens involved in computation.
We introduce a novel Graph-based Token Propagation (GTP) method to balance model efficiency and information preservation in efficient ViTs.
arXiv Detail & Related papers (2023-11-06T11:14:19Z)
- Incrementally-Computable Neural Networks: Efficient Inference for Dynamic Inputs [75.40636935415601]
Deep learning often faces the challenge of efficiently processing dynamic inputs, such as sensor data or user inputs.
We take an incremental computing approach, looking to reuse calculations as the inputs change.
We apply this approach to the transformer architecture, creating an efficient incremental inference algorithm with complexity proportional to the fraction of modified inputs.
arXiv Detail & Related papers (2023-07-27T16:30:27Z)
- Segmented Recurrent Transformer: An Efficient Sequence-to-Sequence Model [10.473819332984005]
We propose a segmented recurrent transformer (SRformer) that combines segmented (local) attention with recurrent attention.
The proposed model achieves 6-22% higher ROUGE-1 scores than a segmented transformer and outperforms other recurrent transformer approaches.
arXiv Detail & Related papers (2023-05-24T03:47:22Z)
- TAPIR: Learning Adaptive Revision for Incremental Natural Language Understanding with a Two-Pass Model [14.846377138993645]
Recent neural network-based approaches for incremental processing mainly use RNNs or Transformers.
A restart-incremental interface that repeatedly passes longer input prefixes can be used to obtain partial outputs, while providing the ability to revise.
We propose the Two-pass model for AdaPtIve Revision (TAPIR) and introduce a method to obtain an incremental supervision signal for learning an adaptive revision policy.
arXiv Detail & Related papers (2023-05-18T09:58:19Z)
- Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition [62.83832841523525]
We propose a fast and accurate parallel transformer, termed Paraformer.
It accurately predicts the number of output tokens and extracts hidden variables.
It can attain comparable performance to the state-of-the-art AR transformer, with more than 10x speedup.
arXiv Detail & Related papers (2022-06-16T17:24:14Z)
- Temporal Latent Bottleneck: Synthesis of Fast and Slow Processing Mechanisms in Sequence Learning [85.95599675484341]
Recurrent neural networks have a strong inductive bias towards learning temporally compressed representations, whereas Transformers have little such inductive bias.
arXiv Detail & Related papers (2022-05-30T00:12:33Z)
- Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length.
It brings a great benefit by scaling dimensions of depth/width/resolution/patch size without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
arXiv Detail & Related papers (2021-03-19T03:55:58Z)
- Accelerating Feedforward Computation via Parallel Nonlinear Equation Solving [106.63673243937492]
Feedforward computation, such as evaluating a neural network or sampling from an autoregressive model, is ubiquitous in machine learning.
We frame the task of feedforward computation as solving a system of nonlinear equations. We then propose to find the solution using a Jacobi or Gauss-Seidel fixed-point method, as well as hybrid methods of both.
Our method is guaranteed to give exactly the same values as the original feedforward computation with a reduced (or equal) number of parallelizable iterations, and hence reduced time given sufficient parallel computing power.
arXiv Detail & Related papers (2020-02-10T10:11:31Z)
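
To make the fixed-point formulation of the last entry concrete, here is a small sketch of the Jacobi variant: the chain s[t] = f[t-1](s[t-1]) is treated as a system of equations whose unknowns are all updated in parallel from the previous iterate. The helper names, the NumPy convergence test, and the toy affine layers are assumptions for illustration; the Gauss-Seidel and hybrid variants mentioned in the abstract are not shown.

    import numpy as np

    def sequential_feedforward(fs, s0):
        # Baseline: apply the layer functions one after another.
        states = [np.asarray(s0, dtype=float)]
        for f in fs:
            states.append(f(states[-1]))
        return states

    def jacobi_feedforward(fs, s0, max_iters=None):
        # Solve s[t] = fs[t-1](s[t-1]) for all t at once: every unknown state
        # is updated in parallel from the previous iterate, so one iteration
        # parallelizes across t, and after at most len(fs) iterations the
        # result equals the sequential computation exactly.
        T = len(fs)
        states = [np.asarray(s0, dtype=float) for _ in range(T + 1)]
        for _ in range(max_iters or T):
            new_states = [states[0]] + [fs[t](states[t]) for t in range(T)]
            if all(np.allclose(a, b) for a, b in zip(new_states, states)):
                return new_states
            states = new_states
        return states

    # Tiny usage example with three affine "layers".
    fs = [lambda x, w=w: w * x + 1.0 for w in (0.5, 2.0, -1.0)]
    assert np.allclose(sequential_feedforward(fs, 1.0)[-1],
                       jacobi_feedforward(fs, 1.0)[-1])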
This list is automatically generated from the titles and abstracts of the papers on this site.