TORE: Token Recycling in Vision Transformers for Efficient Active Visual Exploration
- URL: http://arxiv.org/abs/2311.15335v2
- Date: Mon, 25 Nov 2024 07:08:38 GMT
- Title: TORE: Token Recycling in Vision Transformers for Efficient Active Visual Exploration
- Authors: Jan Olszewski, Dawid Rymarczyk, Piotr Wójcik, Mateusz Pach, Bartosz Zieliński
- Abstract summary: Active Visual Exploration (AVE) optimizes the utilization of robotic resources in real-world scenarios by sequentially selecting the most informative observations.
We introduce a novel approach to AVE called TOken REcycling (TORE).
It divides the encoder into extractor and aggregator components. The extractor processes each observation separately, enabling the reuse of tokens passed to the aggregator.
- Score: 2.177039289023855
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Active Visual Exploration (AVE) optimizes the utilization of robotic resources in real-world scenarios by sequentially selecting the most informative observations. However, modern methods require a high computational budget due to processing the same observations multiple times through the autoencoder transformers. As a remedy, we introduce a novel approach to AVE called TOken REcycling (TORE). It divides the encoder into extractor and aggregator components. The extractor processes each observation separately, enabling the reuse of tokens passed to the aggregator. Moreover, to further reduce the computations, we decrease the decoder to only one block. Through extensive experiments, we demonstrate that TORE outperforms state-of-the-art methods while reducing computational overhead by up to 90%.
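The extractor/aggregator split described above can be pictured as caching each glimpse's tokens after the early encoder blocks, so later predictions only rerun the lighter aggregator. Below is a minimal, hypothetical PyTorch sketch of that idea; block counts, dimensions, and the glimpse interface are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ToreEncoderSketch(nn.Module):
    """Encoder split into an extractor (run once per glimpse, tokens cached)
    and an aggregator (rerun over all cached tokens). Illustrative only."""

    def __init__(self, dim=384, heads=6, extractor_blocks=8, aggregator_blocks=4):
        super().__init__()
        make_block = lambda n: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
            num_layers=n)
        self.extractor = make_block(extractor_blocks)
        self.aggregator = make_block(aggregator_blocks)
        self.cache = []  # tokens already produced by the extractor

    @torch.no_grad()  # inference-time sketch
    def observe(self, glimpse_tokens):
        # Each glimpse passes through the extractor exactly once;
        # its tokens are recycled for every subsequent prediction.
        self.cache.append(self.extractor(glimpse_tokens))

    @torch.no_grad()
    def predict(self):
        # Only the cheaper aggregator reruns over the concatenated cached tokens.
        return self.aggregator(torch.cat(self.cache, dim=1))
```

A single decoder block, as stated in the abstract, would then map the aggregated tokens to the reconstruction or downstream prediction.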
Related papers
- LookWhere? Efficient Visual Recognition by Learning Where to Look and What to See from Self-Supervision [10.461453853510964]
Vision transformers are ever larger, more accurate, and more expensive to compute.
We turn to adaptive computation to cope with this cost by learning to predict where to compute.
Our LookWhere method divides computation between a low-resolution selector and a high-resolution extractor, without ever processing the full high-resolution input.
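As a rough illustration of the selector/extractor split, the step below scores a low-resolution view, keeps the top-k locations, and runs the heavy extractor only on those patches; the `selector`, `extractor`, and `patchify` callables and the budget `k` are hypothetical placeholders, not the paper's interface.

```python
import torch
import torch.nn.functional as F

def lookwhere_step(image, selector, extractor, patchify, k=16):
    """Hypothetical sketch: a cheap low-resolution selector decides where to
    look; only the k highest-scoring high-resolution patches are extracted."""
    lowres = F.interpolate(image, scale_factor=0.25)   # cheap downsampled view
    scores = selector(lowres)                          # (B, num_locations) saliency
    idx = scores.topk(k, dim=-1).indices               # where to look
    patches = patchify(image, idx)                     # gather high-res patches there
    return extractor(patches)                          # what to see
```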
arXiv Detail & Related papers (2025-05-23T15:56:35Z)
- Image Coding for Machines via Feature-Preserving Rate-Distortion Optimization [27.97760974010369]
We show an approach to reduce the effect of compression on a task loss using the distance between features as a distortion metric.
We simplify the RDO formulation to make the distortion term computable using block-based encoders.
We show up to 10% bit-rate savings for the same computer vision accuracy compared to RDO based on SSE.
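In other words, the usual rate-distortion objective J = D + λR is kept, but D measures a distance between features rather than pixel error. A hypothetical sketch, where the feature network `f`, the candidate list, and the Lagrange multiplier `lam` are assumptions:

```python
import torch

def feature_rdo_select(block, candidates, f, lam):
    """Pick the encoding candidate minimizing J = D_feat + lam * R, with
    D_feat the squared distance between features of the original block and
    its reconstruction. Purely illustrative."""
    target = f(block)
    best_cost, best_recon = None, None
    for recon, rate in candidates:              # (reconstruction, bits) pairs
        cost = torch.sum((target - f(recon)) ** 2) + lam * rate
        if best_cost is None or cost < best_cost:
            best_cost, best_recon = cost, recon
    return best_recon
```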
arXiv Detail & Related papers (2025-04-03T02:11:26Z)
- FullTransNet: Full Transformer with Local-Global Attention for Video Summarization [16.134118247239527]
We propose a transformer-like architecture named FullTransNet for video summarization.
It uses a full transformer with an encoder-decoder structure as an alternative architecture for video summarization.
Our model achieves F-scores of 54.4% and 63.9%, respectively, while maintaining relatively low computational and memory requirements.
arXiv Detail & Related papers (2025-01-01T16:07:27Z)
- Ring Attention with Blockwise Transformers for Near-Infinite Context [88.61687950039662]
We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices.
Our approach enables training and inference of sequences that are up to device count times longer than those achievable by prior memory-efficient Transformers.
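The key observation behind blockwise attention is that the softmax can be accumulated one key/value block at a time, so the full score matrix is never materialized; in Ring Attention those blocks additionally live on different devices and rotate around a ring. A single-device sketch of the blockwise accumulation (block size and shapes are illustrative):

```python
import torch

def blockwise_attention(q, k, v, block=1024):
    """Numerically stable attention computed over key/value blocks using an
    online softmax, never forming the full (L x L) score matrix."""
    scale = q.shape[-1] ** -0.5
    m = torch.full(q.shape[:-1], float("-inf"), device=q.device)  # running max
    num = torch.zeros_like(q)                                     # running numerator
    den = torch.zeros(q.shape[:-1], device=q.device)              # running denominator
    for start in range(0, k.shape[-2], block):
        kb = k[..., start:start + block, :]
        vb = v[..., start:start + block, :]
        s = q @ kb.transpose(-2, -1) * scale                      # scores for this block
        m_new = torch.maximum(m, s.max(dim=-1).values)
        alpha = torch.exp(m - m_new)                              # rescale old accumulators
        p = torch.exp(s - m_new.unsqueeze(-1))
        num = num * alpha.unsqueeze(-1) + p @ vb
        den = den * alpha + p.sum(dim=-1)
        m = m_new
    return num / den.unsqueeze(-1)
```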
arXiv Detail & Related papers (2023-10-03T08:44:50Z)
- InvKA: Gait Recognition via Invertible Koopman Autoencoder [15.718065380333718]
Most gait recognition methods suffer from poor interpretability and high computational cost.
To improve interpretability, we investigate gait features in the embedding space based on Koopman operator theory.
To reduce the computational cost of our algorithm, we use a reversible autoencoder to reduce the model size and eliminate convolutional layers.
arXiv Detail & Related papers (2023-09-26T08:53:54Z)
- Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers [27.029600581635957]
We describe a method for identifying and re-processing only those tokens that have changed significantly over time.
We evaluate our method on large-scale datasets for video object detection (ImageNet VID) and action recognition (EPIC-Kitchens 100)
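The token-level gating can be sketched as keeping a reference copy of each token and recomputing an expensive sub-module only for tokens that changed beyond a threshold; the threshold and the restriction to a token-wise module (e.g., the MLP) are illustrative simplifications, not the paper's full scheme.

```python
import torch

class TokenGateSketch:
    """Recompute a token-wise module only for tokens whose change since the
    last processed frame exceeds `tau`; reuse cached outputs otherwise."""

    def __init__(self, tau=0.1):
        self.tau = tau
        self.ref_in = None    # inputs that were actually processed
        self.ref_out = None   # their cached outputs

    def __call__(self, tokens, module):
        if self.ref_in is None:
            self.ref_in, self.ref_out = tokens.clone(), module(tokens)
            return self.ref_out
        changed = (tokens - self.ref_in).norm(dim=-1) > self.tau  # (B, N) mask
        out = self.ref_out.clone()
        if changed.any():
            out[changed] = module(tokens[changed])                # recompute only these
            self.ref_in[changed] = tokens[changed]
            self.ref_out = out
        return out
```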
arXiv Detail & Related papers (2023-08-25T17:10:12Z)
- Segmented Recurrent Transformer: An Efficient Sequence-to-Sequence Model [10.473819332984005]
We propose a segmented recurrent transformer (SRformer) that combines segmented (local) attention with recurrent attention.
The proposed model achieves 6-22% higher ROUGE-1 scores than a segmented transformer and outperforms other recurrent transformer approaches.
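Combining segmented (local) attention with a recurrent carry-over can be sketched as attending within each segment plus one memory token that summarizes everything seen so far. This is a generic sketch of the idea, not SRformer's exact recurrent-attention mechanism; the segment size and the memory update rule are assumptions.

```python
import torch
import torch.nn as nn

class SegmentedRecurrentSketch(nn.Module):
    """Attend within fixed-size segments while passing a single recurrent
    memory token from one segment to the next. Illustrative only."""

    def __init__(self, dim=256, heads=4, segment=128):
        super().__init__()
        self.segment = segment
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.memory0 = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, x):                                   # x: (B, L, D)
        mem = self.memory0.expand(x.shape[0], -1, -1)
        outs = []
        for start in range(0, x.shape[1], self.segment):
            seg = x[:, start:start + self.segment]
            ctx = torch.cat([mem, seg], dim=1)              # local tokens + carried memory
            out, _ = self.attn(seg, ctx, ctx)               # queries: segment only
            outs.append(out)
            mem = out.mean(dim=1, keepdim=True)             # recurrent summary for next segment
        return torch.cat(outs, dim=1)
```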
arXiv Detail & Related papers (2023-05-24T03:47:22Z)
- Skip-Attention: Improving Vision Transformers by Paying Less Attention [55.47058516775423]
Vision transformers (ViTs) use expensive self-attention operations in every layer.
We propose SkipAt, a method to reuse self-attention from preceding layers to approximate attention at one or more subsequent layers.
We show the effectiveness of our method in image classification and self-supervised learning on ImageNet-1K, semantic segmentation on ADE20K, image denoising on SIDD, and video denoising on DAVIS.
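Reusing self-attention across layers can be sketched as caching one layer's attention output and approximating the next layer's attention from it with a lightweight function instead of a full QKV computation. A hypothetical sketch; the small reuse network below is an assumption, not SkipAt's exact parametric function.

```python
import torch.nn as nn

class AttentionReuseLayer(nn.Module):
    """Transformer block that skips its own self-attention and instead
    transforms the attention output cached from a previous layer."""

    def __init__(self, dim=384, hidden=768):
        super().__init__()
        self.reuse = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, cached_attn):
        # Approximate this layer's attention from the cached one -- no QK^T here.
        x = x + self.reuse(self.norm1(cached_attn))
        x = x + self.mlp(self.norm2(x))
        return x
```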
arXiv Detail & Related papers (2023-01-05T18:59:52Z)
- Temporal Latent Bottleneck: Synthesis of Fast and Slow Processing Mechanisms in Sequence Learning [85.95599675484341]
Recurrent neural networks have a strong inductive bias towards learning temporally compressed representations.
Transformers have little inductive bias towards learning temporally compressed representations.
arXiv Detail & Related papers (2022-05-30T00:12:33Z)
- Token Pooling in Vision Transformers [37.11990688046186]
In vision transformers, self-attention is not the major bottleneck, e.g., more than 80% of the computation is spent on fully-connected layers.
We propose a novel token downsampling method, Token Pooling, that efficiently exploits redundancies in images and intermediate token representations.
Our experiments show that Token Pooling significantly improves the cost-accuracy trade-off over state-of-the-art downsampling methods.
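Token downsampling can be sketched as clustering the intermediate tokens and keeping one representative per cluster, so later layers see a shorter sequence. A simplified k-means-style sketch; the clustering choice and cluster count are assumptions, not the paper's exact algorithm.

```python
import torch

def pool_tokens(tokens, k, iters=5):
    """Reduce (B, N, D) tokens to (B, k, D) by averaging k-means-style clusters."""
    B, N, D = tokens.shape
    centers = tokens[:, torch.randperm(N)[:k], :].clone()     # random initial centers
    for _ in range(iters):
        dist = torch.cdist(tokens, centers)                   # (B, N, k) distances
        assign = dist.argmin(dim=-1)                          # nearest center per token
        for c in range(k):
            mask = (assign == c).unsqueeze(-1)                # (B, N, 1)
            denom = mask.sum(dim=1).clamp(min=1)
            centers[:, c] = (tokens * mask).sum(dim=1) / denom
    return centers
```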
arXiv Detail & Related papers (2021-10-08T02:22:50Z)
- TEACHTEXT: CrossModal Generalized Distillation for Text-Video Retrieval [103.85002875155551]
We propose a novel generalized distillation method, TeachText, for exploiting large-scale language pretraining.
We extend our method to video side modalities and show that we can effectively reduce the number of used modalities at test time.
Our approach advances the state of the art on several video retrieval benchmarks by a significant margin and adds no computational overhead at test time.
arXiv Detail & Related papers (2021-04-16T17:55:28Z)
- Decoupling Representation Learning from Reinforcement Learning [89.82834016009461]
We introduce an unsupervised learning task called Augmented Temporal Contrast (ATC).
ATC trains a convolutional encoder to associate pairs of observations separated by a short time difference.
In online RL experiments, we show that training the encoder exclusively using ATC matches or outperforms end-to-end RL.
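The temporal-contrast objective is, in essence, an InfoNCE loss pairing an observation with a near-future observation from the same trajectory. A minimal sketch of such a loss; the temperature, the shared encoder, and the time offset are assumptions, not the exact ATC formulation (which adds components such as a momentum encoder).

```python
import torch
import torch.nn.functional as F

def temporal_contrast_loss(encoder, obs_t, obs_tk, temperature=0.1):
    """InfoNCE-style loss: each encoded observation at time t should match the
    encoding of the same trajectory a few steps later (obs_tk), against the
    other trajectories in the batch. Illustrative only."""
    z_a = F.normalize(encoder(obs_t), dim=-1)    # (B, D) anchors
    z_p = F.normalize(encoder(obs_tk), dim=-1)   # (B, D) temporal positives
    logits = z_a @ z_p.t() / temperature         # (B, B) similarity matrix
    labels = torch.arange(z_a.shape[0], device=logits.device)
    return F.cross_entropy(logits, labels)
```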
arXiv Detail & Related papers (2020-09-14T19:11:13Z)
- Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing [112.2208052057002]
We propose Funnel-Transformer which gradually compresses the sequence of hidden states to a shorter one.
With comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks.
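The gradual compression can be sketched as encoder stages separated by strided pooling of the hidden states, so deeper stages attend over shorter sequences. This is a simplified sketch; the actual model pools on the query side of attention and re-expands hidden states for token-level tasks, and the stage sizes here are assumptions.

```python
import torch.nn as nn

class FunnelSketch(nn.Module):
    """Encoder stages separated by strided mean pooling, so each later stage
    attends over a shorter sequence. Illustrative simplification."""

    def __init__(self, dim=256, heads=4, blocks_per_stage=(2, 2, 2)):
        super().__init__()
        make_stage = lambda n: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
            num_layers=n)
        self.stages = nn.ModuleList(make_stage(n) for n in blocks_per_stage)
        self.pool = nn.AvgPool1d(kernel_size=2, stride=2)

    def forward(self, x):                          # x: (B, L, D)
        for i, stage in enumerate(self.stages):
            x = stage(x)
            if i < len(self.stages) - 1:           # halve the sequence between stages
                x = self.pool(x.transpose(1, 2)).transpose(1, 2)
        return x
```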
arXiv Detail & Related papers (2020-06-05T05:16:23Z)