Reversible Vision Transformers
- URL: http://arxiv.org/abs/2302.04869v1
- Date: Thu, 9 Feb 2023 18:59:54 GMT
- Title: Reversible Vision Transformers
- Authors: Karttikeya Mangalam, Haoqi Fan, Yanghao Li, Chao-Yuan Wu, Bo Xiong,
Christoph Feichtenhofer, Jitendra Malik
- Abstract summary: Reversible Vision Transformers are a memory efficient architecture for visual recognition.
We adapt two popular models, namely Vision Transformer and Multiscale Vision Transformers, to reversible variants.
We find that the additional computational burden of recomputing activations is more than overcome for deeper models.
- Score: 74.3500977090597
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Reversible Vision Transformers, a memory efficient architecture
design for visual recognition. By decoupling the GPU memory requirement from
the depth of the model, Reversible Vision Transformers enable scaling up
architectures with efficient memory usage. We adapt two popular models, namely
Vision Transformer and Multiscale Vision Transformers, to reversible variants
and benchmark extensively across both model sizes and tasks of image
classification, object detection and video classification. Reversible Vision
Transformers achieve a reduced memory footprint of up to 15.5x at roughly
identical model complexity, parameters and accuracy, demonstrating the promise
of reversible vision transformers as an efficient backbone for hardware
resource limited training regimes. Finally, we find that the additional
computational burden of recomputing activations is more than overcome for
deeper models, where throughput can increase up to 2.3x over their
non-reversible counterparts. Full code and trained models are available at
https://github.com/facebookresearch/slowfast. A simpler version that is easy to
understand and modify is also available at https://github.com/karttikeya/minREV.
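The memory savings come from the reversible, two-residual-stream block structure: because each block's inputs can be reconstructed exactly from its outputs, activations need not be cached for backpropagation and can be recomputed on the fly, which is what decouples GPU memory from depth. Below is a minimal, illustrative PyTorch sketch of such a block, assuming attention as the first sub-function and an MLP as the second; the class name, hyperparameters, and the round-trip check are illustrative assumptions rather than the paper's implementation (see the linked repositories for the actual code).

# Minimal sketch of a reversible transformer block (illustrative only).
# The token stream is split into two halves (X1, X2); the forward map is invertible,
# so inputs can be recomputed from outputs instead of being stored.
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        # F: attention sub-block, G: MLP sub-block (each with its own LayerNorm)
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def F(self, x):
        h = self.norm1(x)
        return self.attn(h, h, h, need_weights=False)[0]

    def G(self, x):
        return self.mlp(self.norm2(x))

    def forward(self, x1, x2):
        # Y1 = X1 + F(X2);  Y2 = X2 + G(Y1)
        y1 = x1 + self.F(x2)
        y2 = x2 + self.G(y1)
        return y1, y2

    @torch.no_grad()
    def inverse(self, y1, y2):
        # Exact reconstruction of the inputs from the outputs:
        # X2 = Y2 - G(Y1);  X1 = Y1 - F(X2)
        x2 = y2 - self.G(y1)
        x1 = y1 - self.F(x2)
        return x1, x2

# Round-trip check: inputs are recovered from outputs up to floating-point error.
blk = ReversibleBlock(dim=64).eval()
x1, x2 = torch.randn(2, 16, 64), torch.randn(2, 16, 64)
with torch.no_grad():
    y1, y2 = blk(x1, x2)
    r1, r2 = blk.inverse(y1, y2)
print(torch.allclose(r1, x1, atol=1e-5), torch.allclose(r2, x2, atol=1e-5))

In a full implementation, the backward pass is wired through a custom autograd function that calls the inverse to regenerate each block's inputs rather than storing them; the sketch above only demonstrates the invertible structure and the recomputation it enables.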
Related papers
- Dual-path Adaptation from Image to Video Transformers [62.056751480114784]
We efficiently transfer the strong representation power of vision foundation models, such as ViT and Swin, to video understanding with only a few trainable parameters.
We propose a novel DualPath adaptation separated into spatial and temporal adaptation paths, where a lightweight bottleneck adapter is employed in each transformer block.
arXiv Detail & Related papers (2023-03-17T09:37:07Z)
- Vis-TOP: Visual Transformer Overlay Processor [9.80151619872144]
Transformers have achieved good results in Natural Language Processing (NLP) and have also started to expand into Computer Vision (CV).
We propose Vis-TOP, an overlay processor for various visual Transformer models.
Vis-TOP summarizes the characteristics of all visual Transformer models and implements a three-layer and two-level transformation structure.
arXiv Detail & Related papers (2021-10-21T08:11:12Z)
- Vision Xformers: Efficient Attention for Image Classification [0.0]
We modify the ViT architecture to work on longer sequence data by replacing the quadratic attention with efficient transformers.
We show that ViX performs better than ViT on image classification while consuming fewer computing resources.
arXiv Detail & Related papers (2021-07-05T19:24:23Z)
- Scaling Vision Transformers [82.08465256393514]
We study how Vision Transformers scale and characterize the relationships between error rate, data, and compute.
We train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy.
The model also performs well on few-shot learning, for example, attaining 84.86% top-1 accuracy on ImageNet with only 10 examples per class.
arXiv Detail & Related papers (2021-06-08T17:47:39Z)
- Glance-and-Gaze Vision Transformer [13.77016463781053]
We propose a new vision Transformer named Glance-and-Gaze Transformer (GG-Transformer).
It is motivated by the Glance and Gaze behavior of human beings when recognizing objects in natural scenes.
We empirically demonstrate our method achieves consistently superior performance over previous state-of-the-art Transformers.
arXiv Detail & Related papers (2021-06-04T06:13:47Z)
- Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, short for 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)
- Multiscale Vision Transformers [79.76412415996892]
We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models.
We evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks.
arXiv Detail & Related papers (2021-04-22T17:59:45Z)
- CvT: Introducing Convolutions to Vision Transformers [44.74550305869089]
Convolutional vision Transformer (CvT) improves Vision Transformer (ViT) in performance and efficiency.
The new architecture introduces convolutions into ViT to yield the best of both designs.
arXiv Detail & Related papers (2021-03-29T17:58:22Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long-range dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.