IA-RED$^2$: Interpretability-Aware Redundancy Reduction for Vision
Transformers
- URL: http://arxiv.org/abs/2106.12620v1
- Date: Wed, 23 Jun 2021 18:29:23 GMT
- Title: IA-RED$^2$: Interpretability-Aware Redundancy Reduction for Vision
Transformers
- Authors: Bowen Pan, Yifan Jiang, Rameswar Panda, Zhangyang Wang, Rogerio Feris,
Aude Oliva
- Abstract summary: The self-attention-based transformer has recently become the leading backbone in computer vision.
We present an Interpretability-Aware REDundancy REDuction framework (IA-RED$^2$).
We include extensive experiments on both image and video tasks, where our method delivers up to 1.4X speed-up.
- Score: 81.31885548824926
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The self-attention-based transformer has recently become the leading
backbone in computer vision. Despite the impressive success of transformers
across a variety of vision tasks, they still suffer from heavy computation and
intensive memory costs. To address this limitation, this paper presents an
Interpretability-Aware REDundancy REDuction framework (IA-RED$^2$). We start by
observing a large amount of redundant computation, mainly spent on uncorrelated
input patches, and then introduce an interpretable module to dynamically and
gracefully drop these redundant patches. This framework is then extended to a
hierarchical structure, where uncorrelated tokens at different stages are
gradually removed, resulting in a considerable reduction of computational cost.
We include extensive experiments on both image and video tasks, where our
method delivers up to 1.4X speed-up for state-of-the-art models like DeiT and
TimeSformer while sacrificing less than 0.7% accuracy. More importantly, in
contrast to other acceleration approaches, our method is inherently
interpretable with substantial visual evidence, bringing the vision transformer
closer to a more human-understandable architecture while making it lighter. We
demonstrate, with both qualitative and quantitative results, that the
interpretability that naturally emerges in our framework can outperform the raw
attention learned by the original vision transformer, as well as the attention
maps generated by off-the-shelf interpretation methods. Project Page:
http://people.csail.mit.edu/bpan/ia-red/.
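To make the mechanism described in the abstract concrete, here is a minimal PyTorch sketch of interpretability-aware token dropping. The `ScoreNet` module, the `drop_redundant_tokens` helper, the fixed `keep_ratio`, and the per-stage grouping are hypothetical illustrations, not the authors' implementation; the sketch shows only inference-time dropping, whereas the paper learns its interpretable module rather than hand-crafting it.

```python
# Minimal sketch of interpretability-aware token dropping (illustrative names,
# NOT the authors' code): a small scorer rates each patch token, and the least
# informative patches are dropped before each stage of transformer blocks.
import torch
import torch.nn as nn


class ScoreNet(nn.Module):
    """Hypothetical stand-in for the paper's interpretable policy module."""

    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) -> per-token scores: (batch, num_tokens)
        return self.mlp(tokens).squeeze(-1)


def drop_redundant_tokens(tokens, scores, keep_ratio):
    """Keep the top-scoring fraction of patch tokens; the class token always stays."""
    cls_tok, patches = tokens[:, :1], tokens[:, 1:]
    num_keep = max(1, int(patches.shape[1] * keep_ratio))
    idx = scores[:, 1:].topk(num_keep, dim=1).indices          # (batch, num_keep)
    idx = idx.unsqueeze(-1).expand(-1, -1, patches.shape[-1])  # (batch, num_keep, dim)
    return torch.cat([cls_tok, patches.gather(1, idx)], dim=1)


# Hierarchical usage: fewer tokens survive into each successive stage.
dim, num_stages, keep_ratio = 384, 3, 0.7
scorers = nn.ModuleList([ScoreNet(dim) for _ in range(num_stages)])
stages = nn.ModuleList([
    nn.TransformerEncoderLayer(d_model=dim, nhead=6, batch_first=True)
    for _ in range(num_stages)
])

x = torch.randn(2, 1 + 196, dim)  # class token + 14x14 patch tokens
for scorer, stage in zip(scorers, stages):
    x = drop_redundant_tokens(x, scorer(x), keep_ratio)
    x = stage(x)
```

Because each stage operates on roughly `keep_ratio` times the previous token count, the quadratic cost of self-attention shrinks stage by stage, which is the source of the reported speed-up.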
Related papers
- X-Pruner: eXplainable Pruning for Vision Transformers [12.296223124178102]
Vision transformer models usually suffer from intensive computational costs and heavy memory requirements.
Recent studies have proposed to prune transformers in an unexplainable manner, overlooking the relationship between the model's internal units and the target class.
We propose a novel explainable pruning framework dubbed X-Pruner, which is designed by considering the explainability of the pruning criterion.
arXiv Detail & Related papers (2023-03-08T23:10:18Z)
- Reversible Vision Transformers [74.3500977090597]
Reversible Vision Transformers are a memory efficient architecture for visual recognition.
We adapt two popular models, namely Vision Transformer and Multiscale Vision Transformers, to reversible variants.
We find that, for deeper models, the memory savings more than offset the additional computational burden of recomputing activations.
arXiv Detail & Related papers (2023-02-09T18:59:54Z)
- Learning to Merge Tokens in Vision Transformers [22.029357721814044]
We present the PatchMerger, a module that reduces the number of patches or tokens the network has to process by merging them between two consecutive intermediate layers (a minimal sketch of this idea appears after the list below).
We show that the PatchMerger achieves a significant speedup across various model sizes while matching the original performance both upstream and downstream after fine-tuning.
arXiv Detail & Related papers (2022-02-24T10:56:17Z)
- Learned Queries for Efficient Local Attention [11.123272845092611]
The self-attention mechanism in vision transformers suffers from high latency and inefficient memory utilization.
We propose a new shift-invariant local attention layer, called query and attend (QnA), that aggregates the input locally in an overlapping manner.
We show improvements in speed and memory complexity while achieving comparable accuracy with state-of-the-art models.
arXiv Detail & Related papers (2021-12-21T18:52:33Z)
- AdaViT: Adaptive Vision Transformers for Efficient Image Recognition [78.07924262215181]
We introduce AdaViT, an adaptive framework that learns usage policies for which patches, self-attention heads, and transformer blocks to use.
Our method obtains a more than 2x improvement in efficiency over state-of-the-art vision transformers with only a 0.8% drop in accuracy.
arXiv Detail & Related papers (2021-11-30T18:57:02Z)
- Patch Slimming for Efficient Vision Transformers [107.21146699082819]
We study the efficiency of vision transformers by identifying redundant computation in given networks.
We present a novel patch slimming approach that discards useless patches in a top-down paradigm.
Experimental results on benchmark datasets demonstrate that the proposed method can significantly reduce the computational costs of vision transformers.
arXiv Detail & Related papers (2021-06-05T09:46:00Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds on the observation that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
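As referenced in the PatchMerger entry above, here is a minimal sketch of merging tokens between two consecutive intermediate layers. The `PatchMergerSketch` class, its dimensions, and the output token count are assumptions for illustration, not the paper's implementation: a learned projection scores each incoming token against a fixed number of output slots, and each output token is a softmax-weighted combination of the inputs.

```python
# Minimal sketch (assumed, not the paper's code) of a PatchMerger-style module:
# a learned projection assigns each incoming token a weight per output slot, and
# each output token is a softmax-weighted mix of the inputs, so a block placed
# between two transformer layers shrinks N tokens down to M.
import torch
import torch.nn as nn


class PatchMergerSketch(nn.Module):
    def __init__(self, dim: int, num_output_tokens: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, num_output_tokens, bias=False)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, n_in, dim)
        weights = self.proj(self.norm(tokens))          # (batch, n_in, n_out)
        weights = weights.transpose(1, 2).softmax(-1)   # (batch, n_out, n_in)
        return weights @ tokens                         # (batch, n_out, dim)


# Example: merge 196 patch tokens down to 8 between two intermediate layers.
merger = PatchMergerSketch(dim=384, num_output_tokens=8)
x = torch.randn(2, 196, 384)
print(merger(x).shape)  # torch.Size([2, 8, 384])
```

Every layer placed after such a module processes only the merged tokens, so the self-attention cost drops quadratically with the reduction in token count.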