Searching Intrinsic Dimensions of Vision Transformers
- URL: http://arxiv.org/abs/2204.07722v1
- Date: Sat, 16 Apr 2022 05:16:35 GMT
- Title: Searching Intrinsic Dimensions of Vision Transformers
- Authors: Fanghui Xue, Biao Yang, Yingyong Qi and Jack Xin
- Abstract summary: We propose SiDT, a method for pruning vision transformer backbones on more complicated vision tasks like object detection.
Experiments on the CIFAR-100 and COCO datasets show that backbones with 20% or 40% of their dimensions/parameters pruned can match or even exceed the performance of the unpruned models.
- Score: 6.004704152622424
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many researchers have shown that transformers perform as well as
convolutional neural networks on many computer vision tasks. Meanwhile, the
large computational cost of their attention modules hinders further studies and
applications on edge devices. Some pruning methods have been developed to
construct efficient vision transformers, but most of them consider image
classification tasks only. Inspired by these results, we propose SiDT, a method
for pruning vision transformer backbones on more complicated vision tasks like
object detection, based on a search of transformer dimensions. Experiments on
the CIFAR-100 and COCO datasets show that backbones with 20% or 40% of their
dimensions/parameters pruned can perform similarly to or even better than
the unpruned models. Moreover, we also provide a complexity analysis
and comparisons with previous pruning methods.
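To make the idea of dimension pruning concrete, here is a minimal, hypothetical PyTorch sketch; it is not the SiDT search procedure from the paper, but an illustration of shrinking the hidden dimension of a transformer MLP block. The function name prune_mlp_hidden_dim and the magnitude-based scoring of hidden units are assumptions made for this example only.

```python
import torch
import torch.nn as nn

def prune_mlp_hidden_dim(fc1: nn.Linear, fc2: nn.Linear, keep_ratio: float = 0.8):
    """Shrink the hidden dimension of a transformer MLP block (fc1 -> act -> fc2)
    by keeping the hidden units with the largest combined weight norms.
    Magnitude scoring here is a simple stand-in for a learned dimension search."""
    hidden_dim = fc1.out_features
    keep = max(1, int(hidden_dim * keep_ratio))

    # Score each hidden unit by the norm of its incoming and outgoing weights.
    score = fc1.weight.norm(dim=1) + fc2.weight.norm(dim=0)
    idx = score.topk(keep).indices.sort().values

    # Build smaller layers and copy over the surviving rows/columns.
    new_fc1 = nn.Linear(fc1.in_features, keep, bias=fc1.bias is not None)
    new_fc2 = nn.Linear(keep, fc2.out_features, bias=fc2.bias is not None)
    with torch.no_grad():
        new_fc1.weight.copy_(fc1.weight[idx])
        if fc1.bias is not None:
            new_fc1.bias.copy_(fc1.bias[idx])
        new_fc2.weight.copy_(fc2.weight[:, idx])
        if fc2.bias is not None:
            new_fc2.bias.copy_(fc2.bias)
    return new_fc1, new_fc2

# Example: prune 20% of the hidden units of a ViT-style MLP (dim=384, hidden=1536).
fc1, fc2 = nn.Linear(384, 1536), nn.Linear(1536, 384)
fc1, fc2 = prune_mlp_hidden_dim(fc1, fc2, keep_ratio=0.8)
print(fc1.out_features)  # 1228 hidden units remain
```

In a full backbone, the same idea would be applied per block, and to attention dimensions as well as MLP dimensions; how the kept ratio is chosen per dimension is exactly what a search-based method like SiDT learns, rather than fixing it by hand as above.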
Related papers
- Dynamic Grained Encoder for Vision Transformers [150.02797954201424]
This paper introduces sparse queries for vision transformers to exploit the intrinsic spatial redundancy of natural images.
We propose a Dynamic Grained Encoder for vision transformers, which can adaptively assign a suitable number of queries to each spatial region.
Our encoder allows the state-of-the-art vision transformers to reduce computational complexity by 40%-60% while maintaining comparable performance on image classification.
arXiv Detail & Related papers (2023-01-10T07:55:29Z) - 3D Vision with Transformers: A Survey [114.86385193388439]
The success of the transformer architecture in natural language processing has triggered attention in the computer vision field.
We present a systematic and thorough review of more than 100 transformer methods for different 3D vision tasks.
We discuss transformer design in 3D vision, which allows it to process data with various 3D representations.
arXiv Detail & Related papers (2022-08-08T17:59:11Z) - An Extendable, Efficient and Effective Transformer-based Object Detector [95.06044204961009]
We integrate Vision and Detection Transformers (ViDT) to construct an effective and efficient object detector.
ViDT introduces a reconfigured attention module to extend the recent Swin Transformer to be a standalone object detector.
We extend it to ViDT+ to support joint-task learning for object detection and instance segmentation.
arXiv Detail & Related papers (2022-04-17T09:27:45Z) - ViDT: An Efficient and Effective Fully Transformer-based Object Detector [97.71746903042968]
Detection transformers are the first fully end-to-end learning systems for object detection.
Vision transformers are the first fully transformer-based architectures for image classification.
In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector.
arXiv Detail & Related papers (2021-10-08T06:32:05Z) - Efficient Vision Transformers via Fine-Grained Manifold Distillation [96.50513363752836]
Vision transformer architectures have shown extraordinary performance on many computer vision tasks.
Although network performance is boosted, transformers often require more computational resources.
We propose to excavate useful information from the teacher transformer through the relationship between images and the divided patches.
arXiv Detail & Related papers (2021-07-03T08:28:34Z) - Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z) - Toward Transformer-Based Object Detection [12.704056181392415]
Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results.
ViT-FRCNN demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance.
We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
arXiv Detail & Related papers (2020-12-17T22:33:14Z)