Dynamic Grained Encoder for Vision Transformers
- URL: http://arxiv.org/abs/2301.03831v1
- Date: Tue, 10 Jan 2023 07:55:29 GMT
- Title: Dynamic Grained Encoder for Vision Transformers
- Authors: Lin Song, Songyang Zhang, Songtao Liu, Zeming Li, Xuming He, Hongbin
Sun, Jian Sun, Nanning Zheng
- Abstract summary: This paper introduces sparse queries for vision transformers to exploit the intrinsic spatial redundancy of natural images.
We propose a Dynamic Grained Encoder for vision transformers, which can adaptively assign a suitable number of queries to each spatial region.
Our encoder allows the state-of-the-art vision transformers to reduce computational complexity by 40%-60% while maintaining comparable performance on image classification.
- Score: 150.02797954201424
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers, the de-facto standard for language modeling, have been recently
applied for vision tasks. This paper introduces sparse queries for vision
transformers to exploit the intrinsic spatial redundancy of natural images and
save computational costs. Specifically, we propose a Dynamic Grained Encoder
for vision transformers, which can adaptively assign a suitable number of
queries to each spatial region. Thus it achieves a fine-grained representation
in discriminative regions while keeping high efficiency. Besides, the dynamic
grained encoder is compatible with most vision transformer frameworks. Without
bells and whistles, our encoder allows the state-of-the-art vision transformers
to reduce computational complexity by 40%-60% while maintaining comparable
performance on image classification. Extensive experiments on object detection
and segmentation further demonstrate the generalizability of our approach. Code
is available at https://github.com/StevenGrove/vtpack.
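For intuition, below is a minimal sketch of the idea described in the abstract: a lightweight gate picks a query granularity for each spatial region, the region is pooled to that many sparse queries, the queries attend over the full token set, and the results are scattered back to full resolution. This is not the authors' implementation (see the vtpack repository for that); the class name, region size, candidate granularities, and hard argmax gate are illustrative assumptions, and a real training setup would need a differentiable (e.g. Gumbel-style) gate plus a computation-budget term.

```python
# Illustrative sketch only -- not the official Dynamic Grained Encoder from vtpack.
# Assumptions: PyTorch, a (B, H, W, C) token grid, hard argmax gating at inference.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicGrainedEncoderSketch(nn.Module):
    def __init__(self, dim=192, region=4, granularities=(1, 2, 4), heads=3):
        super().__init__()
        self.region = region                    # each region covers region x region tokens
        self.granularities = granularities      # candidate queries-per-side within a region
        self.gate = nn.Linear(dim, len(granularities))  # picks a granularity per region
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, H, W, C) token grid; H and W are assumed divisible by the region size.
        B, H, W, C = x.shape
        R = self.region
        # Split into non-overlapping R x R regions -> (B, nH, nW, R, R, C).
        regions = x.view(B, H // R, R, W // R, R, C).permute(0, 1, 3, 2, 4, 5)
        # Gate on the region mean; hard choice here (training would relax this).
        choice = self.gate(regions.mean(dim=(3, 4))).argmax(dim=-1)   # (B, nH, nW)

        out = torch.empty_like(x)
        keys = x.flatten(1, 2)                                        # (B, H*W, C) keys/values
        for b in range(B):
            for i in range(H // R):
                for j in range(W // R):
                    g = self.granularities[int(choice[b, i, j])]
                    patch = regions[b, i, j].permute(2, 0, 1).contiguous()  # (C, R, R)
                    # Pool the region down to g x g sparse queries.
                    q = F.adaptive_avg_pool2d(patch, g).flatten(1).t().unsqueeze(0)  # (1, g*g, C)
                    q, _ = self.attn(self.norm(q), keys[b:b + 1], keys[b:b + 1])
                    # Scatter the coarse queries back to the region's full resolution.
                    up = F.interpolate(q.transpose(1, 2).reshape(1, C, g, g),
                                       size=(R, R), mode="nearest")
                    out[b, i * R:(i + 1) * R, j * R:(j + 1) * R] = up[0].permute(1, 2, 0)
        return out


if __name__ == "__main__":
    enc = DynamicGrainedEncoderSketch()
    tokens = torch.randn(2, 16, 16, 192)        # a 16 x 16 token grid with dim 192
    print(enc(tokens).shape)                    # torch.Size([2, 16, 16, 192])
```

With this kind of gating, the attention cost scales with the number of selected queries rather than with every token in every region, which is the source of the computational savings the abstract reports.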
Related papers
- Reversible Vision Transformers [74.3500977090597]
Reversible Vision Transformers are a memory efficient architecture for visual recognition.
We adapt two popular models, namely Vision Transformer and Multiscale Vision Transformers, to reversible variants.
We find that the additional computational burden of recomputing activations is more than overcome for deeper models.
arXiv Detail & Related papers (2023-02-09T18:59:54Z) - Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual
Grounding [27.568879624013576]
Multimodal transformer exhibits high capacity and flexibility to align image and text for visual grounding.
The existing encoder-only grounding framework suffers from heavy computation due to the self-attention operation with quadratic time complexity.
We present Dynamic Multimodal DETR (Dynamic MDETR), which decouples the whole grounding process into encoding and decoding phases.
arXiv Detail & Related papers (2022-09-28T09:43:02Z) - Searching Intrinsic Dimensions of Vision Transformers [6.004704152622424]
We propose SiDT, a method for pruning vision transformer backbones on more complicated vision tasks like object detection.
Experiments on the CIFAR-100 and COCO datasets show that backbones with 20% or 40% of their dimensions/parameters pruned can perform similarly to or even better than the unpruned models.
arXiv Detail & Related papers (2022-04-16T05:16:35Z) - Fully Transformer Networks for Semantic Image Segmentation [26.037770622551882]
We explore a novel framework for semantic image segmentation: an encoder-decoder based Fully Transformer Network (FTN).
We propose a Pyramid Group Transformer (PGT) as the encoder for progressively learning hierarchical features while reducing the computational complexity of the standard vision transformer (ViT).
Then, we propose a Feature Pyramid Transformer (FPT) to fuse semantic-level and spatial-level information from multiple levels of the PGT encoder for semantic image segmentation.
arXiv Detail & Related papers (2021-06-08T05:15:28Z) - Glance-and-Gaze Vision Transformer [13.77016463781053]
We propose a new vision Transformer, named Glance-and-Gaze Transformer (GG-Transformer).
It is motivated by the Glance and Gaze behavior of human beings when recognizing objects in natural scenes.
We empirically demonstrate our method achieves consistently superior performance over previous state-of-the-art Transformers.
arXiv Detail & Related papers (2021-06-04T06:13:47Z) - Transformer-Based Deep Image Matching for Generalizable Person
Re-identification [114.56752624945142]
We investigate the possibility of applying Transformers for image matching and metric learning given pairs of images.
We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention.
We propose a new simplified decoder, which drops the full attention implementation with the softmax weighting, keeping only the query-key similarity.
arXiv Detail & Related papers (2021-05-30T05:38:33Z) - Multiscale Vision Transformers [79.76412415996892]
We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models.
We evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks.
arXiv Detail & Related papers (2021-04-22T17:59:45Z) - Vision Transformers for Dense Prediction [77.34726150561087]
We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks.
Our experiments show that this architecture yields substantial improvements on dense prediction tasks.
arXiv Detail & Related papers (2021-03-24T18:01:17Z) - Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective
with Transformers [149.78470371525754]
We treat semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer to encode an image as a sequence of patches.
With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR).
SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes.
arXiv Detail & Related papers (2020-12-31T18:55:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.