Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction
without Convolutions
- URL: http://arxiv.org/abs/2102.12122v1
- Date: Wed, 24 Feb 2021 08:33:55 GMT
- Title: Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction
without Convolutions
- Authors: Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding
Liang, Tong Lu, Ping Luo, Ling Shao
- Abstract summary: This work investigates a simple backbone network useful for many dense prediction tasks without convolutions.
Unlike the recently proposed Transformer model (e.g., ViT), which is specifically designed for image classification, we propose the Pyramid Vision Transformer (PVT).
PVT can not only be trained on dense partitions of the image to achieve the high output resolution that is important for dense prediction, but also uses a progressive shrinking pyramid to reduce the computation of large feature maps.
- Score: 103.03973037619532
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although using convolutional neural networks (CNNs) as backbones achieves
great success in computer vision, this work investigates a simple backbone
network useful for many dense prediction tasks without convolutions. Unlike the
recently proposed Transformer model (e.g., ViT), which is specifically designed
for image classification, we propose the Pyramid Vision Transformer (PVT),
which overcomes the difficulties of porting Transformers to various dense
prediction tasks. PVT has several merits compared to prior art. (1) Unlike ViT,
which typically has low-resolution outputs and high computational and memory
costs, PVT can not only be trained on dense partitions of the image to achieve
the high output resolution that is important for dense prediction, but also
uses a progressive shrinking pyramid to reduce the computation of large feature
maps.
(2) PVT inherits the advantages of both CNNs and Transformers, making it a
unified, convolution-free backbone for various vision tasks that can directly
replace CNN backbones. (3) We validate PVT through extensive experiments,
showing that it boosts the performance of many downstream tasks, e.g., object
detection and semantic and instance segmentation. For example, with a
comparable number of parameters, RetinaNet+PVT achieves 40.4 AP on the COCO
dataset, surpassing RetinaNet+ResNet50 (36.3 AP) by 4.1 absolute AP. We hope
PVT can serve as an alternative and useful backbone for pixel-level prediction
and facilitate future research. Code is available at
https://github.com/whai362/PVT.
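The abstract describes the progressive shrinking pyramid and the cheaper attention on large feature maps only at a high level. The sketch below is a minimal, illustrative PyTorch rendering of that idea, assuming a strided patch embedding per stage and a spatial-reduction attention that downsamples keys and values before attending; module names, dimensions, patch size, and the reduction ratio are assumptions for illustration, not the reference implementation (see the repository above for that).

```python
# Minimal sketch of one PVT-style stage: a strided patch embedding that
# shrinks the feature map (the "progressive shrinking pyramid") followed by
# a transformer block whose keys/values are spatially reduced so attention on
# large feature maps stays affordable. Names and sizes are illustrative.
import torch
import torch.nn as nn


class SpatialReductionAttention(nn.Module):
    def __init__(self, dim, num_heads=2, sr_ratio=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Strided convolution shrinks the key/value token grid by sr_ratio,
        # so attention is (H*W) x (H*W / sr_ratio^2) instead of (H*W)^2.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        B, _, C = x.shape                              # x: (B, H*W, C) tokens
        kv = x.transpose(1, 2).reshape(B, C, H, W)
        kv = self.sr(kv).flatten(2).transpose(1, 2)    # reduced key/value tokens
        kv = self.norm(kv)
        out, _ = self.attn(x, kv, kv)
        return out


class PVTStage(nn.Module):
    def __init__(self, in_ch, dim, patch_size=4, num_heads=2, sr_ratio=4):
        super().__init__()
        # Strided "patch embedding": each stage downsamples the resolution,
        # producing the feature pyramid that dense-prediction heads expect.
        self.patch_embed = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        self.attn = SpatialReductionAttention(dim, num_heads, sr_ratio)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, x):
        x = self.patch_embed(x)                        # (B, dim, H', W')
        B, C, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)               # (B, H'*W', dim)
        x = x + self.attn(self.norm1(x), H, W)
        x = x + self.mlp(self.norm2(x))
        return x.transpose(1, 2).reshape(B, C, H, W)


feat = PVTStage(in_ch=3, dim=64)(torch.randn(1, 3, 224, 224))  # -> (1, 64, 56, 56)
```

Stacking several such stages with decreasing resolution and increasing channel width would yield the multi-scale feature pyramid that detection and segmentation heads normally take from a CNN backbone.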
Related papers
- ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions [4.554319452683839]
Vision Transformer (ViT) has achieved significant success in computer vision, but does not perform well in dense prediction tasks.
We present a plain, pre-training-free, and feature-enhanced ViT backbone with Convolutional Multi-scale feature interaction, named ViT-CoMer.
We propose a simple and efficient CNN-Transformer bidirectional fusion interaction module that performs multi-scale fusion across hierarchical features.
arXiv Detail & Related papers (2024-03-12T07:59:41Z)
- PriViT: Vision Transformers for Fast Private Inference [55.36478271911595]
The Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models for computer vision applications.
ViTs are ill-suited for private inference using secure multi-party computation protocols due to their large number of non-polynomial operations.
We propose PriViT, an algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy (see the sketch after this list).
arXiv Detail & Related papers (2023-10-06T21:45:05Z)
- RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving [80.14669385741202]
Vision transformers (ViTs) have achieved state-of-the-art results in many image-based benchmarks.
ViTs are notoriously hard to train and require a lot of training data to learn powerful representations.
We show that our method, called RangeViT, outperforms existing projection-based methods on nuScenes and Semantic KITTI.
arXiv Detail & Related papers (2023-01-24T18:50:48Z)
- Unified Visual Transformer Compression [102.26265546836329]
This paper proposes a unified ViT compression framework that seamlessly assembles three effective techniques: pruning, layer skipping, and knowledge distillation.
We formulate a budget-constrained, end-to-end optimization framework that jointly learns model weights, layer-wise pruning ratios/masks, and skip configurations.
Experiments are conducted with several ViT variants, e.g., DeiT and T2T-ViT backbones, on the ImageNet dataset, and our approach consistently outperforms recent competitors (see the sketch after this list).
arXiv Detail & Related papers (2022-03-15T20:38:22Z)
- Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT achieves 3.8% higher top-1 accuracy than the vanilla ViT.
arXiv Detail & Related papers (2021-08-03T18:04:31Z)
- PVT v2: Improved Baselines with Pyramid Vision Transformer [112.0139637538858]
We improve upon the original Pyramid Vision Transformer (PVT v1).
PVT v2 reduces the computational complexity of PVT v1 to linear.
It achieves significant improvements on fundamental vision tasks such as classification, detection, and segmentation.
arXiv Detail & Related papers (2021-06-25T17:51:09Z)
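For the PriViT entry above, "Taylorizing" a nonlinearity means replacing it with a low-order polynomial so that secure multi-party computation only has to evaluate additions and multiplications. The sketch below shows one such substitution, the second-order Taylor expansion of GELU(x) = x * Phi(x) around zero; the exact polynomials and the accuracy-aware selection strategy PriViT uses may differ.

```python
# Hedged illustration of "Taylorizing" a ViT nonlinearity: swap GELU for a
# low-order polynomial so private-inference protocols only see additions and
# multiplications. The quadratic below is the 2nd-order Taylor expansion of
# GELU(x) = x * Phi(x) around 0; PriViT's actual polynomials and its selective
# replacement mechanism may differ.
import math
import torch
import torch.nn as nn


class TaylorGELU(nn.Module):
    """GELU(x) ~= 0.5*x + x**2 / sqrt(2*pi) near x = 0 (polynomial only)."""

    def forward(self, x):
        return 0.5 * x + x.pow(2) / math.sqrt(2 * math.pi)


# The approximation is only accurate near zero, which is why a selective,
# accuracy-aware replacement (rather than swapping every GELU) matters.
x = torch.linspace(-1, 1, 5)
print(nn.GELU()(x))
print(TaylorGELU()(x))
```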
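For the unified visual Transformer compression entry, the summary mentions jointly learning model weights, layer-wise pruning masks, and skip configurations under a budget, together with knowledge distillation. The sketch below is a hedged illustration of what such a budget-constrained objective could look like; the gate parametrization, cost model, and loss weights are assumptions, not the paper's formulation.

```python
# Hedged sketch of a budget-constrained compression objective: task loss +
# distillation from an uncompressed teacher + a penalty steering the expected
# cost of learnable layer-skip gates toward a FLOPs budget. All names and
# weights are illustrative assumptions.
import torch
import torch.nn.functional as F


def compression_loss(student_logits, teacher_logits, labels,
                     gate_logits, layer_flops, flops_budget,
                     kd_weight=1.0, budget_weight=0.1, tau=4.0):
    # Standard supervised task loss.
    task = F.cross_entropy(student_logits, labels)
    # Knowledge distillation: match the softened teacher distribution.
    kd = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                  F.softmax(teacher_logits / tau, dim=-1),
                  reduction="batchmean") * tau * tau
    # Sigmoid gates act as soft keep-probabilities per layer; the penalty
    # pulls the expected FLOPs of kept layers toward the target budget.
    keep_prob = torch.sigmoid(gate_logits)
    expected_flops = (keep_prob * layer_flops).sum()
    budget = F.relu(expected_flops / flops_budget - 1.0)
    return task + kd_weight * kd + budget_weight * budget


# Toy usage with random tensors (12 layers, uniform per-layer cost):
loss = compression_loss(torch.randn(8, 1000), torch.randn(8, 1000),
                        torch.randint(0, 1000, (8,)),
                        gate_logits=torch.zeros(12, requires_grad=True),
                        layer_flops=torch.full((12,), 1.0),
                        flops_budget=8.0)
```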
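For the PVT v2 entry, a common way to bring spatial-reduction attention down to linear complexity in the number of query tokens is to pool keys and values to a fixed spatial size, and my understanding is that PVT v2 moves in this direction; the sketch below illustrates the mechanism with an assumed pool size and assumed layer names, and is not the paper's exact design.

```python
# Hedged sketch of "linear" spatial-reduction attention: keys/values are pooled
# to a fixed grid (here 7x7) regardless of input resolution, so attention cost
# grows linearly with the number of query tokens instead of quadratically.
# Pool size and layer names are assumptions for illustration.
import torch
import torch.nn as nn


class LinearSRAttention(nn.Module):
    def __init__(self, dim, num_heads=2, pool_size=7):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(pool_size)    # fixed 7x7 key/value grid
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        B, _, C = x.shape                              # x: (B, H*W, C)
        kv = self.pool(x.transpose(1, 2).reshape(B, C, H, W))
        kv = self.norm(kv.flatten(2).transpose(1, 2))  # (B, 49, C) always
        out, _ = self.attn(x, kv, kv)
        return out


tokens = torch.randn(1, 56 * 56, 64)
print(LinearSRAttention(64)(tokens, 56, 56).shape)     # torch.Size([1, 3136, 64])
```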