PVT v2: Improved Baselines with Pyramid Vision Transformer
- URL: http://arxiv.org/abs/2106.13797v7
- Date: Mon, 17 Apr 2023 12:49:29 GMT
- Title: PVT v2: Improved Baselines with Pyramid Vision Transformer
- Authors: Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding
Liang, Tong Lu, Ping Luo, Ling Shao
- Abstract summary: We improve the original Pyramid Vision Transformer (PVT v1) with three designs.
PVT v2 reduces the computational complexity of PVT v1 to linear.
It achieves significant improvements on fundamental vision tasks such as classification, detection, and segmentation.
- Score: 112.0139637538858
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers have recently shown encouraging progress in computer vision.
In this work, we present new baselines by improving the original Pyramid Vision
Transformer (PVT v1) by adding three designs, including (1) linear complexity
attention layer, (2) overlapping patch embedding, and (3) convolutional
feed-forward network. With these modifications, PVT v2 reduces the
computational complexity of PVT v1 to linear and achieves significant
improvements on fundamental vision tasks such as classification, detection, and
segmentation. Notably, the proposed PVT v2 achieves comparable or better
performances than recent works such as Swin Transformer. We hope this work will
facilitate state-of-the-art Transformer research in computer vision. Code is
available at https://github.com/whai362/PVT.
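To make the three designs concrete, here is a minimal PyTorch sketch of how each component might look. The module names, channel widths, and the fixed 7x7 pooled key/value grid are illustrative assumptions, not the exact implementation in the linked repository.

```python
# Minimal sketch of the three PVT v2 designs described in the abstract.
# Names and hyperparameters (embed_dim=64, a 7x7 pooled KV grid, etc.) are
# assumptions for illustration, not the official implementation.
import torch
import torch.nn as nn


class OverlappingPatchEmbed(nn.Module):
    """(2) Overlapping patch embedding: a strided conv whose kernel is larger
    than its stride, so neighboring patches share pixels."""
    def __init__(self, in_ch=3, embed_dim=64, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=2 * stride - 1,
                              stride=stride, padding=stride - 1)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                       # x: (B, 3, H, W)
        x = self.proj(x)                        # (B, D, H/4, W/4)
        B, D, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, N, D) token sequence
        return self.norm(tokens), H, W


class LinearSRAttention(nn.Module):
    """(1) Linear-complexity attention: keys/values are average-pooled to a
    fixed spatial size, so cost grows linearly with the number of tokens."""
    def __init__(self, dim=64, num_heads=1, pool_size=7):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(pool_size)   # constant-size KV grid
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, H, W):                 # x: (B, N, D), N = H*W
        B, N, D = x.shape
        kv = x.transpose(1, 2).reshape(B, D, H, W)
        kv = self.pool(kv).flatten(2).transpose(1, 2)  # (B, 49, D)
        out, _ = self.attn(x, kv, kv)
        return out


class ConvFFN(nn.Module):
    """(3) Convolutional feed-forward network: a 3x3 depth-wise conv between
    the two linear layers injects local, position-aware information."""
    def __init__(self, dim=64, hidden=256):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, H, W):                 # x: (B, N, D)
        x = self.fc1(x)
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)
        x = self.dwconv(x).flatten(2).transpose(1, 2)
        return self.fc2(self.act(x))


# Tiny smoke test on a 224x224 image.
tokens, H, W = OverlappingPatchEmbed()(torch.randn(1, 3, 224, 224))
tokens = tokens + LinearSRAttention()(tokens, H, W)
tokens = tokens + ConvFFN()(tokens, H, W)
print(tokens.shape)  # torch.Size([1, 3136, 64])
```

Because the keys and values are pooled to a constant-size grid, attention cost grows with the number of query tokens rather than quadratically, which is where the linear complexity claimed above comes from; the overlapping patch embedding and the depth-wise convolution in the feed-forward network restore local continuity between neighboring tokens.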
Related papers
- Retina Vision Transformer (RetinaViT): Introducing Scaled Patches into Vision Transformers [0.0]
We name this model the Retina Vision Transformer (RetinaViT) because it draws inspiration from the human visual system.
Our experiments show that when trained on the ImageNet-1K dataset with a moderate configuration, RetinaViT achieves a 3.3% performance improvement over the original ViT.
arXiv Detail & Related papers (2024-03-20T15:35:36Z) - PriViT: Vision Transformers for Fast Private Inference [55.36478271911595]
Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models for computer vision applications.
ViTs are ill-suited for private inference using secure multi-party computation protocols, due to the large number of non-polynomial operations.
We propose PriViT, an algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy.
arXiv Detail & Related papers (2023-10-06T21:45:05Z) - Making Vision Transformers Truly Shift-Equivariant [20.61570323513044]
Vision Transformers (ViTs) have become one of the go-to deep net architectures for computer vision.
We introduce novel data-adaptive designs for each of the modules in ViTs, such as tokenization, self-attention, patch merging, and positional encoding.
We evaluate the proposed adaptive models on image classification and semantic segmentation tasks.
arXiv Detail & Related papers (2023-05-25T17:59:40Z) - Coarse-to-Fine Vision Transformer [83.45020063642235]
We propose a coarse-to-fine vision transformer (CF-ViT) to relieve the computational burden while retaining performance.
Our proposed CF-ViT is motivated by two important observations in modern ViT models.
Our CF-ViT reduces the FLOPs of LV-ViT by 53% while also achieving a 2.01x throughput gain.
arXiv Detail & Related papers (2022-03-08T02:57:49Z) - ConvNets vs. Transformers: Whose Visual Representations are More Transferable? [49.62201738334348]
We investigate the transfer learning ability of ConvNets and vision transformers in 15 single-task and multi-task performance evaluations.
We observe consistent advantages of Transformer-based backbones on 13 downstream tasks.
arXiv Detail & Related papers (2021-08-11T16:20:38Z) - Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT achieves 3.8% higher top-1 accuracy than the vanilla ViT.
arXiv Detail & Related papers (2021-08-03T18:04:31Z) - Rethinking the Design Principles of Robust Vision Transformer [28.538786330184642]
Vision Transformers (ViTs) have shown that self-attention-based networks can surpass traditional convolutional neural networks (CNNs) in most vision tasks.
In this paper, we rethink the design principles of ViTs from the perspective of robustness.
By combining these robust design components, we propose the Robust Vision Transformer (RVT).
arXiv Detail & Related papers (2021-05-17T15:04:15Z) - CvT: Introducing Convolutions to Vision Transformers [44.74550305869089]
The Convolutional vision Transformer (CvT) improves the Vision Transformer (ViT) in both performance and efficiency.
The new architecture introduces convolutions into ViT to yield the best of both designs.
arXiv Detail & Related papers (2021-03-29T17:58:22Z) - Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions [103.03973037619532]
This work investigates a simple backbone network useful for many dense prediction tasks without convolutions.
Unlike the recently proposed Transformer model (e.g., ViT) that is specially designed for image classification, we propose the Pyramid Vision Transformer (PVT).
PVT can be trained on dense partitions of the image to achieve high output resolution, which is important for dense prediction.
arXiv Detail & Related papers (2021-02-24T08:33:55Z)
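For intuition about the pyramid structure described in the PVT v1 entry above, the sketch below stacks strided patch embeddings so the token grid progressively shrinks while the channel width grows, producing the multi-scale feature maps that dense-prediction heads expect. The stage widths, depths, and the use of plain Transformer encoder layers are assumptions made for brevity; the actual PVT replaces full attention with spatial-reduction attention to keep the high-resolution stages affordable.

```python
# Illustrative four-stage pyramid backbone in the spirit of PVT v1.
# Widths, depths, and the plain TransformerEncoderLayer are assumptions
# for brevity, not the paper's exact configuration.
import torch
import torch.nn as nn


class PyramidStage(nn.Module):
    """Downsample with a strided conv (patch embedding), then run a few
    Transformer layers on the flattened token grid."""
    def __init__(self, in_ch, out_ch, stride, depth=2, heads=1):
        super().__init__()
        self.embed = nn.Conv2d(in_ch, out_ch, kernel_size=stride, stride=stride)
        layer = nn.TransformerEncoderLayer(out_ch, heads,
                                           dim_feedforward=4 * out_ch,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, x):                       # x: (B, C, H, W)
        x = self.embed(x)                       # shrink the spatial grid
        B, C, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, H*W, C)
        tokens = self.blocks(tokens)
        return tokens.transpose(1, 2).reshape(B, C, H, W)


class PyramidBackbone(nn.Module):
    """Four stages emitting feature maps at 1/4, 1/8, 1/16, and 1/32 of the
    input resolution, the multi-scale outputs a dense-prediction head uses."""
    def __init__(self, widths=(32, 64, 160, 256)):
        super().__init__()
        chans = (3,) + widths
        strides = (4, 2, 2, 2)
        self.stages = nn.ModuleList(
            PyramidStage(chans[i], chans[i + 1], strides[i]) for i in range(4))

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)                     # keep every scale
        return feats


feats = PyramidBackbone()(torch.randn(1, 3, 224, 224))
print([f.shape[-1] for f in feats])             # [56, 28, 14, 7]
```

Returning every scale is what lets such a backbone drop into standard detection and segmentation pipelines that expect a CNN-style feature pyramid.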
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.