PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid
Architecture
- URL: http://arxiv.org/abs/2201.00978v1
- Date: Tue, 4 Jan 2022 04:56:57 GMT
- Title: PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid
Architecture
- Authors: Kai Han, Jianyuan Guo, Yehui Tang, Yunhe Wang
- Abstract summary: The Transformer-in-Transformer (TNT) architecture uses an inner transformer and an outer transformer to extract both local and global representations.
The new "PyramidTNT" significantly improves the original TNT by establishing hierarchical representations.
PyramidTNT achieves better performance than previous state-of-the-art vision transformers such as the Swin Transformer.
- Score: 46.252298619903165
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer networks have achieved great progress for computer vision tasks.
The Transformer-in-Transformer (TNT) architecture uses an inner transformer and an
outer transformer to extract both local and global representations. In this
work, we present new TNT baselines by introducing two advanced designs: 1)
pyramid architecture, and 2) convolutional stem. The new "PyramidTNT"
significantly improves the original TNT by establishing hierarchical
representations. PyramidTNT achieves better performance than previous
state-of-the-art vision transformers such as the Swin Transformer. We hope this new
baseline will be helpful for further research on and applications of vision
transformers. Code will be available at
https://github.com/huawei-noah/CV-Backbones/tree/master/tnt_pytorch.
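The abstract names two design changes (a pyramid architecture and a convolutional stem) on top of TNT's inner/outer transformers. As a rough illustration only, the PyTorch sketch below shows one way these pieces could fit together: a convolutional stem that downsamples the image, and a TNT-style block where an inner transformer attends over pixel-level sub-tokens inside each patch and an outer transformer attends over patch-level tokens. All module names, dimensions, and the token layout are assumptions made for this sketch, not the authors' implementation.

```python
# Minimal sketch of a convolutional stem and a TNT-style block (assumptions only,
# not the official PyramidTNT code). Requires a recent PyTorch (>= 1.10).
import torch
import torch.nn as nn


class ConvStem(nn.Module):
    """Convolutional stem: two strided convs that downsample the image 4x."""
    def __init__(self, in_chans=3, embed_dim=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_chans, embed_dim // 2, 3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim // 2),
            nn.GELU(),
            nn.Conv2d(embed_dim // 2, embed_dim, 3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim),
        )

    def forward(self, x):            # (B, 3, H, W)
        return self.stem(x)          # (B, embed_dim, H/4, W/4)


class TNTBlock(nn.Module):
    """Inner transformer over sub-tokens, outer transformer over patch tokens."""
    def __init__(self, outer_dim=64, inner_dim=16, num_sub=16, heads=4):
        super().__init__()
        self.inner = nn.TransformerEncoderLayer(
            inner_dim, heads, inner_dim * 4, batch_first=True, norm_first=True)
        self.outer = nn.TransformerEncoderLayer(
            outer_dim, heads, outer_dim * 4, batch_first=True, norm_first=True)
        # Project the flattened inner (pixel-level) tokens onto the outer token.
        self.proj = nn.Linear(num_sub * inner_dim, outer_dim)

    def forward(self, outer_tokens, inner_tokens):
        # outer_tokens: (B, N, outer_dim); inner_tokens: (B * N, num_sub, inner_dim)
        B, N, _ = outer_tokens.shape
        inner_tokens = self.inner(inner_tokens)
        fused = self.proj(inner_tokens.reshape(B, N, -1))
        outer_tokens = self.outer(outer_tokens + fused)
        return outer_tokens, inner_tokens


stem, block = ConvStem(), TNTBlock()
feat = stem(torch.randn(2, 3, 224, 224))       # (2, 64, 56, 56)
# Pretend 14x14 outer patches, each holding 16 inner sub-tokens of width 16.
outer = torch.randn(2, 14 * 14, 64)
inner = torch.randn(2 * 14 * 14, 16, 16)
outer, inner = block(outer, inner)
# A pyramid would stack several such stages, downsampling tokens between them
# (e.g. with a strided conv) so channel width grows as spatial resolution shrinks.
```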
Related papers
- Efficient Visual Transformer by Learnable Token Merging [8.905020033545643]
We propose a novel transformer block, Transformer with Learnable Token Merging (LTM), or LTM-Transformer.
LTM-Transformer is compatible with many popular and compact transformer networks.
It yields compact and efficient visual transformers with prediction accuracy comparable to or much better than that of the original visual transformers.
arXiv Detail & Related papers (2024-07-21T17:09:19Z)
- TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer [188.00681648113223]
We explore neat yet effective Transformer-based frameworks for visual grounding.
TransVG establishes multi-modal correspondences by Transformers and localizes referred regions by directly regressing box coordinates.
We upgrade our framework to a purely Transformer-based one by leveraging Vision Transformer (ViT) for vision feature encoding.
arXiv Detail & Related papers (2022-06-14T06:27:38Z)
- Aggregated Pyramid Vision Transformer: Split-transform-merge Strategy for Image Recognition without Convolutions [1.1032962642000486]
This work builds on the Vision Transformer, combines it with a pyramid architecture, and uses a split-transform-merge strategy to propose a group encoder, naming the network architecture Aggregated Pyramid Vision Transformer (APVT).
We perform image classification tasks on the CIFAR-10 dataset and object detection tasks on the COCO 2017 dataset.
arXiv Detail & Related papers (2022-03-02T09:14:28Z)
- PVT v2: Improved Baselines with Pyramid Vision Transformer [112.0139637538858]
We improve the original Pyramid Vision Transformer (PVT v1).
PVT v2 reduces the computational complexity of PVT v1 to linear.
It achieves significant improvements on fundamental vision tasks such as classification, detection, and segmentation.
arXiv Detail & Related papers (2021-06-25T17:51:09Z)
- TransCenter: Transformers with Dense Queries for Multiple-Object Tracking [87.75122600164167]
We argue that the standard representation -- bounding boxes -- is not adapted to learning transformers for multiple-object tracking.
We propose TransCenter, the first transformer-based architecture for tracking the centers of multiple targets.
arXiv Detail & Related papers (2021-03-28T14:49:36Z)
- Transformer in Transformer [59.066686278998354]
We propose a novel Transformer-iN-Transformer (TNT) model for modeling both patch-level and pixel-level representations.
Our TNT achieves 81.3% top-1 accuracy on ImageNet, which is 1.5% higher than that of DeiT with a similar computational cost.
arXiv Detail & Related papers (2021-02-27T03:12:16Z)
- Transformer for Image Quality Assessment [14.975436239088312]
We propose an architecture that applies a shallow Transformer encoder on top of a feature map extracted by a convolutional neural network (CNN).
Adaptive positional embedding is employed in the Transformer encoder to handle images with arbitrary resolutions.
We have found that the proposed TRIQ architecture achieves outstanding performance.
arXiv Detail & Related papers (2020-12-30T18:43:11Z)
- A Survey on Visual Transformer [126.56860258176324]
The Transformer is a type of deep neural network mainly based on the self-attention mechanism.
In this paper, we review these vision transformer models by categorizing them in different tasks and analyzing their advantages and disadvantages.
arXiv Detail & Related papers (2020-12-23T09:37:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.