PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid
Architecture
- URL: http://arxiv.org/abs/2201.00978v1
- Date: Tue, 4 Jan 2022 04:56:57 GMT
- Title: PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid
Architecture
- Authors: Kai Han, Jianyuan Guo, Yehui Tang, Yunhe Wang
- Abstract summary: The Transformer-in-Transformer (TNT) architecture uses an inner transformer and an outer transformer to extract both local and global representations.
The new "PyramidTNT" significantly improves the original TNT by establishing hierarchical representations.
PyramidTNT achieves better performance than previous state-of-the-art vision transformers such as the Swin Transformer.
- Score: 46.252298619903165
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer networks have achieved great progress for computer vision tasks.
The Transformer-in-Transformer (TNT) architecture uses an inner transformer and an
outer transformer to extract both local and global representations. In this
work, we present new TNT baselines by introducing two advanced designs: 1)
pyramid architecture, and 2) convolutional stem. The new "PyramidTNT"
significantly improves the original TNT by establishing hierarchical
representations. PyramidTNT achieves better performance than previous
state-of-the-art vision transformers such as the Swin Transformer. We hope this new
baseline will be helpful for further research on and applications of vision
transformers. Code will be available at
https://github.com/huawei-noah/CV-Backbones/tree/master/tnt_pytorch.
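The abstract names two design changes (a pyramid architecture and a convolutional stem) on top of TNT's inner/outer transformers. As a rough illustration only, the PyTorch sketch below shows one way these pieces could fit together: a convolutional stem that downsamples the image, and a TNT-style block where an inner transformer attends over pixel-level sub-tokens inside each patch and an outer transformer attends over patch-level tokens. All module names, dimensions, and the token layout are assumptions made for this sketch, not the authors' implementation.

```python
# Minimal sketch of a convolutional stem and a TNT-style block (assumptions only,
# not the official PyramidTNT code). Requires a recent PyTorch (>= 1.10).
import torch
import torch.nn as nn


class ConvStem(nn.Module):
    """Convolutional stem: two strided convs that downsample the image 4x."""
    def __init__(self, in_chans=3, embed_dim=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_chans, embed_dim // 2, 3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim // 2),
            nn.GELU(),
            nn.Conv2d(embed_dim // 2, embed_dim, 3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim),
        )

    def forward(self, x):            # (B, 3, H, W)
        return self.stem(x)          # (B, embed_dim, H/4, W/4)


class TNTBlock(nn.Module):
    """Inner transformer over sub-tokens, outer transformer over patch tokens."""
    def __init__(self, outer_dim=64, inner_dim=16, num_sub=16, heads=4):
        super().__init__()
        self.inner = nn.TransformerEncoderLayer(
            inner_dim, heads, inner_dim * 4, batch_first=True, norm_first=True)
        self.outer = nn.TransformerEncoderLayer(
            outer_dim, heads, outer_dim * 4, batch_first=True, norm_first=True)
        # Project the flattened inner (pixel-level) tokens onto the outer token.
        self.proj = nn.Linear(num_sub * inner_dim, outer_dim)

    def forward(self, outer_tokens, inner_tokens):
        # outer_tokens: (B, N, outer_dim); inner_tokens: (B * N, num_sub, inner_dim)
        B, N, _ = outer_tokens.shape
        inner_tokens = self.inner(inner_tokens)
        fused = self.proj(inner_tokens.reshape(B, N, -1))
        outer_tokens = self.outer(outer_tokens + fused)
        return outer_tokens, inner_tokens


stem, block = ConvStem(), TNTBlock()
feat = stem(torch.randn(2, 3, 224, 224))       # (2, 64, 56, 56)
# Pretend 14x14 outer patches, each holding 16 inner sub-tokens of width 16.
outer = torch.randn(2, 14 * 14, 64)
inner = torch.randn(2 * 14 * 14, 16, 16)
outer, inner = block(outer, inner)
# A pyramid would stack several such stages, downsampling tokens between them
# (e.g. with a strided conv) so channel width grows as spatial resolution shrinks.
```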
Related papers
- Efficient Visual Transformer by Learnable Token Merging [8.905020033545643]
We propose a novel transformer block, Transformer with Learnable Token Merging (LTM), or LTM-Transformer.
LTM-Transformer is compatible with many popular and compact transformer networks.
It yields compact and efficient visual transformers with prediction accuracy comparable to or much better than that of the original visual transformers.
arXiv Detail & Related papers (2024-07-21T17:09:19Z)
- TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer [188.00681648113223]
We explore neat yet effective Transformer-based frameworks for visual grounding.
TransVG establishes multi-modal correspondences by Transformers and localizes referred regions by directly regressing box coordinates.
We upgrade our framework to a purely Transformer-based one by leveraging Vision Transformer (ViT) for vision feature encoding.
arXiv Detail & Related papers (2022-06-14T06:27:38Z)
- Aggregated Pyramid Vision Transformer: Split-transform-merge Strategy for Image Recognition without Convolutions [1.1032962642000486]
This work builds on the Vision Transformer, combines it with a pyramid architecture, and uses a split-transform-merge strategy to propose a group encoder, naming the network architecture Aggregated Pyramid Vision Transformer (APVT).
We perform image classification tasks on the CIFAR-10 dataset and object detection tasks on the COCO 2017 dataset.
arXiv Detail & Related papers (2022-03-02T09:14:28Z)
- PVT v2: Improved Baselines with Pyramid Vision Transformer [112.0139637538858]
We improve the original Pyramid Vision Transformer (PVT v1).
PVT v2 reduces the computational complexity of PVT v1 to linear.
It achieves significant improvements on fundamental vision tasks such as classification, detection, and segmentation.
arXiv Detail & Related papers (2021-06-25T17:51:09Z)
- TransCenter: Transformers with Dense Queries for Multiple-Object Tracking [87.75122600164167]
We argue that the standard representation -- bounding boxes -- is not adapted to learning transformers for multiple-object tracking.
We propose TransCenter, the first transformer-based architecture for tracking the centers of multiple targets.
arXiv Detail & Related papers (2021-03-28T14:49:36Z)
- Transformer in Transformer [59.066686278998354]
We propose a novel Transformer-iN-Transformer (TNT) model for modeling both patch-level and pixel-level representations.
Our TNT achieves 81.3% top-1 accuracy on ImageNet, which is 1.5% higher than that of DeiT with a similar computational cost.
arXiv Detail & Related papers (2021-02-27T03:12:16Z)
- Transformer for Image Quality Assessment [14.975436239088312]
We propose an architecture that applies a shallow Transformer encoder on top of a feature map extracted by a convolutional neural network (CNN).
Adaptive positional embedding is employed in the Transformer encoder to handle images with arbitrary resolutions.
We have found that the proposed TRIQ architecture achieves outstanding performance.
arXiv Detail & Related papers (2020-12-30T18:43:11Z)
- A Survey on Visual Transformer [126.56860258176324]
The Transformer is a type of deep neural network mainly based on the self-attention mechanism.
In this paper, we review these vision transformer models by categorizing them in different tasks and analyzing their advantages and disadvantages.
arXiv Detail & Related papers (2020-12-23T09:37:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.