P2T: Pyramid Pooling Transformer for Scene Understanding
- URL: http://arxiv.org/abs/2106.12011v1
- Date: Tue, 22 Jun 2021 18:28:52 GMT
- Title: P2T: Pyramid Pooling Transformer for Scene Understanding
- Authors: Yu-Huan Wu, Yun Liu, Xin Zhan, Ming-Ming Cheng
- Abstract summary: Plugged with our pooling-based MHSA, we build a downstream-task-oriented transformer network, dubbed Pyramid Pooling Transformer (P2T)
- Score: 62.41912463252468
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper jointly resolves two problems in vision transformers: i) the
computation of Multi-Head Self-Attention (MHSA) has high computational/space
complexity; ii) recent vision transformer networks are overly tuned for image
classification, ignoring the difference between image classification (simple
scenarios, more similar to NLP) and downstream scene understanding tasks
(complicated scenarios, rich structural and contextual information). To this
end, we note that pyramid pooling has been demonstrated to be effective in
various vision tasks owing to its powerful context abstraction, and its natural
property of spatial invariance is suitable to address the loss of structural
information (problem ii)). Hence, we propose to adapt pyramid pooling to MHSA
for alleviating its high requirement on computational resources (problem i)).
In this way, this pooling-based MHSA can well address the above two problems
and is thus flexible and powerful for downstream scene understanding tasks.
Plugged with our pooling-based MHSA, we build a downstream-task-oriented
transformer network, dubbed Pyramid Pooling Transformer (P2T). Extensive
experiments demonstrate that, when P2T is applied as the backbone network, it
shows substantial superiority in various downstream scene understanding tasks
such as semantic segmentation, object detection, instance segmentation, and
visual saliency detection, compared to previous CNN- and transformer-based
networks. The code will be released at https://github.com/yuhuan-wu/P2T. Note
that this technical report will be updated continually.
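To make the pooling-based MHSA idea concrete, below is a minimal PyTorch sketch of one way it can be realized: queries come from the full token sequence, while keys and values are computed from a pyramid of average-pooled token maps, so attention runs over a much shorter sequence. The pooling ratios, head count, and layer shapes are illustrative assumptions, not the authors' exact configuration (the official implementation at https://github.com/yuhuan-wu/P2T includes further refinements).

```python
# Minimal sketch of a pooling-based multi-head self-attention layer.
# Pool ratios and layer sizes are illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoolingMHSA(nn.Module):
    def __init__(self, dim, num_heads=2, pool_ratios=(12, 16, 20, 24)):
        super().__init__()
        self.num_heads = num_heads
        self.pool_ratios = pool_ratios
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        # x: (B, N, C) token sequence with N = H * W
        B, N, C = x.shape
        q = self.q(x)

        # Build a pyramid of average-pooled token maps; their concatenation is a
        # much shorter sequence than the original N tokens.
        feat = x.transpose(1, 2).reshape(B, C, H, W)
        pooled = []
        for r in self.pool_ratios:
            p = F.adaptive_avg_pool2d(feat, (max(H // r, 1), max(W // r, 1)))
            pooled.append(p.flatten(2))                     # (B, C, h_r * w_r)
        pooled = torch.cat(pooled, dim=2).transpose(1, 2)   # (B, M, C) with M << N

        k, v = self.kv(pooled).chunk(2, dim=-1)

        # Split heads: (B, heads, tokens, C // heads)
        def split(t):
            return t.reshape(B, -1, self.num_heads, C // self.num_heads).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)

        # Attention over the pooled sequence: cost O(N * M) instead of O(N^2).
        attn = (q @ k.transpose(-2, -1)) * (C // self.num_heads) ** -0.5
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```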
Related papers
- Adaptive Step-size Perception Unfolding Network with Non-local Hybrid Attention for Hyperspectral Image Reconstruction [0.39134031118910273]
We propose an adaptive step-size perception unfolding network (ASPUN), a deep unfolding network based on the FISTA algorithm.
In addition, we design a Non-local Hybrid Attention Transformer (NHAT) module to fully leverage the receptive-field advantage of the transformer.
Experimental results show that our ASPUN is superior to the existing SOTA algorithms and achieves the best performance.
arXiv Detail & Related papers (2024-07-04T16:09:52Z) - CT-MVSNet: Efficient Multi-View Stereo with Cross-scale Transformer [8.962657021133925]
Cross-scale transformer (CT) processes feature representations at different stages without additional computation.
We introduce an adaptive matching-aware transformer (AMT) that employs different interactive attention combinations at multiple scales.
We also present a dual-feature guided aggregation (DFGA) that embeds the coarse global semantic information into the finer cost volume construction.
arXiv Detail & Related papers (2023-12-14T01:33:18Z) - Dual-Tasks Siamese Transformer Framework for Building Damage Assessment [11.888964682446879]
We present the first attempt at designing a Transformer-based damage assessment architecture (DamFormer).
To the best of our knowledge, it is the first time that such a deep Transformer-based network is proposed for multitemporal remote sensing interpretation tasks.
arXiv Detail & Related papers (2022-01-26T14:11:16Z) - Augmenting Convolutional networks with attention-based aggregation [55.97184767391253]
We show how to augment any convolutional network with an attention-based global map to achieve non-local reasoning.
We plug this learned aggregation layer into a simple patch-based convolutional network parametrized by two parameters (width and depth).
It yields surprisingly competitive trade-offs between accuracy and complexity, in particular in terms of memory consumption.
arXiv Detail & Related papers (2021-12-27T14:05:41Z) - PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered the solution of vision tasks with transformers; it directly translates the image feature map into the object detection result.
The recent transformer-based image recognition model ViT also shows a consistent efficiency gain.
arXiv Detail & Related papers (2021-09-15T01:10:30Z) - Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
Less attention vIsion Transformer builds upon the fact that convolutions, fully-connected layers, and self-attentions have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z) - Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper which applies transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z) - Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length (a minimal sketch of this token-pooling idea follows this list).
It brings a great benefit by scaling dimensions of depth/width/resolution/patch size without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
arXiv Detail & Related papers (2021-03-19T03:55:58Z) - Toward Transformer-Based Object Detection [12.704056181392415]
Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results.
ViT-FRCNN demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance.
We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
arXiv Detail & Related papers (2020-12-17T22:33:14Z)
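As referenced in the Hierarchical Visual Transformer entry above, the sketch below illustrates the general idea of progressively pooling visual tokens between transformer stages so that later stages attend over a shorter sequence. The stage depths, width, and pooling operator here are illustrative assumptions, not HVT's actual configuration.

```python
# Minimal sketch of hierarchical token pooling between transformer stages:
# the token sequence is downsampled (here with 1D max pooling) after each stage,
# shrinking the attention cost of later stages. All hyperparameters are illustrative.
import torch
import torch.nn as nn

class PooledTransformer(nn.Module):
    def __init__(self, dim=192, heads=3, depths=(2, 2, 2), pool_stride=2):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True),
                num_layers=d)
            for d in depths
        ])
        # Downsample tokens between stages: N -> N / pool_stride.
        self.pool = nn.MaxPool1d(kernel_size=pool_stride, stride=pool_stride)

    def forward(self, tokens):                  # tokens: (B, N, C)
        for i, stage in enumerate(self.stages):
            tokens = stage(tokens)
            if i < len(self.stages) - 1:        # no pooling after the last stage
                tokens = self.pool(tokens.transpose(1, 2)).transpose(1, 2)
        return tokens                           # shorter sequence, same channel dim

x = torch.randn(2, 196, 192)                    # e.g. 14x14 patches, 192-dim embeddings
print(PooledTransformer()(x).shape)             # torch.Size([2, 49, 192])
```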
This list is automatically generated from the titles and abstracts of the papers on this site.