HRFormer: High-Resolution Transformer for Dense Prediction
- URL: http://arxiv.org/abs/2110.09408v2
- Date: Thu, 21 Oct 2021 05:53:13 GMT
- Title: HRFormer: High-Resolution Transformer for Dense Prediction
- Authors: Yuhui Yuan, Rao Fu, Lang Huang, Weihong Lin, Chao Zhang, Xilin Chen,
Jingdong Wang
- Abstract summary: We present a High-Resolution Transformer (HRFormer) that learns high-resolution representations for dense prediction tasks.
We take advantage of the multi-resolution parallel design introduced in high-resolution convolutional networks (HRNet).
We demonstrate the effectiveness of the High-Resolution Transformer on both human pose estimation and semantic segmentation tasks.
- Score: 99.6060997466614
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a High-Resolution Transformer (HRFormer) that learns
high-resolution representations for dense prediction tasks, in contrast to the
original Vision Transformer that produces low-resolution representations and
has high memory and computational cost. We take advantage of the
multi-resolution parallel design introduced in high-resolution convolutional
networks (HRNet), along with local-window self-attention that performs
self-attention over small non-overlapping image windows, to improve the
memory and computation efficiency. In addition, we introduce a convolution into
the FFN to exchange information across the disconnected image windows. We
demonstrate the effectiveness of the High-Resolution Transformer on both human
pose estimation and semantic segmentation tasks, e.g., HRFormer outperforms
the Swin Transformer by $1.3$ AP on COCO pose estimation with $50\%$ fewer
parameters and $30\%$ fewer FLOPs. Code is available at:
https://github.com/HRNet/HRFormer.
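The two mechanisms in the abstract fit together naturally: restricting self-attention to $s \times s$ windows makes its cost roughly linear in the number of pixels rather than quadratic over all $H \times W$ tokens, and the convolution inside the FFN restores communication across the window boundaries that the attention step cannot see. The following is a minimal PyTorch sketch of one such block. It is an illustration only: the class names (HRFormerBlock, ConvFFN), the layer choices, and the assumption that the window size evenly divides the spatial dimensions are ours, not taken from the official repository linked above.
```python
import torch
import torch.nn as nn

class ConvFFN(nn.Module):
    """FFN with a 3x3 depth-wise convolution between two point-wise (1x1)
    projections, so information can cross window boundaries."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Conv2d(dim, hidden_dim, kernel_size=1)
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3,
                                padding=1, groups=hidden_dim)  # depth-wise
        self.fc2 = nn.Conv2d(hidden_dim, dim, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, x):  # x: (B, C, H, W)
        return self.fc2(self.act(self.dwconv(self.act(self.fc1(x)))))

class HRFormerBlock(nn.Module):
    """One illustrative block: window-local self-attention + conv-FFN."""
    def __init__(self, dim, num_heads, window_size=7):
        super().__init__()
        self.window_size = window_size
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = ConvFFN(dim, hidden_dim=4 * dim)

    def forward(self, x):  # x: (B, C, H, W); H, W divisible by window_size
        B, C, H, W = x.shape
        s = self.window_size
        # Partition into non-overlapping s x s windows, each flattened to a
        # token sequence: (B * num_windows, s*s, C).
        w = x.view(B, C, H // s, s, W // s, s)
        w = w.permute(0, 2, 4, 3, 5, 1).reshape(-1, s * s, C)
        # Self-attention runs inside each window only, so cost scales with
        # H*W rather than (H*W)^2 as in global attention.
        h = self.norm1(w)
        h, _ = self.attn(h, h, h, need_weights=False)
        w = w + h
        # Undo the window partition back to (B, C, H, W).
        w = w.view(B, H // s, W // s, s, s, C)
        x = w.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
        # The depth-wise conv in the FFN exchanges information across the
        # otherwise disconnected windows.
        y = self.norm2(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        return x + self.ffn(y)

# Example: a 28x28 feature map with 7x7 windows (28 is divisible by 7).
block = HRFormerBlock(dim=64, num_heads=2, window_size=7)
out = block(torch.randn(1, 64, 28, 28))  # -> torch.Size([1, 64, 28, 28])
```
In this sketch the depth-wise convolution is what lets tokens in one window reach their neighbours in adjacent windows, which is the role the abstract assigns to the convolution introduced into the FFN.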
Related papers
- Pruning By Explaining Revisited: Optimizing Attribution Methods to Prune CNNs and Transformers [14.756988176469365]
An effective approach to reducing computational requirements and increasing efficiency is to prune unnecessary components of deep neural networks.
Previous work has shown that attribution methods from the field of eXplainable AI serve as effective means to extract and prune the least relevant network components in a few-shot fashion.
arXiv Detail & Related papers (2024-08-22T17:35:18Z) - PTSR: Patch Translator for Image Super-Resolution [16.243363392717434]
We propose a patch translator for image super-resolution (PTSR).
The proposed PTSR is a transformer-based GAN network with no convolution operation.
We introduce a novel patch translator module for regenerating the improved patches utilising multi-head attention.
arXiv Detail & Related papers (2023-10-20T01:45:00Z) - Vicinity Vision Transformer [53.43198716947792]
We present a Vicinity Attention that introduces a locality bias to vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
arXiv Detail & Related papers (2022-06-21T17:33:53Z) - Towards Lightweight Transformer via Group-wise Transformation for
Vision-and-Language Tasks [126.33843752332139]
We introduce Group-wise Transformation towards a universal yet lightweight Transformer for vision-and-language tasks, termed LW-Transformer.
We apply LW-Transformer to a set of Transformer-based networks, and quantitatively measure them on three vision-and-language tasks and six benchmark datasets.
Experimental results show that, while saving a large number of parameters and computations, LW-Transformer achieves highly competitive performance against the original Transformer networks on vision-and-language tasks.
arXiv Detail & Related papers (2022-04-16T11:30:26Z) - AdaViT: Adaptive Vision Transformers for Efficient Image Recognition [78.07924262215181]
We introduce AdaViT, an adaptive framework that learns to derive usage policies on which patches, self-attention heads and transformer blocks to use.
Our method obtains more than a 2x improvement in efficiency compared to state-of-the-art vision transformers, with only a 0.8% drop in accuracy.
arXiv Detail & Related papers (2021-11-30T18:57:02Z) - Restormer: Efficient Transformer for High-Resolution Image Restoration [118.9617735769827]
Convolutional neural networks (CNNs) perform well at learning generalizable image priors from large-scale data.
Transformers have shown significant performance gains on natural language and high-level vision tasks.
Our model, named Restoration Transformer (Restormer), achieves state-of-the-art results on several image restoration tasks.
arXiv Detail & Related papers (2021-11-18T18:59:10Z) - Improved Transformer for High-Resolution GANs [69.42469272015481]
We introduce two key ingredients into the Transformer to address the challenge of high-resolution image synthesis.
We show in the experiments that the proposed HiT achieves state-of-the-art FID scores of 31.87 and 2.95 on unconditional ImageNet $128 \times 128$ and FFHQ $256 \times 256$, respectively.
arXiv Detail & Related papers (2021-06-14T17:39:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.