Swin-Pose: Swin Transformer Based Human Pose Estimation
- URL: http://arxiv.org/abs/2201.07384v1
- Date: Wed, 19 Jan 2022 02:15:26 GMT
- Title: Swin-Pose: Swin Transformer Based Human Pose Estimation
- Authors: Zinan Xiong, Chenxi Wang, Ying Li, Yan Luo, Yu Cao
- Abstract summary: Convolutional neural networks (CNNs) have been widely utilized in many computer vision tasks.
CNNs have a fixed receptive field and lack long-range perception, which is crucial to human pose estimation.
We propose a novel model based on transformer architecture, enhanced with a feature pyramid fusion structure.
- Score: 16.247836509380026
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolutional neural networks (CNNs) have been widely utilized in many
computer vision tasks. However, CNNs have a fixed receptive field and lack
long-range perception, which is crucial to human pose estimation. Due to its
capability to capture long-range dependencies between pixels, the transformer
architecture has recently been adopted for computer vision applications and has
proven to be highly effective. We are interested in exploring its capability in
human pose estimation, and thus propose a novel model based on the transformer
architecture, enhanced with a feature pyramid fusion structure. More
specifically, we use a pre-trained Swin Transformer as our backbone to extract
features from input images, and we leverage a feature pyramid structure to
extract feature maps from different stages. By fusing these features together,
our model predicts the keypoint heatmap. The experimental results of our study
demonstrate that the proposed transformer-based model achieves better
performance than state-of-the-art CNN-based models.
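As a rough illustration of the pipeline described in the abstract, the sketch below shows one way the feature pyramid fusion and heatmap head could be wired up in PyTorch. It is a minimal sketch under stated assumptions, not the authors' implementation: the backbone is assumed to expose its four stage outputs at strides 4/8/16/32 with the Swin-T channel widths (96/192/384/768), the number of keypoints is assumed to be 17 (COCO), and the fusion is a simple upsample-and-concatenate.

```python
# Minimal sketch of the Swin-Pose idea from the abstract (not the authors'
# implementation). Assumes a Swin-style backbone that returns the four stage
# feature maps; channel widths follow Swin-T (96/192/384/768).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePyramidFusionHead(nn.Module):
    """Fuses multi-stage backbone features and predicts keypoint heatmaps."""

    def __init__(self, in_channels=(96, 192, 384, 768), mid_channels=128, num_keypoints=17):
        super().__init__()
        # 1x1 convs project every stage to a common width before fusion.
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, mid_channels, kernel_size=1) for c in in_channels
        )
        self.head = nn.Sequential(
            nn.Conv2d(mid_channels * len(in_channels), mid_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, num_keypoints, kernel_size=1),
        )

    def forward(self, features):
        # `features` is a list of maps at strides 4/8/16/32; upsample them all
        # to the highest resolution and concatenate channel-wise.
        target_size = features[0].shape[-2:]
        fused = torch.cat(
            [
                F.interpolate(lat(f), size=target_size, mode="bilinear", align_corners=False)
                for lat, f in zip(self.lateral, features)
            ],
            dim=1,
        )
        return self.head(fused)  # (B, num_keypoints, H/4, W/4) heatmaps

if __name__ == "__main__":
    # Random stand-ins for Swin stage outputs of a 256x192 crop (strides 4/8/16/32).
    feats = [torch.randn(1, c, 64 // 2**i, 48 // 2**i) for i, c in enumerate((96, 192, 384, 768))]
    print(FeaturePyramidFusionHead()(feats).shape)  # torch.Size([1, 17, 64, 48])
```

Plugging in an actual pre-trained Swin backbone that returns per-stage feature maps would replace the random stand-in features used in the example above.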
Related papers
- TransDeepLab: Convolution-Free Transformer-based DeepLab v3+ for Medical
Image Segmentation [11.190117191084175]
This paper proposes TransDeepLab, a novel DeepLab-like pure Transformer for medical image segmentation.
We exploit the hierarchical Swin Transformer with shifted windows to extend DeepLabv3+ and to model the Atrous Spatial Pyramid Pooling (ASPP) module.
Our approach performs better than, or on par with, most contemporary works spanning both Vision Transformer and CNN-based methods.
arXiv Detail & Related papers (2022-08-01T09:53:53Z) - BTranspose: Bottleneck Transformers for Human Pose Estimation with
Self-Supervised Pre-Training [0.304585143845864]
In this paper, we consider the recently proposed Bottleneck Transformers, which combine CNN and multi-head self-attention (MHSA) layers effectively.
We consider different backbone architectures and pre-train them using the DINO self-supervised learning method.
Experiments show that our model achieves an AP of 76.4, which is competitive with other methods such as [1] and has fewer network parameters.
arXiv Detail & Related papers (2022-04-21T15:45:05Z) - Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method-Vision Transformer with Convolutions Architecture Search (VTCAS)
The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in the low illumination indoor scene.
arXiv Detail & Related papers (2022-03-20T02:59:51Z) - PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered solving vision tasks with transformers; it directly translates the image feature map into the object detection result.
It also shows a consistent efficiency gain on the recent transformer-based image recognition model ViT.
arXiv Detail & Related papers (2021-09-15T01:10:30Z) - FoveaTer: Foveated Transformer for Image Classification [8.207403859762044]
We propose the Foveated Transformer (FoveaTer) model, which uses pooling regions and saccadic movements to perform object classification tasks.
We construct an ensemble model using our proposed model and unfoveated model, achieving an accuracy 1.36% below the unfoveated model with 22% computational savings.
arXiv Detail & Related papers (2021-05-29T01:54:33Z) - Rethinking Spatial Dimensions of Vision Transformers [34.13899937264952]
Vision Transformer (ViT) extends the application range of transformers from language processing to computer vision tasks.
We investigate the role of the spatial dimension conversion and its effectiveness on the transformer-based architecture.
We propose a novel Pooling-based Vision Transformer (PiT) built upon the original ViT model.
arXiv Detail & Related papers (2021-03-30T12:51:28Z) - Vision Transformers for Dense Prediction [77.34726150561087]
We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks.
Our experiments show that this architecture yields substantial improvements on dense prediction tasks.
arXiv Detail & Related papers (2021-03-24T18:01:17Z) - End-to-End Trainable Multi-Instance Pose Estimation with Transformers [68.93512627479197]
We propose a new end-to-end trainable approach for multi-instance pose estimation by combining a convolutional neural network with a transformer.
Inspired by recent work on end-to-end trainable object detection with transformers, we use a transformer encoder-decoder architecture together with a bipartite matching scheme to directly regress the pose of all individuals in a given image (a minimal sketch of the matching step appears after this list).
Our model, called POse Estimation Transformer (POET), is trained using a novel set-based global loss that consists of a keypoint loss, a keypoint visibility loss, a center loss and a class loss.
arXiv Detail & Related papers (2021-03-22T18:19:22Z) - Transformers Solve the Limited Receptive Field for Monocular Depth
Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z) - 3D Human Pose Estimation with Spatial and Temporal Transformers [59.433208652418976]
We present PoseFormer, a purely transformer-based approach for 3D human pose estimation in videos.
Inspired by recent developments in vision transformers, we design a spatial-temporal transformer structure.
We quantitatively and qualitatively evaluate our method on two popular and standard benchmark datasets.
arXiv Detail & Related papers (2021-03-18T18:14:37Z)
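For the POET entry above, the bipartite matching between predicted and ground-truth poses can be illustrated with a DETR-style Hungarian assignment. The sketch below is hypothetical and simplified, not the POET code: the matching cost is reduced to an L1 keypoint distance plus a class-probability term, whereas the paper's set-based loss also includes keypoint visibility and center terms.

```python
# Hypothetical sketch of DETR-style bipartite matching between predicted and
# ground-truth poses (not the POET implementation); the matching cost is
# simplified to an L1 keypoint distance plus a "person" class probability term.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_poses(pred_kpts, pred_person_prob, gt_kpts, lambda_cls=1.0):
    """Return the optimal one-to-one assignment between queries and people.

    pred_kpts: (Q, K, 2) keypoints predicted by Q queries
    pred_person_prob: (Q,) probability that each query is a person
    gt_kpts: (G, K, 2) ground-truth keypoints for G people
    """
    # Pairwise L1 distance between every query pose and every ground-truth pose.
    kpt_cost = np.abs(pred_kpts[:, None] - gt_kpts[None]).sum(axis=(2, 3))  # (Q, G)
    # Favour matching queries that are confident they contain a person.
    cls_cost = -lambda_cls * pred_person_prob[:, None]  # broadcasts to (Q, G)
    query_idx, gt_idx = linear_sum_assignment(kpt_cost + cls_cost)  # Hungarian algorithm
    return query_idx, gt_idx

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pred = rng.random((5, 17, 2))   # 5 queries, 17 keypoints each
    prob = rng.random(5)
    gt = rng.random((2, 17, 2))     # 2 people in the image
    print(match_poses(pred, prob, gt))  # matched (query indices, ground-truth indices)
```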
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.