ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
- URL: http://arxiv.org/abs/2204.12484v1
- Date: Tue, 26 Apr 2022 17:55:04 GMT
- Title: ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
- Authors: Yufei Xu, Jing Zhang, Qiming Zhang, Dacheng Tao
- Abstract summary: We show that a plain vision transformer with MAE pretraining can obtain superior performance after finetuning on human pose estimation datasets.
Our biggest ViTPose model based on the ViTAE-G backbone with 1 billion parameters obtains the best 80.9 mAP on the MS COCO test-dev set.
- Score: 76.35955924137986
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, customized vision transformers have been adapted for human pose
estimation and have achieved superior performance with elaborate structures.
However, it is still unclear whether plain vision transformers can facilitate
pose estimation. In this paper, we take the first step toward answering the
question by employing a plain and non-hierarchical vision transformer together
with simple deconvolution decoders termed ViTPose for human pose estimation. We
demonstrate that a plain vision transformer with MAE pretraining can obtain
superior performance after finetuning on human pose estimation datasets.
ViTPose has good scalability with respect to model size and flexibility
regarding input resolution and token number. Moreover, it can be easily
pretrained using the unlabeled pose data without the need for large-scale
upstream ImageNet data. Our biggest ViTPose model based on the ViTAE-G backbone
with 1 billion parameters obtains the best 80.9 mAP on the MS COCO test-dev
set, while the ensemble models further set a new state-of-the-art for human
pose estimation, i.e., 81.1 mAP. The source code and models will be released at
https://github.com/ViTAE-Transformer/ViTPose.
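The architecture described in the abstract is easy to picture in code. Below is a minimal, unofficial PyTorch sketch of the idea: a plain, non-hierarchical ViT encoder followed by a simple deconvolution decoder that outputs one heatmap per keypoint. The layer sizes, depth, input resolution, and 17 COCO keypoints are illustrative assumptions rather than the paper's exact configuration, and MAE pretraining is not shown.
```python
# Unofficial sketch of the ViTPose idea: plain ViT encoder + deconv heatmap head.
# Hyperparameters are illustrative assumptions, not the paper's configuration.
import torch
import torch.nn as nn


class PlainViTEncoder(nn.Module):
    def __init__(self, img_size=(256, 192), patch_size=16, embed_dim=384,
                 depth=6, num_heads=6):
        super().__init__()
        self.grid = (img_size[0] // patch_size, img_size[1] // patch_size)
        self.patch_embed = nn.Conv2d(3, embed_dim, patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(
            torch.zeros(1, self.grid[0] * self.grid[1], embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=4 * embed_dim,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        x = self.patch_embed(x)                    # (B, C, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)           # (B, N, C) token sequence
        x = self.blocks(x + self.pos_embed)
        B, N, C = x.shape
        return x.transpose(1, 2).reshape(B, C, *self.grid)  # back to a 2D map


class DeconvHead(nn.Module):
    """Two deconvolution layers upsample the features 4x, then a 1x1 conv
    predicts one heatmap per keypoint (17 for COCO)."""
    def __init__(self, in_ch=384, num_keypoints=17):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(in_ch, 256, 4, stride=2, padding=1),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 256, 4, stride=2, padding=1),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_keypoints, 1))

    def forward(self, x):
        return self.decoder(x)


class ViTPoseSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = PlainViTEncoder()
        self.head = DeconvHead()

    def forward(self, x):
        return self.head(self.backbone(x))


if __name__ == "__main__":
    model = ViTPoseSketch()
    heatmaps = model(torch.randn(1, 3, 256, 192))
    print(heatmaps.shape)  # torch.Size([1, 17, 64, 48])
```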
Related papers
- HSViT: Horizontally Scalable Vision Transformer [16.46308352393693]
Vision Transformer (ViT) requires pre-training on large-scale datasets to perform well.
This paper introduces a novel horizontally scalable vision transformer (HSViT) scheme.
HSViT achieves up to 10% higher top-1 accuracy than state-of-the-art schemes on small datasets.
arXiv Detail & Related papers (2024-04-08T04:53:29Z)
- MiDaS v3.1 -- A Model Zoo for Robust Monocular Relative Depth Estimation [4.563488428831042]
We release MiDaS v3.1 for monocular depth estimation, offering a variety of new models based on different encoder backbones.
We explore how using the most promising vision transformers as image encoders impacts depth estimation quality and runtime of the MiDaS architecture.
arXiv Detail & Related papers (2023-07-26T19:01:49Z)
- ViTPose++: Vision Transformer for Generic Body Pose Estimation [70.86760562151163]
We show the surprisingly good properties of plain vision transformers for body pose estimation from various aspects.
ViTPose employs the plain and non-hierarchical vision transformer as an encoder to encode features and a lightweight decoder to decode body keypoints.
We empirically demonstrate that the knowledge of large ViTPose models can be easily transferred to small ones via a simple knowledge token.
arXiv Detail & Related papers (2022-12-07T12:33:28Z)
- Vision Transformer Adapter for Dense Predictions [57.590511173416445]
Vision Transformer (ViT) achieves inferior performance on dense prediction tasks because it lacks image-related prior information.
We propose a Vision Transformer Adapter (ViT-Adapter) which can remedy the defects of ViT and achieve comparable performance to vision-specific models.
We verify the effectiveness of our ViT-Adapter on multiple downstream tasks, including object detection, instance segmentation, and semantic segmentation.
arXiv Detail & Related papers (2022-05-17T17:59:11Z)
- Scaling Vision Transformers [82.08465256393514]
We study how Vision Transformers scale and characterize the relationships between error rate, data, and compute.
We train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy.
The model also performs well on few-shot learning, for example, attaining 84.86% top-1 accuracy on ImageNet with only 10 examples per class.
arXiv Detail & Related papers (2021-06-08T17:47:39Z)
- Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)
- End-to-End Trainable Multi-Instance Pose Estimation with Transformers [68.93512627479197]
We propose a new end-to-end trainable approach for multi-instance pose estimation by combining a convolutional neural network with a transformer.
Inspired by recent work on end-to-end trainable object detection with transformers, we use a transformer encoder-decoder architecture together with a bipartite matching scheme to directly regress the pose of all individuals in a given image.
Our model, called POse Estimation Transformer (POET), is trained using a novel set-based global loss that consists of a keypoint loss, a keypoint visibility loss, a center loss and a class loss.
arXiv Detail & Related papers (2021-03-22T18:19:22Z)
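For context on the set-based formulation mentioned in the last entry above, the following is a small sketch of how a bipartite matching between predicted and ground-truth poses can be computed with the Hungarian algorithm. It follows the generic DETR-style recipe rather than POET's published loss; the cost terms and weights are illustrative assumptions.
```python
# Illustrative bipartite matching of predicted pose queries to ground-truth
# instances (DETR-style recipe; cost terms and weights are assumptions, not
# POET's exact formulation).
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_predictions(pred_keypoints, pred_person_prob, gt_keypoints,
                      kpt_weight=1.0, cls_weight=1.0):
    """pred_keypoints: (Q, K, 2) queries, gt_keypoints: (G, K, 2) instances,
    pred_person_prob: (Q,) probability that each query contains a person.
    Returns (pred_idx, gt_idx) arrays giving the optimal one-to-one matching."""
    # Pairwise L1 distance between every predicted and ground-truth pose.
    kpt_cost = np.abs(pred_keypoints[:, None] - gt_keypoints[None]).sum(axis=(2, 3))
    # Prefer matching queries that are confident they contain a person.
    cls_cost = -pred_person_prob[:, None]
    cost = kpt_weight * kpt_cost + cls_weight * cls_cost      # (Q, G)
    return linear_sum_assignment(cost)                        # Hungarian algorithm


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    preds = rng.uniform(size=(5, 17, 2))     # 5 queries, 17 keypoints each
    probs = rng.uniform(size=5)
    gts = rng.uniform(size=(2, 17, 2))       # 2 ground-truth people
    pred_idx, gt_idx = match_predictions(preds, probs, gts)
    print(pred_idx, gt_idx)                  # e.g. [1 4] [0 1]
```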
This list is automatically generated from the titles and abstracts of the papers on this site.