ViTPose++: Vision Transformer for Generic Body Pose Estimation
- URL: http://arxiv.org/abs/2212.04246v3
- Date: Thu, 14 Dec 2023 08:57:43 GMT
- Title: ViTPose++: Vision Transformer for Generic Body Pose Estimation
- Authors: Yufei Xu, Jing Zhang, Qiming Zhang, Dacheng Tao
- Abstract summary: We show the surprisingly good properties of plain vision transformers for body pose estimation from various aspects.
ViTPose employs the plain and non-hierarchical vision transformer as an encoder to encode features and a lightweight decoder to decode body keypoints.
We empirically demonstrate that the knowledge of large ViTPose models can be easily transferred to small ones via a simple knowledge token.
- Score: 70.86760562151163
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we show the surprisingly good properties of plain vision
transformers for body pose estimation from various aspects, namely simplicity
in model structure, scalability in model size, flexibility in training
paradigm, and transferability of knowledge between models, through a simple
baseline model dubbed ViTPose. Specifically, ViTPose employs the plain and
non-hierarchical vision transformer as an encoder to encode features and a
lightweight decoder to decode body keypoints in either a top-down or a
bottom-up manner. It can be scaled up from about 20M to 1B parameters by taking
advantage of the scalable model capacity and high parallelism of the vision
transformer, setting a new Pareto front for throughput and performance.
Besides, ViTPose is very flexible regarding the attention type, input
resolution, and pre-training and fine-tuning strategy. Based on this
flexibility, a novel ViTPose+ model is proposed to deal with heterogeneous body
keypoint categories in different types of body pose estimation tasks via
knowledge factorization, i.e., adopting task-agnostic and task-specific
feed-forward networks in the transformer. We also empirically demonstrate that
the knowledge of large ViTPose models can be easily transferred to small ones
via a simple knowledge token. Experimental results show that our ViTPose model
outperforms representative methods on the challenging MS COCO Human Keypoint
Detection benchmark in both top-down and bottom-up settings. Furthermore, our
ViTPose+ model achieves state-of-the-art performance simultaneously on a series
of body pose estimation tasks, including MS COCO, AI Challenger, OCHuman, MPII
for human keypoint detection, COCO-Wholebody for whole-body keypoint detection,
as well as AP-10K and APT-36K for animal keypoint detection, without
sacrificing inference speed.
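
To make the described architecture concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' released code) of a ViTPose-style model: a plain, non-hierarchical ViT encoder followed by a lightweight deconvolution decoder that predicts per-keypoint heatmaps, with the ViTPose+ knowledge-factorization idea approximated by a shared task-agnostic FFN plus a small task-specific FFN per task. All class names, layer sizes, and the 256x192 input resolution are illustrative assumptions, not values taken from the paper.

```python
# Hypothetical sketch of a ViTPose-style model: plain ViT encoder + lightweight
# heatmap decoder, with a task-agnostic + task-specific FFN as in the
# knowledge-factorization idea of ViTPose+. Sizes and names are illustrative.
import torch
import torch.nn as nn


class FactorizedFFN(nn.Module):
    """Shared task-agnostic FFN plus one small task-specific FFN per task."""
    def __init__(self, dim, hidden, num_tasks):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.task_specific = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden // 4), nn.GELU(), nn.Linear(hidden // 4, dim))
            for _ in range(num_tasks)
        ])

    def forward(self, x, task_id):
        return self.shared(x) + self.task_specific[task_id](x)


class Block(nn.Module):
    """Standard pre-norm transformer block with full (global) self-attention."""
    def __init__(self, dim, heads, num_tasks):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = FactorizedFFN(dim, 4 * dim, num_tasks)

    def forward(self, x, task_id):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ffn(self.norm2(x), task_id)


class ViTPoseSketch(nn.Module):
    def __init__(self, img_size=(256, 192), patch=16, dim=384, depth=12,
                 heads=6, num_keypoints=17, num_tasks=1):
        super().__init__()
        self.grid = (img_size[0] // patch, img_size[1] // patch)
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.grid[0] * self.grid[1], dim))
        self.blocks = nn.ModuleList([Block(dim, heads, num_tasks) for _ in range(depth)])
        # Lightweight decoder: two 2x deconvolutions, then a 1x1 conv per keypoint.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(dim, 256, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 256, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_keypoints, kernel_size=1),
        )

    def forward(self, img, task_id=0):
        # Tokenize the image and add learnable position embeddings.
        x = self.patch_embed(img).flatten(2).transpose(1, 2) + self.pos_embed
        for blk in self.blocks:
            x = blk(x, task_id)
        # Fold tokens back into a 2D feature map for the heatmap decoder.
        x = x.transpose(1, 2).reshape(x.size(0), -1, *self.grid)
        return self.decoder(x)  # (B, num_keypoints, H/4, W/4) heatmaps


heatmaps = ViTPoseSketch()(torch.randn(1, 3, 256, 192))
print(heatmaps.shape)  # torch.Size([1, 17, 64, 48])
```

In a top-down pipeline, person crops from an off-the-shelf detector would be fed to this model; scaling the encoder depth and width is what moves the family from roughly 20M toward 1B parameters as described above.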
Related papers
- Applying ViT in Generalized Few-shot Semantic Segmentation [0.0]
This paper explores the capability of ViT-based models under the generalized few-shot semantic segmentation (GFSS) framework.
We conduct experiments with various combinations of backbone models, including ResNets and pretrained Vision Transformer (ViT)-based models.
We demonstrate the great potential of large pretrained ViT-based models on the GFSS task, and expect further improvement on testing benchmarks.
arXiv Detail & Related papers (2024-08-27T11:04:53Z)
- Group Multi-View Transformer for 3D Shape Analysis with Spatial Encoding [81.1943823985213]
In recent years, the results of view-based 3D shape recognition methods have saturated, and models with excellent performance cannot be deployed on memory-limited devices.
We introduce a compression method based on knowledge distillation for this field, which largely reduces the number of parameters while preserving model performance as much as possible.
Specifically, to enhance the capabilities of smaller models, we design a high-performing large model called the Group Multi-view Vision Transformer (GMViT).
The large model GMViT achieves excellent 3D classification and retrieval results on the benchmark datasets ModelNet, ShapeNetCore55, and MCB.
arXiv Detail & Related papers (2023-12-27T08:52:41Z)
- HEViTPose: High-Efficiency Vision Transformer for Human Pose Estimation [3.1690891866882236]
This paper proposes a High-Efficiency Vision Transformer for Human Pose Estimation (HEViTPose).
In HEViTPose, a Cascaded Group Spatial Reduction Multi-Head Attention Module (CGSR-MHA) is proposed, which reduces the computational cost.
Comprehensive experiments on two benchmark datasets (MPII and COCO) demonstrate that the small and large HEViTPose models are on par with state-of-the-art models.
arXiv Detail & Related papers (2023-11-22T06:45:16Z)
- A Close Look at Spatial Modeling: From Attention to Convolution [70.5571582194057]
Vision Transformers have shown great promise recently for many vision tasks due to the insightful architecture design and attention mechanism.
We generalize the self-attention formulation to abstract a query-irrelevant global context directly and integrate the global context into convolutions.
With fewer than 14M parameters, our FCViT-S12 outperforms the related ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-12-23T19:13:43Z)
- PSAQ-ViT V2: Towards Accurate and General Data-Free Quantization for Vision Transformers [2.954890575035673]
Data-free quantization can potentially address data privacy and security concerns in model compression.
Recently, PSAQ-ViT designed a relative value metric, patch similarity, to generate data from pre-trained vision transformers (ViTs).
In this paper, we propose PSAQ-ViT V2, a more accurate and general data-free quantization framework for ViTs.
arXiv Detail & Related papers (2022-09-13T01:55:53Z)
- ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation [76.35955924137986]
We show that a plain vision transformer with MAE pretraining can obtain superior performance after finetuning on human pose estimation datasets.
Our biggest ViTPose model based on the ViTAE-G backbone with 1 billion parameters obtains the best 80.9 mAP on the MS COCO test-dev set.
arXiv Detail & Related papers (2022-04-26T17:55:04Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring the intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)