HEViTPose: High-Efficiency Vision Transformer for Human Pose Estimation
- URL: http://arxiv.org/abs/2311.13615v1
- Date: Wed, 22 Nov 2023 06:45:16 GMT
- Title: HEViTPose: High-Efficiency Vision Transformer for Human Pose Estimation
- Authors: Chengpeng Wu, Guangxing Tan, Chunyu Li
- Abstract summary: This paper proposes a High-Efficiency Vision Transformer for Human Pose Estimation (HEViTPose).
In HEViTPose, a Cascaded Group Spatial Reduction Multi-Head Attention Module (CGSR-MHA) is proposed, which reduces the computational cost.
Comprehensive experiments on two benchmark datasets (MPII and COCO) demonstrate that the small and large HEViTPose models are on par with state-of-the-art models.
- Score: 3.1690891866882236
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human pose estimation in complicated situations has always been a challenging
task. Many Transformer-based pose networks have been proposed recently,
achieving encouraging progress in improving performance. However, the
remarkable performance of pose networks is always accompanied by heavy
computation costs and large network scale. In order to deal with this problem,
this paper proposes a High-Efficiency Vision Transformer for Human Pose
Estimation (HEViTPose). In HEViTPose, a Cascaded Group Spatial Reduction
Multi-Head Attention Module (CGSR-MHA) is proposed, which reduces the
computational cost through feature grouping and spatial degradation mechanisms,
while preserving feature diversity through multiple low-dimensional attention
heads. Moreover, a concept of Patch Embedded Overlap Width (PEOW) is defined to
help understand the relationship between the amount of overlap and local
continuity. By optimising PEOW, our model gains improvements in performance,
parameters and GFLOPs.
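
To make the attention design above concrete, here is a minimal sketch of grouped spatial-reduction multi-head attention: channels are split into groups, each group's keys and values are shrunk by a strided convolution (PVT-style), and each group runs its own low-dimensional attention heads. This is an illustrative reading of the idea, not the paper's exact CGSR-MHA cascade; all module and parameter names are ours.

```python
# Illustrative sketch (not the authors' exact CGSR-MHA): grouped multi-head
# attention in which keys/values are spatially reduced by a strided convolution
# so that attention cost scales with the reduced token count, while each group
# keeps its own low-dimensional heads to preserve feature diversity.
import torch
import torch.nn as nn


class GroupedSRAttention(nn.Module):
    def __init__(self, dim=64, groups=2, heads_per_group=2, sr_ratio=2):
        super().__init__()
        assert dim % groups == 0
        self.group_dim = dim // groups
        self.attn = nn.ModuleList([
            nn.MultiheadAttention(self.group_dim, heads_per_group, batch_first=True)
            for _ in range(groups)
        ])
        # Strided conv shrinks the key/value token grid by sr_ratio per side.
        self.sr = nn.ModuleList([
            nn.Conv2d(self.group_dim, self.group_dim, kernel_size=sr_ratio, stride=sr_ratio)
            for _ in range(groups)
        ])

    def forward(self, x, h, w):
        # x: (B, N, C) with N == h * w
        outs = []
        for g, (attn, sr) in enumerate(zip(self.attn, self.sr)):
            xg = x[:, :, g * self.group_dim:(g + 1) * self.group_dim]
            kv = xg.transpose(1, 2).reshape(-1, self.group_dim, h, w)
            kv = sr(kv).flatten(2).transpose(1, 2)       # (B, N / sr^2, C / groups)
            out, _ = attn(xg, kv, kv, need_weights=False)
            outs.append(out)
        return torch.cat(outs, dim=-1)


if __name__ == "__main__":
    x = torch.randn(1, 16 * 16, 64)
    print(GroupedSRAttention()(x, 16, 16).shape)         # torch.Size([1, 256, 64])
```
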
Comprehensive experiments on two benchmark datasets (MPII and COCO)
demonstrate that the small and large HEViTPose models are on par with
state-of-the-art models while being more lightweight. Specifically, HEViTPose-B
achieves 90.7 PCK@0.5 on the MPII test set and 72.6 AP on the COCO test-dev2017
set. Compared with HRNet-W32 and Swin-S, HEViTPose-B significantly reduces
Params ($\downarrow$62.1%, $\downarrow$80.4%) and GFLOPs ($\downarrow$43.4%,
$\downarrow$63.8%). Code and models are available at \url{here}.
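
One plausible way to picture the PEOW concept is through an overlapping patch embedding: when the embedding convolution's kernel is larger than its stride, adjacent patches share kernel-minus-stride pixels per side, and increasing that overlap improves local continuity at the cost of more tokens. The sketch below follows this assumed reading; the paper's exact definition of PEOW may differ.

```python
# Illustrative overlapping patch embedding: with kernel k and stride s, adjacent
# patches share (k - s) pixels per side, which is one plausible reading of a
# "patch embedded overlap width"; the paper's exact definition may differ.
import torch
import torch.nn as nn


def patch_embed(img, dim=64, kernel=4, stride=3):
    overlap = kernel - stride                      # assumed PEOW-style overlap
    pad = kernel // 2 if overlap > 0 else 0
    proj = nn.Conv2d(3, dim, kernel_size=kernel, stride=stride, padding=pad)
    tokens = proj(img).flatten(2).transpose(1, 2)  # (B, num_patches, dim)
    return tokens, overlap


if __name__ == "__main__":
    img = torch.randn(1, 3, 64, 64)
    for k, s in [(4, 4), (4, 3), (7, 4)]:          # overlaps of 0, 1, 3 pixels
        t, ov = patch_embed(img, kernel=k, stride=s)
        print(f"kernel={k} stride={s} overlap={ov} tokens={t.shape[1]}")
```
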
Related papers
- HIRI-ViT: Scaling Vision Transformer with High Resolution Inputs [102.4965532024391]
Hybrid deep models of Vision Transformer (ViT) and Convolutional Neural Network (CNN) have emerged as a powerful class of backbones for vision tasks.
We present a new hybrid backbone with HIgh-Resolution Inputs (namely HIRI-ViT), which upgrades the prevalent four-stage ViT to a five-stage ViT tailored for high-resolution inputs.
HIRI-ViT achieves to-date the best published Top-1 accuracy of 84.3% on ImageNet with 448$\times$448 inputs, an absolute 0.9% improvement over the 83.4% of iFormer-S with 224$\times$224 inputs.
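
As a rough way to see why a fifth stage helps at high resolution, the arithmetic below (generic, not HIRI-ViT's actual configuration) shows that a five-stage pyramid fed 448x448 inputs ends at the same 7x7 grid that a four-stage pyramid reaches at 224x224, keeping the deepest, heaviest blocks at a small token count.

```python
# Generic resolution-schedule arithmetic (not HIRI-ViT's actual configuration):
# each stage halves the grid, so one extra stage lets a 448x448 input reach the
# same coarse 7x7 grid that a four-stage backbone reaches with 224x224 inputs.
def stage_grids(input_size: int, num_stages: int, stem_stride: int = 4):
    side = input_size // stem_stride
    grids = []
    for _ in range(num_stages):
        grids.append((side, side))
        side //= 2
    return grids


for input_size, stages in ((224, 4), (448, 4), (448, 5)):
    path = " -> ".join(f"{h}x{w}" for h, w in stage_grids(input_size, stages))
    print(f"{stages}-stage @ {input_size}x{input_size}: {path}")
```
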
arXiv Detail & Related papers (2024-03-18T17:34:29Z) - GOHSP: A Unified Framework of Graph and Optimization-based Heterogeneous
Structured Pruning for Vision Transformer [76.2625311630021]
Vision transformers (ViTs) have shown very impressive empirical performance in various computer vision tasks, but their heavy computation and memory cost can hinder practical deployment.
To mitigate this problem, structured pruning is a promising solution to compress model size and enable practical efficiency.
We propose GOHSP, a unified framework of Graph and Optimization-based Structured Pruning for ViT models.
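
For context on what structured pruning means here: whole units such as attention heads or MLP channels are removed, so the pruned model speeds up without sparse kernels. The snippet below is a generic magnitude-based head-selection sketch, not GOHSP's graph- and optimization-based importance estimation.

```python
# Generic structured pruning sketch (not GOHSP's method): score each attention
# head by the L1 norm of its output-projection slice and keep the top-k heads,
# so whole heads (structured units) are removed rather than scattered weights.
import torch


def select_heads(out_proj_weight: torch.Tensor, num_heads: int, keep: int):
    dim = out_proj_weight.shape[1]
    head_dim = dim // num_heads
    # Importance of head h = L1 norm of the columns it feeds into out_proj.
    scores = torch.stack([
        out_proj_weight[:, h * head_dim:(h + 1) * head_dim].abs().sum()
        for h in range(num_heads)
    ])
    return torch.topk(scores, keep).indices.sort().values


if __name__ == "__main__":
    w = torch.randn(192, 192)                      # out_proj of a 3-head, 192-dim block
    print(select_heads(w, num_heads=3, keep=2))    # indices of the 2 heads to keep
```
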
arXiv Detail & Related papers (2023-01-13T00:40:24Z) - ViTPose++: Vision Transformer for Generic Body Pose Estimation [70.86760562151163]
We show the surprisingly good properties of plain vision transformers for body pose estimation from various aspects.
ViTPose employs the plain and non-hierarchical vision transformer as an encoder to encode features and a lightweight decoder to decode body keypoints.
We empirically demonstrate that the knowledge of large ViTPose models can be easily transferred to small ones via a simple knowledge token.
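
The encoder-decoder split described above can be sketched as a plain (non-hierarchical) ViT encoder followed by a lightweight deconvolution head that upsamples tokens into keypoint heatmaps. Sizes and layer choices below are illustrative assumptions, not the released ViTPose/ViTPose++ configurations.

```python
# Sketch of a plain, non-hierarchical ViT encoder plus a lightweight decoder
# that upsamples tokens into keypoint heatmaps; sizes are illustrative and do
# not match the released ViTPose/ViTPose++ configurations.
import torch
import torch.nn as nn


class PlainViTPose(nn.Module):
    def __init__(self, img=256, patch=16, dim=256, depth=4, heads=8, joints=17):
        super().__init__()
        self.grid = img // patch
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, self.grid * self.grid, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # Lightweight decoder: two deconvs (4x upsampling) then a 1x1 head.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(dim, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(dim, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, joints, 1),
        )

    def forward(self, x):
        tokens = self.embed(x).flatten(2).transpose(1, 2)          # (B, N, dim)
        tokens = self.encoder(tokens + self.pos)
        feat = tokens.transpose(1, 2).reshape(x.shape[0], -1, self.grid, self.grid)
        return self.decoder(feat)                                  # (B, joints, 64, 64)


if __name__ == "__main__":
    print(PlainViTPose()(torch.randn(1, 3, 256, 256)).shape)       # (1, 17, 64, 64)
```
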
arXiv Detail & Related papers (2022-12-07T12:33:28Z) - Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer [56.87383229709899]
We develop an information rectification module (IRM) and a distribution-guided distillation scheme for fully quantized vision transformers (Q-ViT).
Our method achieves much better performance than prior arts.
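
As background for what "fully quantized" means here: both weights and activations are mapped to a small set of integer levels during the forward pass. The snippet below is a generic uniform fake-quantization sketch, not Q-ViT's information rectification module or its distillation scheme.

```python
# Generic uniform "fake quantization" sketch (not Q-ViT's IRM/distillation):
# values are rounded to 2^bits signed levels in the forward pass, which is what
# "fully quantized" refers to for low-bit vision transformers.
import torch


def fake_quantize(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    qmax = 2 ** (bits - 1) - 1                        # e.g. 7 for 4-bit signed
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale


if __name__ == "__main__":
    w = torch.randn(8, 8)
    wq = fake_quantize(w, bits=4)
    print("distinct levels:", torch.unique(wq).numel())            # at most 16
    print("mean abs error:", (w - wq).abs().mean().item())
```
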
arXiv Detail & Related papers (2022-10-13T04:00:29Z) - PSAQ-ViT V2: Towards Accurate and General Data-Free Quantization for
Vision Transformers [2.954890575035673]
Data-free quantization can potentially address data privacy and security concerns in model compression.
Recently, PSAQ-ViT designs a relative value metric, patch similarity, to generate data from pre-trained vision transformers (ViTs).
In this paper, we propose PSAQ-ViT V2, a more accurate and general data-free quantization framework for ViTs.
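
A loose illustration of a patch-similarity style signal: pairwise cosine similarities between a ViT layer's patch tokens form a distribution that can be used as an optimization target when synthesizing calibration images from noise. This is an assumed simplification, not PSAQ-ViT's exact metric.

```python
# Loose illustration of a patch-similarity signal (not PSAQ-ViT's exact metric):
# pairwise cosine similarity between patch tokens, plus a simple diversity proxy
# that could serve as an optimization target when generating calibration images.
import torch
import torch.nn.functional as F


def patch_similarity(tokens: torch.Tensor) -> torch.Tensor:
    # tokens: (B, N, C) patch features taken from some ViT block
    t = F.normalize(tokens, dim=-1)
    return t @ t.transpose(1, 2)                   # (B, N, N) cosine similarities


def similarity_diversity(tokens: torch.Tensor) -> torch.Tensor:
    sim = patch_similarity(tokens)
    n = sim.shape[-1]
    off_diag = sim[:, ~torch.eye(n, dtype=torch.bool)]
    return off_diag.std(dim=-1)                    # (B,) diversity proxy


if __name__ == "__main__":
    tokens = torch.randn(1, 196, 384)
    print(patch_similarity(tokens).shape, similarity_diversity(tokens))
```
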
arXiv Detail & Related papers (2022-09-13T01:55:53Z) - Parameterization of Cross-Token Relations with Relative Positional
Encoding for Vision MLP [52.25478388220691]
Vision multi-layer perceptrons (MLPs) have shown promising performance in computer vision tasks.
They use token-mixing layers to capture cross-token interactions, as opposed to the multi-head self-attention mechanism used by Transformers.
We propose a new positional spatial gating unit (PoSGU) to efficiently encode the cross-token relations for token mixing.
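
The gist of parameterizing cross-token relations with relative positions: instead of learning a free N x N token-mixing matrix, the mixing weight between tokens i and j is looked up from a table indexed by their relative offset, giving O(N) parameters and translation awareness. Below is a minimal 1D sketch under that reading, not the PoSGU module itself.

```python
# Minimal sketch of relative-position-parameterized token mixing (not the exact
# PoSGU): the N x N mixing matrix is built from a table indexed by relative
# offsets, so it has O(N) parameters instead of O(N^2) and is translation-aware.
import torch
import torch.nn as nn


class RelPosTokenMixer(nn.Module):
    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        # One learnable weight per relative offset in [-(N-1), N-1].
        self.rel_table = nn.Parameter(torch.zeros(2 * num_tokens - 1))
        idx = torch.arange(num_tokens)
        self.register_buffer("rel_index", idx[None, :] - idx[:, None] + num_tokens - 1)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                           # x: (B, N, C)
        mix = self.rel_table[self.rel_index]        # (N, N) built from offsets
        return self.proj(torch.einsum("ij,bjc->bic", mix, x))


if __name__ == "__main__":
    x = torch.randn(2, 49, 96)
    print(RelPosTokenMixer(49, 96)(x).shape)        # torch.Size([2, 49, 96])
```
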
arXiv Detail & Related papers (2022-07-15T04:18:06Z) - ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation [76.35955924137986]
We show that a plain vision transformer with MAE pretraining can obtain superior performance after finetuning on human pose estimation datasets.
Our biggest ViTPose model based on the ViTAE-G backbone with 1 billion parameters obtains the best 80.9 mAP on the MS COCO test-dev set.
arXiv Detail & Related papers (2022-04-26T17:55:04Z) - Towards Simple and Accurate Human Pose Estimation with Stair Network [34.421529219040295]
We develop a small yet discriminative model called STair Network, which can be stacked towards an accurate multi-stage pose estimation system.
To reduce computational cost, STair Network is composed of novel basic feature extraction blocks.
We demonstrate the effectiveness of the STair Network on two standard datasets.
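
The stacking idea can be sketched generically: each stage sees shared image features plus the previous stage's heatmaps and predicts refined heatmaps, with supervision typically applied to every stage. This is a generic multi-stage sketch; the STair Network's actual feature extraction blocks are not reproduced here.

```python
# Generic stacked multi-stage pose sketch (not STair Network's actual blocks):
# each stage takes the shared features plus the previous stage's heatmaps and
# predicts refined heatmaps; losses are usually applied to every stage's output.
import torch
import torch.nn as nn


class StackedStages(nn.Module):
    def __init__(self, feat_dim=64, joints=16, stages=3):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(feat_dim + (joints if s > 0 else 0), feat_dim, 3, padding=1),
                nn.ReLU(),
                nn.Conv2d(feat_dim, joints, 1),
            )
            for s in range(stages)
        ])

    def forward(self, feats):                       # feats: (B, feat_dim, H, W)
        heatmaps, outputs = None, []
        for s, stage in enumerate(self.stages):
            inp = feats if s == 0 else torch.cat([feats, heatmaps], dim=1)
            heatmaps = stage(inp)
            outputs.append(heatmaps)                # supervise every stage
        return outputs


if __name__ == "__main__":
    outs = StackedStages()(torch.randn(1, 64, 64, 48))
    print([o.shape for o in outs])                  # 3 x (1, 16, 64, 48)
```
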
arXiv Detail & Related papers (2022-02-18T10:37:13Z)