Distilling Token-Pruned Pose Transformer for 2D Human Pose Estimation
- URL: http://arxiv.org/abs/2304.05548v1
- Date: Wed, 12 Apr 2023 00:46:41 GMT
- Title: Distilling Token-Pruned Pose Transformer for 2D Human Pose Estimation
- Authors: Feixiang Ren
- Abstract summary: The recent token-Pruned Pose Transformer (PPT) improves efficiency by pruning the less informative background tokens of the image.
We present a novel method called Distilling Pruned-Token Transformer for human pose estimation (DPPT).
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human pose estimation has seen widespread use of transformer models in recent
years. Pose transformers benefit from the self-attention map, which captures
the correlation between human joint tokens and the image. However, training
such models is computationally expensive. The recent token-Pruned Pose
Transformer (PPT) solves this problem by pruning the background tokens of the
image, which are usually less informative. However, although it improves
efficiency, PPT inevitably leads to worse performance than TokenPose due to the
pruning of tokens. To overcome this problem, we present a novel method called
Distilling Pruned-Token Transformer for human pose estimation (DPPT). Our
method leverages the output of a pre-trained TokenPose to supervise the
learning process of PPT. We also establish connections between the internal
structure of pose transformers and PPT, such as attention maps and joint
features. Our experimental results on the MPII dataset show that our DPPT can
significantly improve PCK compared to previous PPT models while still reducing
computational complexity.
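The abstract describes a teacher-student setup: a frozen, pre-trained TokenPose supervises PPT through its outputs and through internal signals such as attention maps and joint features. Below is a minimal sketch of what such a combined distillation objective could look like; the tensor layouts, the use of MSE for every term, and the loss weights are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def dppt_distillation_loss(student_out, teacher_out, gt_heatmaps,
                           w_task=1.0, w_kd=1.0, w_attn=0.5, w_feat=0.5):
    """Hypothetical distillation objective for training a pruned pose
    transformer (student) under a frozen TokenPose teacher.
    Each *_out is assumed to be a dict with 'heatmaps' [B, K, H, W],
    'attn' [B, heads, N, N] (already restricted to a common set of kept
    tokens), and 'joint_feats' [B, K, C]; shapes and weights are illustrative.
    """
    # Standard supervised loss against ground-truth heatmaps.
    task = F.mse_loss(student_out["heatmaps"], gt_heatmaps)
    # Output distillation: imitate the teacher's predicted heatmaps.
    kd = F.mse_loss(student_out["heatmaps"], teacher_out["heatmaps"].detach())
    # Attention-map distillation: align the two models' attention maps.
    attn = F.mse_loss(student_out["attn"], teacher_out["attn"].detach())
    # Joint-feature distillation: match the keypoint token embeddings.
    feat = F.mse_loss(student_out["joint_feats"], teacher_out["joint_feats"].detach())
    return w_task * task + w_kd * kd + w_attn * attn + w_feat * feat
```

In practice the student's attention maps would cover only the kept tokens, so some alignment to the teacher's token set is needed before the attention term can be computed.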
Related papers
- Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation [73.31524865643709]
We present a plug-and-play pruning-and-recovering framework, called Hourglass Tokenizer (HoT), for efficient transformer-based 3D pose estimation from videos.
Our HoT begins with pruning pose tokens of redundant frames and ends with recovering full-length tokens, resulting in a few pose tokens in the intermediate transformer blocks.
Our method can achieve both high efficiency and estimation accuracy compared to the original VPT models.
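As a rough illustration of the prune-then-recover pattern described above, the sketch below subsamples frame-level pose tokens before the intermediate blocks and copies each processed token back to its nearest original frame afterwards; both the uniform selection rule and the nearest-frame recovery are simplifications chosen for brevity, not HoT's actual pruning and recovering modules.

```python
import torch

def hourglass_prune_recover(frame_tokens, blocks, keep_ratio=0.25):
    """Rough sketch of a prune-then-recover pattern over video pose tokens.
    frame_tokens: [B, T, C] one pose token per frame (simplified)
    blocks:       intermediate transformer blocks applied to the pruned sequence
    """
    B, T, C = frame_tokens.shape
    num_keep = max(1, int(T * keep_ratio))
    # Prune: keep a sparse, evenly spaced subset of frame tokens.
    keep_idx = torch.linspace(0, T - 1, num_keep,
                              device=frame_tokens.device).long()   # [num_keep]
    pruned = frame_tokens[:, keep_idx]                              # [B, num_keep, C]

    for blk in blocks:           # cheap: attention runs over few tokens
        pruned = blk(pruned)

    # Recover: give every original frame the token of its nearest kept frame.
    frame_pos = torch.arange(T, device=frame_tokens.device)
    nearest = (frame_pos[:, None] - keep_idx[None, :]).abs().argmin(dim=-1)  # [T]
    recovered = pruned[:, nearest]                                  # [B, T, C]
    return recovered
```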
arXiv Detail & Related papers (2023-11-20T18:59:51Z) - Improved TokenPose with Sparsity [0.0]
We introduce sparsity in both keypoint token attention and visual token attention to improve human pose estimation.
Experimental results on the MPII dataset demonstrate that our model achieves higher accuracy and confirm the feasibility of the method.
arXiv Detail & Related papers (2023-11-16T08:12:34Z) - Learning to Mask and Permute Visual Tokens for Vision Transformer
Pre-Training [59.923672191632065]
We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT).
MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies.
Our results demonstrate that MaPeT achieves competitive performance on ImageNet.
arXiv Detail & Related papers (2023-06-12T18:12:19Z) - Joint Token Pruning and Squeezing Towards More Aggressive Compression of
Vision Transformers [2.0442992958844517]
We propose a novel Token Pruning & Squeezing module (TPS) for compressing vision transformers with higher efficiency.
TPS squeezes the information of pruned tokens into partial reserved tokens via the unidirectional nearest-neighbor matching and similarity-based fusing steps.
Our method can push the throughput of DeiT-small beyond that of DeiT-tiny, while its accuracy surpasses DeiT-tiny by 4.78%.
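A possible reading of the matching-and-fusing step is sketched below: each pruned token is assigned to its most similar reserved token and folded into it with a similarity weight. Cosine similarity as the metric and the weighted-average fusion are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def prune_and_squeeze(tokens, keep_idx, prune_idx):
    """Hypothetical Token Pruning & Squeezing step.
    tokens:    [B, N, C] visual tokens
    keep_idx:  [B, K] indices of reserved tokens
    prune_idx: [B, P] indices of pruned tokens
    Returns reserved tokens [B, K, C] with pruned information folded in.
    """
    B, N, C = tokens.shape
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, C))
    pruned = torch.gather(tokens, 1, prune_idx.unsqueeze(-1).expand(-1, -1, C))

    # Unidirectional nearest-neighbor matching: each pruned token picks the
    # most similar reserved token (cosine similarity as an example metric).
    sim = torch.einsum("bpc,bkc->bpk",
                       F.normalize(pruned, dim=-1),
                       F.normalize(kept, dim=-1))          # [B, P, K]
    weight, match = sim.max(dim=-1)                        # [B, P]

    # Similarity-based fusing: accumulate each pruned token into its host
    # reserved token, weighted by the matching similarity, then renormalize.
    fused = kept.clone()
    norm = torch.ones(B, kept.shape[1], 1, device=tokens.device)
    fused.scatter_add_(1, match.unsqueeze(-1).expand(-1, -1, C),
                       pruned * weight.unsqueeze(-1))
    norm.scatter_add_(1, match.unsqueeze(-1), weight.unsqueeze(-1))
    return fused / norm
```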
arXiv Detail & Related papers (2023-04-21T02:59:30Z) - Image Deblurring by Exploring In-depth Properties of Transformer [86.7039249037193]
We leverage deep features extracted from a pretrained vision transformer (ViT) to encourage recovered images to be sharp without sacrificing the performance measured by the quantitative metrics.
By comparing the transformer features of the recovered image and the target one, the pretrained transformer provides high-resolution, blur-sensitive semantic information.
One variant regards the features as vectors and computes the discrepancy between the representations extracted from the recovered and target images in Euclidean space.
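A minimal sketch of that Euclidean-space variant is shown below: a frozen ViT-style feature extractor is applied to both images and the token features are compared with an L2 (mean-squared) discrepancy. The extractor interface and the choice of loss reduction are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def feature_discrepancy_loss(extractor, restored, target):
    """Hypothetical perceptual term: compare frozen ViT features of the
    restored image and the sharp target in Euclidean space.
    `extractor` is any frozen module mapping images [B, 3, H, W] to
    token features [B, N, C] (e.g. a pretrained ViT's feature head).
    """
    with torch.no_grad():
        feat_target = extractor(target)      # target features stay fixed
    feat_restored = extractor(restored)      # gradients flow to the deblurring net
    # Treat each token embedding as a vector and measure the L2 discrepancy.
    return F.mse_loss(feat_restored, feat_target)
```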
arXiv Detail & Related papers (2023-03-24T14:14:25Z) - PPT: token-Pruned Pose Transformer for monocular and multi-view human
pose estimation [25.878375219234975]
We propose a token-Pruned Pose Transformer (PPT) for 2D human pose estimation, which locates a rough human mask and performs self-attention only within the selected tokens.
We also propose a new cross-view fusion strategy, called human area fusion, which considers all human foreground pixels as corresponding candidates.
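One plausible way to realize such token selection, sketched below, is to keep the visual tokens that receive the most attention from the keypoint tokens, which approximates a rough human-foreground mask; the scoring rule (summed keypoint-to-visual attention) and the keep ratio are assumptions, not the paper's exact procedure.

```python
import torch

def select_human_tokens(visual_tokens, kp_to_vis_attn, keep_ratio=0.7):
    """Hypothetical token-selection step for a pruned pose transformer.
    visual_tokens:  [B, N, C] image patch tokens
    kp_to_vis_attn: [B, heads, K, N] attention from K keypoint tokens to N visual tokens
    Keeps the visual tokens that keypoint tokens attend to most.
    """
    B, N, C = visual_tokens.shape
    # Score each visual token by the total attention it receives from keypoint tokens.
    scores = kp_to_vis_attn.sum(dim=(1, 2))                  # [B, N]
    num_keep = max(1, int(round(keep_ratio * N)))
    keep_idx = scores.topk(num_keep, dim=-1).indices         # [B, num_keep]
    kept = torch.gather(visual_tokens, 1,
                        keep_idx.unsqueeze(-1).expand(-1, -1, C))
    return kept, keep_idx
```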
arXiv Detail & Related papers (2022-09-16T23:22:47Z) - PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered solving vision tasks with transformers; it directly translates the image feature map into the object detection result.
Applied to the recent transformer-based image recognition model ViT, the approach shows a consistent efficiency gain.
arXiv Detail & Related papers (2021-09-15T01:10:30Z) - Learned Token Pruning for Transformers [39.181816379061374]
The Learned Token Pruning (LTP) method removes redundant tokens as the data passes through the different layers of a transformer.
We extensively test the performance of our approach on multiple GLUE tasks.
Preliminary results show up to 1.4x and 1.9x throughput improvement on a Tesla T4 GPU and an Intel Haswell CPU, respectively.
arXiv Detail & Related papers (2021-07-02T09:00:13Z) - End-to-End Trainable Multi-Instance Pose Estimation with Transformers [68.93512627479197]
We propose a new end-to-end trainable approach for multi-instance pose estimation by combining a convolutional neural network with a transformer.
Inspired by recent work on end-to-end trainable object detection with transformers, we use a transformer encoder-decoder architecture together with a bipartite matching scheme to directly regress the pose of all individuals in a given image.
Our model, called POse Estimation Transformer (POET), is trained using a novel set-based global loss that consists of a keypoint loss, a keypoint visibility loss, a center loss and a class loss.
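The summary above enumerates the terms of POET's set-based loss. The sketch below shows how such a composite loss might be assembled once predictions have been matched to ground-truth instances (e.g. by bipartite matching); the individual term definitions, tensor shapes, and weights are placeholders rather than the paper's formulation.

```python
import torch
import torch.nn.functional as F

def poet_style_set_loss(pred, gt, w_kp=1.0, w_vis=1.0, w_ctr=1.0, w_cls=1.0):
    """Illustrative composite loss over matched prediction/ground-truth pairs.
    pred/gt are dicts of tensors already aligned by a bipartite matching:
      'keypoints' [M, K, 2], 'visibility' [M, K], 'center' [M, 2],
      'class_logits' [M, num_classes] / 'labels' [M]
    """
    kp_loss = F.l1_loss(pred["keypoints"], gt["keypoints"])           # keypoint regression
    vis_loss = F.binary_cross_entropy_with_logits(
        pred["visibility"], gt["visibility"].float())                 # keypoint visibility
    ctr_loss = F.l1_loss(pred["center"], gt["center"])                # instance center
    cls_loss = F.cross_entropy(pred["class_logits"], gt["labels"])    # person vs. no-object
    return w_kp * kp_loss + w_vis * vis_loss + w_ctr * ctr_loss + w_cls * cls_loss
```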
arXiv Detail & Related papers (2021-03-22T18:19:22Z) - Pre-Trained Image Processing Transformer [95.93031793337613]
We develop a new pre-trained model, namely the image processing transformer (IPT).
We utilize the well-known ImageNet benchmark to generate a large number of corrupted image pairs.
The IPT model is trained on these images with multiple heads and multiple tails.
arXiv Detail & Related papers (2020-12-01T09:42:46Z)