PPT: token-Pruned Pose Transformer for monocular and multi-view human pose estimation
- URL: http://arxiv.org/abs/2209.08194v1
- Date: Fri, 16 Sep 2022 23:22:47 GMT
- Title: PPT: token-Pruned Pose Transformer for monocular and multi-view human pose estimation
- Authors: Haoyu Ma, Zhe Wang, Yifei Chen, Deying Kong, Liangjian Chen, Xingwei Liu, Xiangyi Yan, Hao Tang, Xiaohui Xie
- Abstract summary: We propose a token-Pruned Pose Transformer (PPT) for 2D human pose estimation, which locates a rough human mask and performs self-attention only within the selected tokens.
We also propose a new cross-view fusion strategy, called human area fusion, which considers all human foreground pixels as corresponding candidates.
- Score: 25.878375219234975
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, the vision transformer and its variants have played an increasingly
important role in both monocular and multi-view human pose estimation.
Considering image patches as tokens, transformers can model the global
dependencies within the entire image or across images from other views.
However, global attention is computationally expensive. As a consequence, it is
difficult to scale up these transformer-based methods to high-resolution
features and many views.
In this paper, we propose the token-Pruned Pose Transformer (PPT) for 2D
human pose estimation, which locates a rough human mask and performs
self-attention only within the selected tokens. Furthermore, we extend our PPT to
multi-view human pose estimation. Built upon PPT, we propose a new cross-view
fusion strategy, called human area fusion, which considers all human foreground
pixels as corresponding candidates. Experimental results on COCO and MPII
demonstrate that our PPT can match the accuracy of previous pose transformer
methods while reducing the computation. Moreover, experiments on Human3.6M and
Ski-Pose demonstrate that our Multi-view PPT can efficiently fuse cues from
multiple views and achieve new state-of-the-art results.
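To make the pruning idea concrete, below is a minimal PyTorch-style sketch (not the authors' code): it keeps only the highest-scoring visual tokens and runs self-attention within them. The per-token score is assumed here to be the attention a token receives from keypoint tokens; the names `prune_tokens` and `PrunedSelfAttention` are illustrative.

```python
import torch

def prune_tokens(tokens, scores, keep_ratio=0.7):
    """Keep only the highest-scoring visual tokens (rough human area).

    tokens: (B, N, C) visual token features
    scores: (B, N) per-token relevance, e.g. attention received from
            keypoint tokens (an assumed scoring rule)
    """
    B, N, C = tokens.shape
    k = max(1, int(N * keep_ratio))
    idx = scores.topk(k, dim=1).indices             # (B, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, C)       # (B, k, C)
    return torch.gather(tokens, 1, idx)

class PrunedSelfAttention(torch.nn.Module):
    """Self-attention restricted to the selected (human-area) tokens."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens, scores, keep_ratio=0.7):
        kept = prune_tokens(tokens, scores, keep_ratio)
        out, _ = self.attn(kept, kept, kept)  # O(k^2) rather than O(N^2)
        return out
```

In the multi-view extension, the abstract suggests the analogous step: cross-view attention runs only over the kept human-area tokens of each view, so every human foreground token can serve as a fusion candidate.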
Related papers
- Human Mesh Recovery from Arbitrary Multi-view Images [57.969696744428475]
We propose a divide and conquer framework for Unified Human Mesh Recovery (U-HMR) from arbitrary multi-view images.
In particular, U-HMR follows a decoupled design built from three components: camera and body decoupling (CBD), camera pose estimation (CPE), and arbitrary view fusion (AVF).
We conduct extensive experiments on three public datasets: Human3.6M, MPI-INF-3DHP, and TotalCapture.
arXiv Detail & Related papers (2024-03-19T04:47:56Z)
- Improved TokenPose with Sparsity [0.0]
We introduce sparsity in both keypoint token attention and visual token attention to improve human pose estimation.
Experimental results on the MPII dataset demonstrate that our model achieves higher accuracy, confirming the feasibility of the method.
arXiv Detail & Related papers (2023-11-16T08:12:34Z)
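As a rough illustration of the sparsity idea above (not the paper's exact formulation), each query can be limited to its top-k keys before the softmax. `topk_sparse_attention` and `keep` are hypothetical names; `keep` must not exceed the number of keys.

```python
import torch

def topk_sparse_attention(q, k, v, keep=32):
    """Attention in which each query attends only to its top-k keys;
    all other logits are masked out before the softmax."""
    logits = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # (B, Nq, Nk)
    kth = logits.topk(keep, dim=-1).values[..., -1:]         # k-th largest logit
    logits = logits.masked_fill(logits < kth, float("-inf"))
    return torch.softmax(logits, dim=-1) @ v
```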
- Distilling Token-Pruned Pose Transformer for 2D Human Pose Estimation [0.0]
The recent token-Pruned Pose Transformer (PPT) reduces the cost of global attention by pruning the background tokens of the image.
We present a novel method called Distilling Pruned-Token Transformer for human pose estimation (DPPT).
arXiv Detail & Related papers (2023-04-12T00:46:41Z)
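The summary above does not spell out DPPT's training objective, so the following is only a generic heatmap-distillation sketch: a pretrained teacher's heatmaps supervise the student alongside the ground truth. The weighting `alpha` and the pure-MSE form are assumptions, not the paper's stated loss.

```python
import torch.nn.functional as F

def heatmap_distillation_loss(student_hm, teacher_hm, gt_hm, alpha=0.5):
    """Blend ground-truth supervision with a distillation term that
    pulls the student's heatmaps toward a frozen teacher's."""
    gt_loss = F.mse_loss(student_hm, gt_hm)
    kd_loss = F.mse_loss(student_hm, teacher_hm.detach())
    return alpha * gt_loss + (1.0 - alpha) * kd_loss
```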
- AdaptivePose++: A Powerful Single-Stage Network for Multi-Person Pose Regression [66.39539141222524]
We propose to represent the human parts as adaptive points and introduce a fine-grained body representation method.
With the proposed body representation, we deliver a compact single-stage multi-person pose regression network, termed as AdaptivePose.
We employ AdaptivePose for both 2D and 3D multi-person pose estimation tasks to verify its effectiveness.
arXiv Detail & Related papers (2022-10-08T12:54:20Z)
- DPIT: Dual-Pipeline Integrated Transformer for Human Pose Estimation [24.082220581799156]
We propose a novel Dual-Pipeline Integrated Transformer (DPIT) for human pose estimation.
DPIT consists of two branches: the bottom-up branch processes the whole image to capture global visual information, while the top-down branch extracts complementary local features.
The extracted feature representations from bottom-up and top-down branches are fed into the transformer encoder to fuse the global and local knowledge interactively.
arXiv Detail & Related papers (2022-09-02T10:18:26Z)
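A minimal sketch of the dual-branch fusion described above, assuming both branches have already been tokenized to a common feature dimension; the class name and layer counts are illustrative, not DPIT's actual configuration.

```python
import torch

class DualBranchFusion(torch.nn.Module):
    """Fuse global (bottom-up) and local (top-down) token sequences by
    running joint self-attention over their concatenation."""
    def __init__(self, dim=256, heads=8, layers=4):
        super().__init__()
        layer = torch.nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = torch.nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, global_tokens, local_tokens):
        # (B, Ng, dim) and (B, Nl, dim) -> attention across both sets
        tokens = torch.cat([global_tokens, local_tokens], dim=1)
        return self.encoder(tokens)
```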
- ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation [76.35955924137986]
We show that a plain vision transformer with MAE pretraining can obtain superior performance after finetuning on human pose estimation datasets.
Our biggest ViTPose model based on the ViTAE-G backbone with 1 billion parameters obtains the best 80.9 mAP on the MS COCO test-dev set.
arXiv Detail & Related papers (2022-04-26T17:55:04Z)
- Direct Multi-view Multi-person 3D Pose Estimation [138.48139701871213]
We present Multi-view Pose transformer (MvP) for estimating multi-person 3D poses from multi-view images.
MvP directly regresses the multi-person 3D poses in a clean and efficient way, without relying on intermediate tasks.
We show experimentally that our MvP model outperforms the state-of-the-art methods on several benchmarks while being much more efficient.
arXiv Detail & Related papers (2021-11-07T13:09:20Z)
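In the spirit of MvP's direct regression (its actual design uses hierarchical per-person and per-joint query embeddings), a plain DETR-style decoder that maps learnable queries straight to 3D joints looks roughly like this; all names and sizes are illustrative.

```python
import torch

class DirectPoseRegressor(torch.nn.Module):
    """Learnable queries decode all people's 3D joints in one pass,
    with no intermediate detection or heatmap stage."""
    def __init__(self, dim=256, max_people=10, num_joints=15):
        super().__init__()
        self.num_joints = num_joints
        self.queries = torch.nn.Parameter(torch.randn(max_people, dim))
        layer = torch.nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = torch.nn.TransformerDecoder(layer, num_layers=4)
        self.head = torch.nn.Linear(dim, num_joints * 3)

    def forward(self, feats):
        # feats: (B, N, dim) flattened image tokens from all views
        q = self.queries.unsqueeze(0).expand(feats.shape[0], -1, -1)
        out = self.decoder(q, feats)                      # (B, P, dim)
        return self.head(out).view(out.shape[0], -1, self.num_joints, 3)
```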
- TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation [21.37032015978738]
We introduce a transformer framework for multi-view 3D pose estimation.
Inspired by previous multi-modal transformers, we design a unified transformer architecture, named TransFusion.
We propose the concept of epipolar field to encode 3D positional information into the transformer model.
arXiv Detail & Related papers (2021-10-18T18:08:18Z)
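The epipolar field builds on standard two-view geometry. The NumPy sketch below (not the paper's code) recovers the fundamental matrix from two 3x4 projection matrices and evaluates the epipolar line that a pixel in view 1 induces in view 2.

```python
import numpy as np

def fundamental_from_projections(P1, P2):
    """F such that x2^T F x1 = 0, given projection matrices P1, P2."""
    _, _, Vt = np.linalg.svd(P1)
    C1 = Vt[-1]                            # camera 1 centre (null vector of P1)
    e2 = P2 @ C1                           # epipole in view 2
    e2_x = np.array([[0, -e2[2], e2[1]],
                     [e2[2], 0, -e2[0]],
                     [-e2[1], e2[0], 0]])  # cross-product matrix [e2]_x
    return e2_x @ P2 @ np.linalg.pinv(P1)

def epipolar_line(F, uv):
    """Line coefficients (a, b, c) in view 2 for pixel (u, v) in view 1."""
    return F @ np.array([uv[0], uv[1], 1.0])
```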
- End-to-End Trainable Multi-Instance Pose Estimation with Transformers [68.93512627479197]
We propose a new end-to-end trainable approach for multi-instance pose estimation by combining a convolutional neural network with a transformer.
Inspired by recent work on end-to-end trainable object detection with transformers, we use a transformer encoder-decoder architecture together with a bipartite matching scheme to directly regress the pose of all individuals in a given image.
Our model, called POse Estimation Transformer (POET), is trained using a novel set-based global loss that consists of a keypoint loss, a keypoint visibility loss, a center loss and a class loss.
arXiv Detail & Related papers (2021-03-22T18:19:22Z)
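The bipartite matching step above is typically solved with the Hungarian algorithm. Here is a minimal sketch using only keypoint distance as the matching cost; POET's full cost also involves visibility, center, and class terms, which are omitted.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_poses(pred_kpts, gt_kpts):
    """One-to-one assignment between predicted and ground-truth poses.

    pred_kpts: (P, K, 2) predicted keypoints
    gt_kpts:   (G, K, 2) ground-truth keypoints
    Returns (pred_idx, gt_idx) minimizing the total matching cost.
    """
    diff = pred_kpts[:, None] - gt_kpts[None, :]    # (P, G, K, 2)
    cost = np.linalg.norm(diff, axis=-1).mean(-1)   # mean L2 per pair
    return linear_sum_assignment(cost)
```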
- AdaFuse: Adaptive Multiview Fusion for Accurate Human Pose Estimation in the Wild [77.43884383743872]
We present AdaFuse, an adaptive multiview fusion method to enhance the features in occluded views.
We extensively evaluate the approach on three public datasets including Human3.6M, Total Capture and CMU Panoptic.
We also create a large scale synthetic dataset Occlusion-Person, which allows us to perform numerical evaluation on the occluded joints.
arXiv Detail & Related papers (2020-10-26T03:19:46Z)
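A heavily simplified sketch of adaptive multiview heatmap fusion: it assumes heatmaps from the other views have already been geometrically aligned to the reference view (AdaFuse derives this alignment from camera geometry, omitted here), and the softmax-normalized per-view weights are an assumption rather than the paper's weighting scheme.

```python
import torch

class AdaptiveHeatmapFusion(torch.nn.Module):
    """Fuse a reference view's heatmaps with pre-aligned heatmaps from
    the other views using learned per-view weights."""
    def __init__(self, num_views):
        super().__init__()
        self.logits = torch.nn.Parameter(torch.zeros(num_views - 1))

    def forward(self, ref_hm, other_hms):
        # ref_hm: (B, K, H, W); other_hms: (V-1, B, K, H, W), pre-aligned
        w = torch.softmax(self.logits, dim=0)
        return ref_hm + (w[:, None, None, None, None] * other_hms).sum(dim=0)
```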