PSVT: End-to-End Multi-person 3D Pose and Shape Estimation with
Progressive Video Transformers
- URL: http://arxiv.org/abs/2303.09187v1
- Date: Thu, 16 Mar 2023 09:55:43 GMT
- Title: PSVT: End-to-End Multi-person 3D Pose and Shape Estimation with
Progressive Video Transformers
- Authors: Zhongwei Qiu, Qiansheng Yang, Jian Wang, Haocheng Feng, Junyu Han,
Errui Ding, Chang Xu, Dongmei Fu, Jingdong Wang
- Abstract summary: We propose a new end-to-end multi-person 3D Pose and Shape estimation framework with progressive Video Transformers.
In PSVT, a spatio-temporal encoder (STE) captures the global feature dependencies among spatial objects.
To handle the variances of objects as time proceeds, a novel scheme of progressive decoding is used.
- Score: 71.72888202522644
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing methods of multi-person video 3D human Pose and Shape Estimation
(PSE) typically adopt a two-stage strategy, which first detects human instances
in each frame and then performs single-person PSE with a temporal model. However,
the global spatio-temporal context among spatial instances cannot be captured.
In this paper, we propose a new end-to-end multi-person 3D Pose and Shape
estimation framework with progressive Video Transformer, termed PSVT. In PSVT,
a spatio-temporal encoder (STE) captures the global feature dependencies among
spatial objects. Then, a spatio-temporal pose decoder (STPD) and a spatio-temporal
shape decoder (STSD) capture the global dependencies between pose queries and feature
tokens, and between shape queries and feature tokens, respectively. To handle the variances of
objects as time proceeds, a novel scheme of progressive decoding is used to
update pose and shape queries at each frame. Besides, we propose a novel
pose-guided attention (PGA) for the shape decoder to better predict shape
parameters. The two components strengthen the decoder of PSVT to improve
performance. Extensive experiments on four datasets show that PSVT achieves
state-of-the-art results.
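To make the decoding scheme above concrete, below is a minimal, hypothetical PyTorch sketch of progressive decoding with pose-guided shape decoding. It is not the authors' implementation: all module names, query counts, feature dimensions, and the SMPL-style output sizes are assumptions, and pose-guided attention is approximated by conditioning the shape queries on the decoded pose features before cross-attention.

```python
import torch
import torch.nn as nn

class ProgressiveDecoderSketch(nn.Module):
    def __init__(self, dim=256, num_queries=10, num_heads=8):
        super().__init__()
        # learnable initial pose/shape queries, one set per person slot (assumption)
        self.pose_query = nn.Parameter(torch.randn(num_queries, dim))
        self.shape_query = nn.Parameter(torch.randn(num_queries, dim))
        self.pose_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.shape_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.pose_head = nn.Linear(dim, 24 * 3)   # SMPL-style pose parameters (assumption)
        self.shape_head = nn.Linear(dim, 10)      # SMPL-style shape parameters (assumption)

    def forward(self, frame_tokens):
        # frame_tokens: (T, N_tokens, dim) features from the spatio-temporal encoder
        pose_q = self.pose_query.unsqueeze(0)     # (1, Q, dim)
        shape_q = self.shape_query.unsqueeze(0)   # (1, Q, dim)
        poses, shapes = [], []
        for t in range(frame_tokens.shape[0]):
            tokens = frame_tokens[t].unsqueeze(0)  # (1, N_tokens, dim)
            # pose decoding: queries cross-attend to this frame's tokens
            pose_q, _ = self.pose_attn(pose_q, tokens, tokens)
            # pose-guided shape decoding: shape queries are conditioned on the
            # decoded pose features before attending to the tokens (approximation of PGA)
            shape_q, _ = self.shape_attn(shape_q + pose_q, tokens, tokens)
            poses.append(self.pose_head(pose_q))
            shapes.append(self.shape_head(shape_q))
            # progressive update: the refined queries are carried over to frame t+1
        return torch.stack(poses, dim=1), torch.stack(shapes, dim=1)

# toy usage: 4 frames, 196 spatial tokens, 256-d features
decoder = ProgressiveDecoderSketch()
pose_out, shape_out = decoder(torch.randn(4, 196, 256))
print(pose_out.shape, shape_out.shape)  # (1, 4, 10, 72) and (1, 4, 10, 10)
```

The point the sketch illustrates is that the refined pose and shape queries from frame t serve as the initial queries for frame t+1, which is how the progressive decoding scheme handles objects whose appearance varies over time.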
Related papers
- UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues [55.69339788566899]
UPose3D is a novel approach for multi-view 3D human pose estimation.
It improves robustness and flexibility without requiring direct 3D annotations.
arXiv Detail & Related papers (2024-04-23T00:18:00Z) - Dynamic 3D Point Cloud Sequences as 2D Videos [81.46246338686478]
3D point cloud sequences serve as one of the most common and practical representation modalities of real-world environments.
We propose a novel generic representation called Structured Point Cloud Videos (SPCVs).
SPCVs re-organize a point cloud sequence as a 2D video with spatial smoothness and temporal consistency, where the pixel values correspond to the 3D coordinates of points.
arXiv Detail & Related papers (2024-03-02T08:18:57Z) - Geometry-Biased Transformer for Robust Multi-View 3D Human Pose
Reconstruction [3.069335774032178]
We propose a novel encoder-decoder Transformer architecture to estimate 3D poses from multi-view 2D pose sequences.
We conduct experiments on three benchmark public datasets, Human3.6M, CMU Panoptic and Occlusion-Persons.
arXiv Detail & Related papers (2023-12-28T16:30:05Z) - Kinematic-aware Hierarchical Attention Network for Human Pose Estimation
in Videos [17.831839654593452]
Previous video-based human pose estimation methods have shown promising results by leveraging features of consecutive frames.
However, most approaches compromise accuracy to reduce jitter, or do not sufficiently comprehend the temporal aspects of human motion.
We design an architecture that exploits kinematic keypoint features.
arXiv Detail & Related papers (2022-11-29T01:46:11Z) - PSFormer: Point Transformer for 3D Salient Object Detection [8.621996554264275]
PSFormer is an encoder-decoder network that takes full advantage of transformers to model contextual information.
In the encoder, we develop a Point Context Transformer (PCT) module to capture region contextual features at the point level.
In the decoder, we develop a Scene Context Transformer (SCT) module to learn context representations at the scene level.
arXiv Detail & Related papers (2022-10-28T06:34:28Z) - AdaptivePose++: A Powerful Single-Stage Network for Multi-Person Pose
Regression [66.39539141222524]
We propose to represent the human parts as adaptive points and introduce a fine-grained body representation method.
With the proposed body representation, we deliver a compact single-stage multi-person pose regression network, termed as AdaptivePose.
We employ AdaptivePose for both 2D/3D multi-person pose estimation tasks to verify the effectiveness of AdaptivePose.
arXiv Detail & Related papers (2022-10-08T12:54:20Z) - IVT: An End-to-End Instance-guided Video Transformer for 3D Pose
Estimation [6.270047084514142]
Video 3D human pose estimation aims to localize the 3D coordinates of human joints from videos.
IVT enables learning spatio-temporal contextual depth information from visual features and predicting 3D poses directly from video frames.
Experiments on three widely-used 3D pose estimation benchmarks show that the proposed IVT achieves state-of-the-art performances.
arXiv Detail & Related papers (2022-08-06T02:36:33Z) - Encoder-decoder with Multi-level Attention for 3D Human Shape and Pose
Estimation [61.98690211671168]
We propose a Multi-level Attention-Decoder Network (MAED) to model multi-level attentions in a unified framework.
With the training set of 3DPW, MAED outperforms previous state-of-the-art methods by 6.2, 7.2, and 2.4 mm in PA-MPJPE.
arXiv Detail & Related papers (2021-09-06T09:06:17Z) - 3D Human Pose Estimation with Spatial and Temporal Transformers [59.433208652418976]
We present PoseFormer, a purely transformer-based approach for 3D human pose estimation in videos.
Inspired by recent developments in vision transformers, we design a spatial-temporal transformer structure.
We quantitatively and qualitatively evaluate our method on two popular and standard benchmark datasets.
arXiv Detail & Related papers (2021-03-18T18:14:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.