HSTFormer: Hierarchical Spatial-Temporal Transformers for 3D Human Pose
Estimation
- URL: http://arxiv.org/abs/2301.07322v1
- Date: Wed, 18 Jan 2023 05:54:02 GMT
- Title: HSTFormer: Hierarchical Spatial-Temporal Transformers for 3D Human Pose
Estimation
- Authors: Xiaoye Qian, Youbao Tang, Ning Zhang, Mei Han, Jing Xiao, Ming-Chun
Huang, Ruei-Sung Lin
- Abstract summary: We propose Hierarchical Spatial-Temporal transFormers (HSTFormer) to gradually capture joints' multi-level spatial-temporal correlations, from local to global, for accurate 3D human pose estimation.
HSTFormer consists of four transformer encoders (TEs) and a fusion module. To the best of our knowledge, HSTFormer is the first to study hierarchical TEs with multi-level fusion.
It surpasses recent SOTAs on the challenging MPI-INF-3DHP dataset and the small-scale HumanEva dataset with a highly generalized, systematic approach.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based approaches have been successfully proposed for 3D human
pose estimation (HPE) from 2D pose sequences and have achieved state-of-the-art
(SOTA) performance. However, current SOTAs have difficulty modeling the
spatial-temporal correlations of joints at different levels simultaneously.
This is due to the spatial-temporal complexity of poses: they move at varying
speeds temporally, with diverse joint and body-part movements spatially.
Hence, a cookie-cutter transformer is non-adaptable and can hardly meet the
"in-the-wild" requirement. To mitigate this issue, we propose Hierarchical
Spatial-Temporal transFormers (HSTFormer) to gradually capture joints'
multi-level spatial-temporal correlations, from local to global, for accurate
3D HPE. HSTFormer consists of four transformer encoders (TEs) and a fusion module.
To the best of our knowledge, HSTFormer is the first to study hierarchical TEs
with multi-level fusion. Extensive experiments on three datasets (i.e.,
Human3.6M, MPI-INF-3DHP, and HumanEva) demonstrate that HSTFormer achieves
competitive and consistent performance on benchmarks with various scales and
difficulties. Specifically, it surpasses recent SOTAs on the challenging
MPI-INF-3DHP dataset and the small-scale HumanEva dataset with a highly
generalized, systematic approach. The code is available at:
https://github.com/qianxiaoye825/HSTFormer.
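The abstract describes four transformer encoders at different levels plus a fusion module. The sketch below illustrates only the hierarchical-fusion idea, not the paper's implementation: the joint grouping, the mean-pooling "encoders" standing in for real transformer encoders, and the softmax fusion weights are all illustrative assumptions.

```python
# Illustrative sketch of a local-to-global hierarchy with multi-level fusion.
# Assumptions (not from the paper): pooling stands in for transformer encoders,
# PARTS is a hypothetical joint grouping, and fusion weights start uniform.
import math

NUM_JOINTS = 17  # common Human3.6M skeleton size
PARTS = {  # hypothetical grouping of joints into body parts
    "torso": [0, 7, 8, 9, 10],
    "left_arm": [11, 12, 13],
    "right_arm": [14, 15, 16],
    "left_leg": [4, 5, 6],
    "right_leg": [1, 2, 3],
}

def encode_joints(frame):
    """Joint-level 'encoder': identity pass over per-joint features."""
    return list(frame)

def encode_parts(frame):
    """Part-level 'encoder': pool the joints inside each body part."""
    return {name: sum(frame[j] for j in idx) / len(idx)
            for name, idx in PARTS.items()}

def encode_body(frame):
    """Body-level 'encoder': pool all joints into one global feature."""
    return sum(frame) / len(frame)

def encode_temporal(frames):
    """Temporal 'encoder': pool each joint across the whole sequence."""
    return [sum(f[j] for f in frames) / len(frames) for j in range(NUM_JOINTS)]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def fuse(frames, logits=(0.0, 0.0, 0.0, 0.0)):
    """Fusion module: blend the four levels per joint with softmax weights."""
    frame = frames[len(frames) // 2]  # centre frame, as in many 2D-to-3D lifters
    w = softmax(list(logits))
    joint = encode_joints(frame)
    part = encode_parts(frame)
    body = encode_body(frame)
    temporal = encode_temporal(frames)
    # Broadcast part/body features back to each joint before blending.
    part_per_joint = [0.0] * NUM_JOINTS
    for name, idx in PARTS.items():
        for j in idx:
            part_per_joint[j] = part[name]
    return [w[0] * joint[j] + w[1] * part_per_joint[j]
            + w[2] * body + w[3] * temporal[j]
            for j in range(NUM_JOINTS)]
```

The point of the sketch is only the data flow: each level summarizes a progressively wider context (joint, part, body, sequence), and the fusion step combines them per joint.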
Related papers
- Enhancing 3D Human Pose Estimation Amidst Severe Occlusion with Dual Transformer Fusion [13.938406073551844]
This paper introduces a Dual Transformer Fusion (DTF) algorithm, a novel approach to holistic 3D pose estimation.
To enable precise 3D Human Pose Estimation, our approach leverages the innovative DTF architecture, which first generates a pair of intermediate views.
Our approach outperforms existing state-of-the-art methods on both datasets, yielding substantial improvements.
arXiv Detail & Related papers (2024-10-06T18:15:27Z) - Graph and Skipped Transformer: Exploiting Spatial and Temporal Modeling Capacities for Efficient 3D Human Pose Estimation [36.93661496405653]
We take a global approach, exploiting spatio-temporal information with a concise Graph and Skipped Transformer architecture.
Specifically, in 3D pose stage, coarse-grained body parts are deployed to construct a fully data-driven adaptive model.
Experiments are conducted on the Human3.6M, MPI-INF-3DHP and HumanEva benchmarks.
arXiv Detail & Related papers (2024-07-03T10:42:09Z) - ConvFormer: Parameter Reduction in Transformer Models for 3D Human Pose
Estimation by Leveraging Dynamic Multi-Headed Convolutional Attention [0.0]
ConvFormer is a novel convolutional transformer for the 3D human pose estimation task.
We have validated our method on three common benchmark datasets: Human3.6M, MPI-INF-3DHP, and HumanEva.
arXiv Detail & Related papers (2023-04-04T22:23:50Z) - PSVT: End-to-End Multi-person 3D Pose and Shape Estimation with
Progressive Video Transformers [71.72888202522644]
We propose a new end-to-end multi-person 3D Pose and Shape estimation framework with a progressive Video Transformer.
In PSVT, a spatio-temporal encoder (PGA) captures the global feature dependencies among spatial objects.
To handle the variances of objects as time proceeds, a novel scheme of progressive decoding is used.
arXiv Detail & Related papers (2023-03-16T09:55:43Z) - HDFormer: High-order Directed Transformer for 3D Human Pose Estimation [20.386530242069338]
HDFormer significantly outperforms state-of-the-art (SOTA) models on Human3.6M and MPI-INF-3DHP datasets.
HDFormer demonstrates broad real-world applicability, enabling real-time, accurate 3D pose estimation.
arXiv Detail & Related papers (2023-02-03T16:00:48Z) - P-STMO: Pre-Trained Spatial Temporal Many-to-One Model for 3D Human Pose
Estimation [78.83305967085413]
This paper introduces a novel Pre-trained Spatial Temporal Many-to-One (P-STMO) model for 2D-to-3D human pose estimation task.
Our method outperforms state-of-the-art methods with fewer parameters and less computational overhead.
arXiv Detail & Related papers (2022-03-15T04:00:59Z) - MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose
Estimation in Video [75.23812405203778]
Recent solutions have been introduced to estimate 3D human pose from a 2D keypoint sequence by considering body joints among all frames globally to learn spatio-temporal correlation.
We propose MixSTE, which has a temporal transformer block to separately model the temporal motion of each joint and a spatial transformer block to model inter-joint spatial correlation.
In addition, the network output is extended from the central frame to the entire frames of the input video, improving the coherence between the input and output sequences.
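The seq2seq factorisation described above, alternating attention over joints within a frame with attention over frames per joint, can be sketched minimally. This is a toy stand-in, not MixSTE's implementation: the real blocks use multi-head self-attention with MLPs, while `attend` below is scalar dot-product attention for illustration.

```python
# Toy sketch of alternating spatial/temporal blocks over a (frames x joints)
# grid of scalar features. Assumption: `attend` replaces real multi-head
# self-attention; the alternation pattern is the point being illustrated.
import math

def attend(seq):
    """1-D self-attention over scalars: softmax(q*k)-weighted sum of values."""
    out = []
    for q in seq:
        logits = [q * k for k in seq]
        m = max(logits)
        es = [math.exp(l - m) for l in logits]
        s = sum(es)
        out.append(sum(e / s * v for e, v in zip(es, seq)))
    return out

def mixed_block(x):
    """One spatial block (over joints, per frame) followed by one temporal
    block (over frames, per joint) -- the seq2seq factorisation sketched."""
    # Spatial: each row of x is one frame's joints.
    x = [attend(frame) for frame in x]
    # Temporal: transpose so each row is one joint's trajectory, then back.
    joints = list(map(list, zip(*x)))
    joints = [attend(traj) for traj in joints]
    return list(map(list, zip(*joints)))

def lift(x, depth=2):
    """Stack blocks; the output keeps the full (frames x joints) shape, i.e.
    a prediction for every input frame rather than only the centre one."""
    for _ in range(depth):
        x = mixed_block(x)
    return x
```

Note that `lift` returns one output row per input frame, mirroring the seq2seq design choice of predicting the whole sequence instead of a single central frame.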
arXiv Detail & Related papers (2022-03-02T04:20:59Z) - Geometry-Contrastive Transformer for Generalized 3D Pose Transfer [95.56457218144983]
The intuition of this work is to perceive the geometric inconsistency between the given meshes with the powerful self-attention mechanism.
We propose a novel geometry-contrastive Transformer with an efficient, 3D-structured perceiving ability for global geometric inconsistencies.
We present a latent isometric regularization module together with a novel semi-synthesized dataset for the cross-dataset 3D pose transfer task.
arXiv Detail & Related papers (2021-12-14T13:14:24Z) - Encoder-decoder with Multi-level Attention for 3D Human Shape and Pose
Estimation [61.98690211671168]
We propose a Multi-level Attention Encoder-Decoder Network (MAED) to model multi-level attentions in a unified framework.
With the training set of 3DPW, MAED outperforms previous state-of-the-art methods by 6.2, 7.2, and 2.4 mm of PA-MPJPE.
arXiv Detail & Related papers (2021-09-06T09:06:17Z) - 3D Human Pose Estimation with Spatial and Temporal Transformers [59.433208652418976]
We present PoseFormer, a purely transformer-based approach for 3D human pose estimation in videos.
Inspired by recent developments in vision transformers, we design a spatial-temporal transformer structure.
We quantitatively and qualitatively evaluate our method on two popular and standard benchmark datasets.
arXiv Detail & Related papers (2021-03-18T18:14:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.