MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose
Estimation in Video
- URL: http://arxiv.org/abs/2203.00859v2
- Date: Thu, 3 Mar 2022 02:50:33 GMT
- Title: MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose
Estimation in Video
- Authors: Jinlu Zhang, Zhigang Tu, Jianyu Yang, Yujin Chen, Junsong Yuan
- Abstract summary: Recent solutions have been introduced to estimate 3D human pose from 2D keypoint sequence by considering body joints among all frames globally to learn-temporal correlation.
We propose Mix Mix, which has temporal transformer block to separately model the temporal motion of each joint and a transformer block inter-joint spatial correlation.
In addition, the network output is extended from the central frame to entire frames of input video, improving the coherence between the input and output benchmarks.
- Score: 75.23812405203778
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent transformer-based solutions have been introduced to estimate 3D human
pose from 2D keypoint sequence by considering body joints among all frames
globally to learn spatio-temporal correlation. We observe that the motions of
different joints differ significantly. However, the previous methods cannot
efficiently model the solid inter-frame correspondence of each joint, leading
to insufficient learning of spatial-temporal correlation. We propose MixSTE
(Mixed Spatio-Temporal Encoder), which has a temporal transformer block to
separately model the temporal motion of each joint and a spatial transformer
block to learn inter-joint spatial correlation. These two blocks are utilized
alternately to obtain better spatio-temporal feature encoding. In addition, the
network output is extended from the central frame to entire frames of the input
video, thereby improving the coherence between the input and output sequences.
Extensive experiments are conducted on three benchmarks (i.e. Human3.6M,
MPI-INF-3DHP, and HumanEva) to evaluate the proposed method. The results show
that our model outperforms the state-of-the-art approach by 10.9% P-MPJPE and
7.6% MPJPE on the Human3.6M dataset. Code is available.
Related papers
- Enhancing 3D Human Pose Estimation Amidst Severe Occlusion with Dual Transformer Fusion [13.938406073551844]
This paper introduces a Dual Transformer Fusion (DTF) algorithm, a novel approach to obtain a holistic 3D pose estimation.
To enable precise 3D Human Pose Estimation, our approach leverages the innovative DTF architecture, which first generates a pair of intermediate views.
Our approach outperforms existing state-of-the-art methods on both datasets, yielding substantial improvements.
arXiv Detail & Related papers (2024-10-06T18:15:27Z) - SkelFormer: Markerless 3D Pose and Shape Estimation using Skeletal Transformers [57.46911575980854]
We introduce SkelFormer, a novel markerless motion capture pipeline for multi-view human pose and shape estimation.
Our method first uses off-the-shelf 2D keypoint estimators, pre-trained on large-scale in-the-wild data, to obtain 3D joint positions.
Next, we design a regression-based inverse-kinematic skeletal transformer that maps the joint positions to pose and shape representations from heavily noisy observations.
arXiv Detail & Related papers (2024-04-19T04:51:18Z) - Global-to-Local Modeling for Video-based 3D Human Pose and Shape
Estimation [53.04781510348416]
Video-based 3D human pose and shape estimations are evaluated by intra-frame accuracy and inter-frame smoothness.
We propose to structurally decouple the modeling of long-term and short-term correlations in an end-to-end framework, Global-to-Local Transformer (GLoT)
Our GLoT surpasses previous state-of-the-art methods with the lowest model parameters on popular benchmarks, i.e., 3DPW, MPI-INF-3DHP, and Human3.6M.
arXiv Detail & Related papers (2023-03-26T14:57:49Z) - Kinematic-aware Hierarchical Attention Network for Human Pose Estimation
in Videos [17.831839654593452]
Previous-based human pose estimation methods have shown promising results by leveraging features of consecutive frames.
Most approaches compromise accuracy to jitter and do not comprehend the temporal aspects of human motion.
We design an architecture that exploits kinematic keypoint features.
arXiv Detail & Related papers (2022-11-29T01:46:11Z) - (Fusionformer):Exploiting the Joint Motion Synergy with Fusion Network
Based On Transformer for 3D Human Pose Estimation [1.52292571922932]
Many previous methods lack the understanding of local joint information.cite8888987considers the temporal relationship of a single joint in this work.
Our proposed textbfFusionformer method introduces a global-temporal self-trajectory module and a cross-temporal self-trajectory module.
The results show an improvement of 2.4% MPJPE and 4.3% P-MPJPE on the Human3.6M dataset.
arXiv Detail & Related papers (2022-10-08T12:22:10Z) - Kinematics Modeling Network for Video-based Human Pose Estimation [9.506011491028891]
Estimating human poses from videos is critical in human-computer interaction.
Joints cooperate rather than move independently during human movement.
We propose a plug-and-play kinematics modeling module (KMM) to explicitly model temporal correlations between joints.
arXiv Detail & Related papers (2022-07-22T09:37:48Z) - CrossFormer: Cross Spatio-Temporal Transformer for 3D Human Pose
Estimation [24.08170512746056]
3D human pose estimation can be handled by encoding the geometric dependencies between the body parts and enforcing the kinematic constraints.
Recent Transformer has been adopted to encode the long-range dependencies between the joints in the spatial and temporal domains.
We propose a novel pose estimation Transformer featuring rich representations of body joints critical for capturing subtle changes across frames.
arXiv Detail & Related papers (2022-03-24T23:40:11Z) - Encoder-decoder with Multi-level Attention for 3D Human Shape and Pose
Estimation [61.98690211671168]
We propose a Multi-level Attention-Decoder Network (MAED) to model multi-level attentions in a unified framework.
With the training set of 3DPW, MAED outperforms previous state-of-the-art methods by 6.2, 7.2, and 2.4 mm of PA-MPJPE.
arXiv Detail & Related papers (2021-09-06T09:06:17Z) - Self-Attentive 3D Human Pose and Shape Estimation from Videos [82.63503361008607]
We present a video-based learning algorithm for 3D human pose and shape estimation.
We exploit temporal information in videos and propose a self-attention module.
We evaluate our method on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets.
arXiv Detail & Related papers (2021-03-26T00:02:19Z) - MotioNet: 3D Human Motion Reconstruction from Monocular Video with
Skeleton Consistency [72.82534577726334]
We introduce MotioNet, a deep neural network that directly reconstructs the motion of a 3D human skeleton from monocular video.
Our method is the first data-driven approach that directly outputs a kinematic skeleton, which is a complete, commonly used, motion representation.
arXiv Detail & Related papers (2020-06-22T08:50:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.