P-STMO: Pre-Trained Spatial Temporal Many-to-One Model for 3D Human Pose
Estimation
- URL: http://arxiv.org/abs/2203.07628v1
- Date: Tue, 15 Mar 2022 04:00:59 GMT
- Title: P-STMO: Pre-Trained Spatial Temporal Many-to-One Model for 3D Human Pose
Estimation
- Authors: Wenkang Shan, Zhenhua Liu, Xinfeng Zhang, Shanshe Wang, Siwei Ma, Wen
Gao
- Abstract summary: This paper introduces a novel Pre-trained Spatial Temporal Many-to-One (P-STMO) model for the 2D-to-3D human pose estimation task.
Our method outperforms state-of-the-art methods with fewer parameters and less computational overhead.
- Score: 78.83305967085413
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces a novel Pre-trained Spatial Temporal Many-to-One
(P-STMO) model for the 2D-to-3D human pose estimation task. To reduce the
difficulty of capturing spatial and temporal information, we divide this task
into two stages: pre-training (Stage I) and fine-tuning (Stage II). In Stage I,
a self-supervised pre-training sub-task, termed masked pose modeling, is
proposed. The human joints in the input sequence are randomly masked in both
spatial and temporal domains. A general form of denoising auto-encoder is
exploited to recover the original 2D poses and the encoder is capable of
capturing spatial and temporal dependencies in this way. In Stage II, the
pre-trained encoder is loaded into the STMO model and fine-tuned. The encoder is
followed by a many-to-one frame aggregator to predict the 3D pose in the
current frame. In particular, an MLP block is used as the spatial feature
extractor in STMO, which yields better performance than other methods. In
addition, a temporal downsampling strategy is proposed to diminish data
redundancy. Extensive experiments on two benchmarks show that our method
outperforms state-of-the-art methods with fewer parameters and less
computational overhead. For example, our P-STMO model achieves 42.1 mm MPJPE on
the Human3.6M dataset when using 2D poses from CPN as inputs. Meanwhile, it brings
a 1.5-7.1x speedup over state-of-the-art methods. Code is available at
https://github.com/paTRICK-swk/P-STMO.
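As a rough illustration of the masked pose modeling pre-training described above, the sketch below randomly hides joints of a 2D pose sequence in both the temporal and spatial domains and measures a reconstruction loss at the hidden positions. The masking ratios, tensor shapes, and function names are assumptions made for the example; the authors' actual implementation is in the linked repository.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_pose_modeling_inputs(seq, frame_mask_ratio=0.4, joint_mask_ratio=0.1):
    """Randomly mask a 2D pose sequence in the temporal and spatial domains.

    seq: (T, J, 2) array of 2D joint coordinates.
    Returns the corrupted sequence and a boolean mask marking hidden joints;
    a denoising auto-encoder is trained to recover `seq` from the corrupted copy.
    """
    T, J, _ = seq.shape
    hidden = np.zeros((T, J), dtype=bool)

    # Temporal masking: hide every joint of randomly chosen frames.
    n_frames = int(round(frame_mask_ratio * T))
    hidden[rng.choice(T, size=n_frames, replace=False), :] = True

    # Spatial masking: additionally hide random joints in the remaining frames.
    hidden |= rng.random((T, J)) < joint_mask_ratio

    corrupted = seq.copy()
    corrupted[hidden] = 0.0  # hidden joints are replaced by a constant token
    return corrupted, hidden

# Toy example: a 243-frame window with 17 joints (Human3.6M-style skeleton).
seq = rng.standard_normal((243, 17, 2))
corrupted, hidden = masked_pose_modeling_inputs(seq)

# Stage I objective: reconstruct the original poses at the hidden positions.
reconstruction = corrupted  # stand-in for the output of the auto-encoder
loss = np.mean((reconstruction[hidden] - seq[hidden]) ** 2)
print(f"{hidden.mean():.1%} of joints hidden, reconstruction MSE {loss:.3f}")

# The temporal downsampling mentioned in the abstract can be as simple as a
# strided slice over frames (the stride here is arbitrary):
downsampled = corrupted[::3]
```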
Related papers
- Graph and Skipped Transformer: Exploiting Spatial and Temporal Modeling Capacities for Efficient 3D Human Pose Estimation [36.93661496405653]
We take a global approach to exploit spatio-temporal information with a concise Graph and Skipped Transformer architecture.
Specifically, in the 3D pose stage, coarse-grained body parts are deployed to construct a fully data-driven adaptive model.
Experiments are conducted on the Human3.6M, MPI-INF-3DHP and HumanEva benchmarks.
arXiv Detail & Related papers (2024-07-03T10:42:09Z)
- Deciphering Movement: Unified Trajectory Generation Model for Multi-Agent [53.637837706712794]
We propose a Unified Trajectory Generation model, UniTraj, that processes arbitrary trajectories as masked inputs.
Specifically, we introduce a Ghost Spatial Masking (GSM) module embedded within a Transformer encoder for spatial feature extraction.
We benchmark three practical sports game datasets, Basketball-U, Football-U, and Soccer-U, for evaluation.
arXiv Detail & Related papers (2024-05-27T22:15:23Z)
- TEMPO: Efficient Multi-View Pose Estimation, Tracking, and Forecasting [27.3359362364858]
We present an efficient multi-view pose estimation model that learns a robust temporal representation.
Our model is able to generalize across datasets without fine-tuning.
arXiv Detail & Related papers (2023-09-14T17:56:30Z)
- PoseFormerV2: Exploring Frequency Domain for Efficient and Robust 3D Human Pose Estimation [19.028127284305224]
We propose PoseFormerV2, which exploits a compact representation of lengthy skeleton sequences in the frequency domain to efficiently scale up the receptive field.
With minimal modifications to PoseFormer, the proposed method effectively fuses features in both the time and frequency domains, enjoying a better speed-accuracy trade-off than its precursor.
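As a loose sketch of the frequency-domain idea summarized here, the example below represents each joint trajectory of a long 2D pose sequence by a handful of low-frequency DCT coefficients. The number of kept coefficients, shapes, and function names are illustrative assumptions rather than PoseFormerV2's actual design.

```python
import numpy as np

def dct_matrix(T):
    """Orthonormal DCT-II basis of size T x T."""
    n = np.arange(T)
    basis = np.cos(np.pi / T * (n[None, :] + 0.5) * n[:, None])
    basis[0] /= np.sqrt(2.0)
    return basis * np.sqrt(2.0 / T)

def compress_sequence(seq2d, n_coeffs=8):
    """Keep only the n_coeffs lowest-frequency DCT coefficients per joint coordinate.

    seq2d: (T, J, 2) sequence of 2D poses.
    Returns (n_coeffs, J, 2) coefficients -- a compact stand-in for the full sequence.
    """
    T = seq2d.shape[0]
    D = dct_matrix(T)                              # (T, T) cosine basis
    coeffs = np.einsum('ft,tjc->fjc', D, seq2d)    # project trajectories onto the basis
    return coeffs[:n_coeffs]

def reconstruct(coeffs, T):
    """Inverse transform of the truncated coefficients (a smoothed sequence)."""
    D = dct_matrix(T)
    return np.einsum('ft,fjc->tjc', D[:coeffs.shape[0]], coeffs)

rng = np.random.default_rng(0)
seq = np.cumsum(0.01 * rng.standard_normal((81, 17, 2)), axis=0)  # smooth toy motion
coeffs = compress_sequence(seq, n_coeffs=8)        # 81 frames -> 8 coefficients per joint
approx = reconstruct(coeffs, T=81)
print("relative error:", np.linalg.norm(approx - seq) / np.linalg.norm(seq))
```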
arXiv Detail & Related papers (2023-03-30T15:45:51Z)
- Uplift and Upsample: Efficient 3D Human Pose Estimation with Uplifting Transformers [28.586258731448687]
We present a Transformer-based pose uplifting scheme that can operate on temporally sparse 2D pose sequences.
We show how masked token modeling can be utilized for temporal upsampling within Transformer blocks.
We evaluate our method on two popular benchmark datasets: Human3.6M and MPI-INF-3DHP.
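A rough sketch of how masked token modeling can drive temporal upsampling, as summarized above: unobserved frames are filled with a learnable mask token so that a Transformer over the dense token sequence can predict poses for every frame. The keyframe stride, feature size, and function name are assumptions for illustration only.

```python
import numpy as np

def build_upsampling_tokens(sparse_feats, keyframe_idx, total_frames, mask_token):
    """Assemble a dense token sequence from temporally sparse pose features.

    sparse_feats: (K, D) features of the observed keyframes.
    keyframe_idx: indices of those keyframes within the full window.
    mask_token:   (D,) embedding standing in for unobserved frames.
    A Transformer over the returned (total_frames, D) sequence can then
    predict poses for every frame, i.e. upsample in time.
    """
    tokens = np.tile(mask_token, (total_frames, 1))  # start with mask tokens everywhere
    tokens[keyframe_idx] = sparse_feats              # drop in the observed keyframes
    observed = np.zeros(total_frames, dtype=bool)
    observed[keyframe_idx] = True
    return tokens, observed

# Toy example: 81-frame window, keyframes every 8th frame, 256-dim features.
rng = np.random.default_rng(0)
keyframes = np.arange(0, 81, 8)
tokens, observed = build_upsampling_tokens(
    sparse_feats=rng.standard_normal((len(keyframes), 256)),
    keyframe_idx=keyframes,
    total_frames=81,
    mask_token=np.zeros(256),  # would be a learned parameter in practice
)
print(tokens.shape, observed.sum(), "of", len(observed), "frames observed")
```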
arXiv Detail & Related papers (2022-10-12T12:00:56Z)
- A generic diffusion-based approach for 3D human pose prediction in the wild [68.00961210467479]
3D human pose forecasting, i.e., predicting a sequence of future human 3D poses given a sequence of past observed ones, is a challenging spatio-temporal task.
We provide a unified formulation in which incomplete elements (whether in the prediction or in the observation) are treated as noise, and propose a conditional diffusion model that denoises them and forecasts plausible poses.
We investigate our findings on four standard datasets and obtain significant improvements over the state-of-the-art.
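The unified formulation can be sketched loosely as follows: missing entries (future frames to forecast, or occluded observations) start as noise and are iteratively denoised, while observed entries are clamped to their known values at every step. The denoiser, noise schedule, and all names below are placeholders, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)

def dummy_denoiser(x_t, t, cond):
    """Stand-in for the learned network that predicts the clean poses."""
    return 0.9 * x_t + 0.1 * cond  # placeholder, not a trained model

def impute_and_forecast(poses, observed_mask, steps=50):
    """Toy conditional-diffusion sampler for incomplete pose sequences.

    poses:         (T, J, 3) array; values at unobserved entries are ignored.
    observed_mask: boolean array of the same shape, True where poses are known.
    """
    x = np.where(observed_mask, poses, rng.standard_normal(poses.shape))
    for t in reversed(range(steps)):
        x0_hat = dummy_denoiser(x, t, np.where(observed_mask, poses, 0.0))
        noise_scale = t / steps                    # simple linear schedule for the sketch
        x = x0_hat + noise_scale * rng.standard_normal(poses.shape)
        x = np.where(observed_mask, poses, x)      # re-impose the known observations
    return x

# Toy usage: 25 past frames observed, 25 future frames to forecast, 17 joints.
poses = rng.standard_normal((50, 17, 3))
observed = np.zeros_like(poses, dtype=bool)
observed[:25] = True
sample = impute_and_forecast(poses, observed)
print(sample.shape)
```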
arXiv Detail & Related papers (2022-10-11T17:59:54Z)
- Gait Recognition in the Wild with Multi-hop Temporal Switch [81.35245014397759]
Gait recognition in the wild is a more practical problem that has attracted the attention of the multimedia and computer vision communities.
This paper presents a novel multi-hop temporal switch method to achieve effective temporal modeling of gait patterns in real-world scenes.
arXiv Detail & Related papers (2022-09-01T10:46:09Z)
- Encoder-decoder with Multi-level Attention for 3D Human Shape and Pose Estimation [61.98690211671168]
We propose a Multi-level Attention Encoder-Decoder Network (MAED) to model multi-level attentions in a unified framework.
With the training set of 3DPW, MAED outperforms previous state-of-the-art methods by 6.2, 7.2, and 2.4 mm in PA-MPJPE.
arXiv Detail & Related papers (2021-09-06T09:06:17Z)
- Synthetic Training for Monocular Human Mesh Recovery [100.38109761268639]
This paper aims to estimate the 3D mesh of multiple body parts with large differences in scale from a single RGB image.
The main challenge is the lack of training data with complete 3D annotations of all body parts in 2D images.
We propose a depth-to-scale (D2S) projection to incorporate the depth difference into the projection function to derive per-joint scale variants.
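One plausible reading of the depth-to-scale (D2S) projection is sketched below, under the assumption that each joint is projected with a scale factor depending on its depth offset from the root; the focal length, shapes, and function name are illustrative, and the exact formulation is in the cited paper.

```python
import numpy as np

def d2s_projection(joints_3d, root_depth, focal=1000.0):
    """Sketch of a depth-to-scale style projection.

    Each joint is projected with its own scale factor focal / (root_depth + dz),
    so joints whose depth differs from the root get different 2D scales.
    joints_3d: (J, 3) root-relative joint positions (x, y, dz).
    """
    dz = joints_3d[:, 2]
    scale = focal / (root_depth + dz)          # per-joint scale variant
    return joints_3d[:, :2] * scale[:, None]   # (J, 2) image-plane coordinates

# Toy usage with a hypothetical 17-joint skeleton roughly 5 m from the camera.
rng = np.random.default_rng(0)
joints = rng.uniform(-0.8, 0.8, size=(17, 3))
print(d2s_projection(joints, root_depth=5.0).shape)
```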
arXiv Detail & Related papers (2020-10-27T03:31:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.