PoseFormerV2: Exploring Frequency Domain for Efficient and Robust 3D
Human Pose Estimation
- URL: http://arxiv.org/abs/2303.17472v1
- Date: Thu, 30 Mar 2023 15:45:51 GMT
- Title: PoseFormerV2: Exploring Frequency Domain for Efficient and Robust 3D
Human Pose Estimation
- Authors: Qitao Zhao, Ce Zheng, Mengyuan Liu, Pichao Wang, Chen Chen
- Abstract summary: We propose PoseFormerV2, which exploits a compact representation of lengthy skeleton sequences in the frequency domain to efficiently scale up the receptive field.
With minimum modifications to PoseFormer, the proposed method effectively fuses features both in the time domain and frequency domain, enjoying a better speed-accuracy trade-off than its precursor.
- Score: 19.028127284305224
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, transformer-based methods have gained significant success in
sequential 2D-to-3D lifting human pose estimation. As a pioneering work,
PoseFormer captures spatial relations of human joints in each video frame and
human dynamics across frames with cascaded transformer layers and has achieved
impressive performance. However, in real scenarios, the performance of
PoseFormer and its follow-ups is limited by two factors: (a) The length of the
input joint sequence; (b) The quality of 2D joint detection. Existing methods
typically apply self-attention to all frames of the input sequence, causing a
huge computational burden when the frame number is increased to obtain advanced
estimation accuracy, and they are not robust to noise naturally brought by the
limited capability of 2D joint detectors. In this paper, we propose
PoseFormerV2, which exploits a compact representation of lengthy skeleton
sequences in the frequency domain to efficiently scale up the receptive field
and boost robustness to noisy 2D joint detection. With minimum modifications to
PoseFormer, the proposed method effectively fuses features both in the time
domain and frequency domain, enjoying a better speed-accuracy trade-off than
its precursor. Extensive experiments on two benchmark datasets (i.e., Human3.6M
and MPI-INF-3DHP) demonstrate that the proposed approach significantly
outperforms the original PoseFormer and other transformer-based variants. Code
is released at https://github.com/QitaoZhao/PoseFormerV2.
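To make the core idea concrete, here is a minimal sketch of the frequency-domain compression, assuming a (T, J, 2) array of detected 2D joints; the function names, the coefficient count, and the use of scipy are illustrative choices, not the authors' released implementation.

```python
# Minimal sketch (not the authors' code) of the core idea: a long 2D joint
# sequence is compressed into a few low-frequency DCT coefficients, which act
# as a compact, noise-robust summary of global motion. Names, shapes, and the
# coefficient count are illustrative assumptions.
import numpy as np
from scipy.fft import dct, idct

def compact_frequency_representation(seq_2d, n_coeffs=3):
    """seq_2d: (T, J, 2) array of 2D joints over T frames.
    Returns the first n_coeffs DCT coefficients along the time axis."""
    coeffs = dct(seq_2d, axis=0, norm="ortho")   # (T, J, 2), frequency domain
    return coeffs[:n_coeffs]                     # keep low frequencies only

def reconstruct(coeffs, T):
    """Inverse DCT from truncated coefficients back to a T-frame sequence;
    high-frequency detail (and much of the detector jitter) is discarded."""
    full = np.zeros((T,) + coeffs.shape[1:])
    full[: coeffs.shape[0]] = coeffs
    return idct(full, axis=0, norm="ortho")

T, J = 81, 17                                    # e.g., an 81-frame receptive field
seq = np.random.randn(T, J, 2)                   # stand-in for detected 2D joints
low_freq = compact_frequency_representation(seq) # (3, J, 2): 3 tokens cover 81 frames
smooth_seq = reconstruct(low_freq, T)            # low-pass-filtered motion
```

Since only a handful of frequency coefficients enter the transformer instead of all T frame tokens, the attention cost drops sharply, and discarding high frequencies doubles as a low-pass filter on detector jitter, which is consistent with the robustness claim above.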
Related papers
- Graph and Skipped Transformer: Exploiting Spatial and Temporal Modeling Capacities for Efficient 3D Human Pose Estimation [36.93661496405653]
We take a global approach that exploits spatio-temporal information with a concise Graph and Skipped Transformer architecture.
Specifically, in the 3D pose stage, coarse-grained body parts are deployed to construct a fully data-driven adaptive model.
Experiments are conducted on the Human3.6M, MPI-INF-3DHP and HumanEva benchmarks.
arXiv Detail & Related papers (2024-07-03T10:42:09Z)
- UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues [55.69339788566899]
UPose3D is a novel approach for multi-view 3D human pose estimation.
It improves robustness and flexibility without requiring direct 3D annotations.
arXiv Detail & Related papers (2024-04-23T00:18:00Z)
- SkelFormer: Markerless 3D Pose and Shape Estimation using Skeletal Transformers [57.46911575980854]
We introduce SkelFormer, a novel markerless motion capture pipeline for multi-view human pose and shape estimation.
Our method first uses off-the-shelf 2D keypoint estimators, pre-trained on large-scale in-the-wild data, to obtain 3D joint positions.
Next, we design a regression-based inverse-kinematic skeletal transformer that maps the joint positions to pose and shape representations from heavily noisy observations.
arXiv Detail & Related papers (2024-04-19T04:51:18Z)
- A Single 2D Pose with Context is Worth Hundreds for 3D Human Pose Estimation [18.72362803593654]
The dominant paradigm in 3D human pose estimation that lifts a 2D pose sequence to 3D heavily relies on long-term temporal clues.
This can be attributed to their inherent inability to perceive spatial context as plain 2D joint coordinates carry no visual cues.
We propose a straightforward yet powerful solution: leveraging the readily available intermediate visual representations produced by off-the-shelf (pre-trained) 2D pose detectors.
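One plausible realization of this idea, sketched below under stated assumptions (the paper's exact mechanism may differ), is to bilinearly sample the detector's intermediate feature map at each predicted joint location, so that every joint token carries visual context rather than bare coordinates.

```python
# Hedged sketch of the general idea (not the paper's code): enrich each 2D
# joint with visual context by sampling the pose detector's intermediate
# feature map at the joint's location. `feat` and the [-1, 1] coordinate
# normalization are assumptions for illustration.
import torch
import torch.nn.functional as F

def sample_joint_context(feat, joints_2d):
    """feat: (B, C, H, W) detector feature map.
    joints_2d: (B, J, 2) joint coords already normalized to [-1, 1].
    Returns (B, J, C + 2): per-joint visual feature plus raw coordinates."""
    grid = joints_2d.unsqueeze(2)                        # (B, J, 1, 2)
    ctx = F.grid_sample(feat, grid, align_corners=True)  # (B, C, J, 1)
    ctx = ctx.squeeze(-1).transpose(1, 2)                # (B, J, C)
    return torch.cat([ctx, joints_2d], dim=-1)           # coords keep geometry
```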
arXiv Detail & Related papers (2023-11-06T18:04:13Z)
- EVOPOSE: A Recursive Transformer For 3D Human Pose Estimation With Kinematic Structure Priors [72.33767389878473]
We propose a transformer-based model, EvoPose, to effectively introduce human body prior knowledge into 3D human pose estimation.
A Structural Priors Representation (SPR) module represents human priors as structural features carrying rich body patterns.
A Recursive Refinement (RR) module refines the 3D pose outputs by reusing the estimated results while simultaneously injecting human priors.
arXiv Detail & Related papers (2023-06-16T04:09:16Z)
- Kinematic-aware Hierarchical Attention Network for Human Pose Estimation in Videos [17.831839654593452]
Previous video-based human pose estimation methods have shown promising results by leveraging features of consecutive frames.
Most approaches compromise accuracy to reduce jitter and do not comprehend the temporal aspects of human motion.
We design an architecture that exploits kinematic keypoint features.
arXiv Detail & Related papers (2022-11-29T01:46:11Z)
- Uplift and Upsample: Efficient 3D Human Pose Estimation with Uplifting Transformers [28.586258731448687]
We present a Transformer-based pose uplifting scheme that can operate on temporally sparse 2D pose sequences.
We show how masked token modeling can be utilized for temporal upsampling within Transformer blocks; a sketch follows below.
We evaluate our method on two popular benchmark datasets: Human3.6M and MPI-INF-3DHP.
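As a rough illustration of masked token modeling for temporal upsampling (an assumption-level sketch, not the paper's architecture), the embeddings of unobserved frames can be replaced with a single learnable mask token before a temporal transformer, which then infers the missing frames from the sparse visible ones.

```python
# Minimal sketch of masked token modeling for temporal upsampling. All names,
# dimensions, and depths here are illustrative assumptions.
import torch
import torch.nn as nn

class MaskedUplift(nn.Module):
    def __init__(self, dim=128, n_frames=81, n_joints=17):
        super().__init__()
        self.embed = nn.Linear(n_joints * 2, dim)        # per-frame 2D pose -> token
        self.mask_token = nn.Parameter(torch.zeros(dim)) # stands in for unseen frames
        self.pos = nn.Parameter(torch.zeros(n_frames, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, n_joints * 3)         # token -> 3D pose

    def forward(self, poses_2d, visible):
        """poses_2d: (B, T, J*2); visible: (B, T) bool, True for sampled frames."""
        x = self.embed(poses_2d)
        # Replace embeddings of unobserved frames with the learnable mask token.
        x = torch.where(visible.unsqueeze(-1), x, self.mask_token.expand_as(x))
        x = self.temporal(x + self.pos)                  # fill in masked slots
        return self.head(x)                              # (B, T, J*3): all frames
```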
arXiv Detail & Related papers (2022-10-12T12:00:56Z)
- AdaptivePose++: A Powerful Single-Stage Network for Multi-Person Pose Regression [66.39539141222524]
We propose to represent the human parts as adaptive points and introduce a fine-grained body representation method.
With the proposed body representation, we deliver a compact single-stage multi-person pose regression network, termed AdaptivePose.
We employ AdaptivePose for both 2D and 3D multi-person pose estimation tasks to verify its effectiveness.
arXiv Detail & Related papers (2022-10-08T12:54:20Z)
- P-STMO: Pre-Trained Spatial Temporal Many-to-One Model for 3D Human Pose Estimation [78.83305967085413]
This paper introduces a novel Pre-trained Spatial Temporal Many-to-One (P-STMO) model for the 2D-to-3D human pose estimation task.
Our method outperforms state-of-the-art methods with fewer parameters and less computational overhead.
arXiv Detail & Related papers (2022-03-15T04:00:59Z)
- 3D Human Pose Estimation with Spatial and Temporal Transformers [59.433208652418976]
We present PoseFormer, a purely transformer-based approach for 3D human pose estimation in videos.
Inspired by recent developments in vision transformers, we design a spatial-temporal transformer structure (sketched below).
We quantitatively and qualitatively evaluate our method on two popular and standard benchmark datasets.
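The factorized design can be sketched as follows; dimensions, depths, and the pooling head are illustrative assumptions rather than PoseFormer's exact configuration. A spatial transformer first relates the joints within each frame, and a temporal transformer then relates the resulting per-frame tokens across time.

```python
# Rough sketch of the spatial-temporal factorization (illustrative, not the
# exact PoseFormer configuration): joints attend to each other within a frame,
# then per-frame tokens attend across time.
import torch
import torch.nn as nn

class SpatialTemporalLifter(nn.Module):
    def __init__(self, n_joints=17, n_frames=9, dim=32):
        super().__init__()
        self.joint_embed = nn.Linear(2, dim)             # (x, y) -> joint token
        spatial = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        temporal = nn.TransformerEncoderLayer(n_joints * dim, nhead=4, batch_first=True)
        self.spatial = nn.TransformerEncoder(spatial, num_layers=2)
        self.temporal = nn.TransformerEncoder(temporal, num_layers=2)
        self.head = nn.Linear(n_joints * dim, n_joints * 3)  # 3D pose regression

    def forward(self, x):                                # x: (B, T, J, 2)
        B, T, J, _ = x.shape
        tokens = self.joint_embed(x).flatten(0, 1)       # (B*T, J, dim)
        tokens = self.spatial(tokens)                    # spatial attention over joints
        frames = tokens.reshape(B, T, -1)                # one token per frame
        frames = self.temporal(frames)                   # temporal attention over frames
        return self.head(frames.mean(dim=1))             # (B, J*3) for the sequence
```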
arXiv Detail & Related papers (2021-03-18T18:14:37Z)
- JUMPS: Joints Upsampling Method for Pose Sequences [0.0]
We build on a deep generative model that combines a Generative Adversarial Network (GAN) and an encoder.
We show experimentally that the localization accuracy of the additional joints is on average on par with the original pose estimates.
arXiv Detail & Related papers (2020-07-02T14:36:40Z)