PriorFormer: A Transformer for Real-time Monocular 3D Human Pose Estimation with Versatile Geometric Priors
- URL: http://arxiv.org/abs/2508.18238v1
- Date: Thu, 21 Aug 2025 08:16:14 GMT
- Title: PriorFormer: A Transformer for Real-time Monocular 3D Human Pose Estimation with Versatile Geometric Priors
- Authors: Mohamed Adjel, Vincent Bonnet
- Abstract summary: This paper proposes a new lightweight Transformer-based lifter that maps short sequences of human 2D joint positions to 3D poses using a single camera. The proposed model takes as input geometric priors including segment lengths and camera intrinsics and is designed to operate in both calibrated and uncalibrated settings.
- Score: 1.4932318540666545
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a new lightweight Transformer-based lifter that maps short sequences of human 2D joint positions to 3D poses using a single camera. The proposed model takes geometric priors, including segment lengths and camera intrinsics, as input and is designed to operate in both calibrated and uncalibrated settings. To this end, a masking mechanism enables the model to ignore missing priors during training and inference. This yields a single versatile network that can adapt to different deployment scenarios, from fully calibrated lab environments to in-the-wild monocular videos without calibration. The model was trained on 3D keypoints from the AMASS dataset, with corresponding 2D synthetic data generated by sampling random camera poses and intrinsics. It was then compared to an expert model trained only on complete priors, and validated through an ablation study. Results show that both the camera and segment length priors improve performance, and that the versatile model outperforms the expert even when all priors are available, while maintaining high accuracy when priors are missing. Overall, the average 3D joint center position estimation error was as low as 36 mm, improving on the state of the art by half a centimeter at a much lower computational cost. Indeed, the proposed model runs in 380$\mu$s on GPU and 1800$\mu$s on CPU, making it suitable for deployment on embedded platforms and low-power devices.
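The abstract's central architectural idea is the prior-masking mechanism that lets one lifter consume any subset of the geometric priors. Below is a minimal PyTorch sketch of that idea; the token layout, module names, dimensions, and the use of `src_key_padding_mask` are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class PriorMaskedLifter(nn.Module):
    """Hypothetical sketch of a 2D-to-3D lifter with optional prior tokens.

    One token per input frame (flattened 2D joints), plus one token for the
    segment-length prior and one for the camera-intrinsics prior. When a
    prior is missing, its token is masked out of attention, so a single
    network covers both calibrated and uncalibrated settings.
    """

    def __init__(self, n_joints=17, seq_len=9, n_segments=16, d_model=128):
        super().__init__()
        self.kp_embed = nn.Linear(n_joints * 2, d_model)   # per-frame 2D joints
        self.seg_embed = nn.Linear(n_segments, d_model)    # segment lengths
        self.cam_embed = nn.Linear(4, d_model)             # fx, fy, cx, cy
        self.pos = nn.Parameter(torch.zeros(seq_len + 2, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                           dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, n_joints * 3)       # 3D pose of the centre frame

    def forward(self, kp2d, seg_len=None, intrinsics=None):
        B, T, J, _ = kp2d.shape
        frame_tok = self.kp_embed(kp2d.reshape(B, T, J * 2))             # (B, T, d)
        seg_tok = (self.seg_embed(seg_len) if seg_len is not None
                   else kp2d.new_zeros(B, self.seg_embed.out_features))  # placeholder
        cam_tok = (self.cam_embed(intrinsics) if intrinsics is not None
                   else kp2d.new_zeros(B, self.cam_embed.out_features))
        tokens = torch.cat([frame_tok, seg_tok[:, None], cam_tok[:, None]], dim=1)
        # Masking mechanism: tokens of missing priors cannot be attended to.
        pad = torch.zeros(B, T + 2, dtype=torch.bool, device=kp2d.device)
        pad[:, T] = seg_len is None
        pad[:, T + 1] = intrinsics is None
        feat = self.encoder(tokens + self.pos, src_key_padding_mask=pad)
        return self.head(feat[:, T // 2]).reshape(B, J, 3)


# Uncalibrated: both priors masked.  Calibrated: both priors provided.
model = PriorMaskedLifter()
uncal = model(torch.randn(2, 9, 17, 2))
calib = model(torch.randn(2, 9, 17, 2),
              seg_len=torch.rand(2, 16),
              intrinsics=torch.tensor([[1000., 1000., 640., 360.]] * 2))
print(uncal.shape, calib.shape)  # torch.Size([2, 17, 3]) twice
```

In this sketch each prior becomes one extra token; when a prior is unavailable its token is simply excluded from attention, so the same weights serve calibrated and uncalibrated inputs, mirroring the versatile-vs-expert comparison described in the abstract.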
Related papers
- UnPose: Uncertainty-Guided Diffusion Priors for Zero-Shot Pose Estimation [19.76147681894604]
UnPose is a framework for zero-shot, model-free 6D object pose estimation and reconstruction. It exploits 3D priors and uncertainty estimates from a pre-trained diffusion model. It significantly outperforms existing approaches in both 6D pose estimation accuracy and 3D reconstruction quality.
arXiv Detail & Related papers (2025-08-21T21:31:04Z)
- Toward a Real-Time Framework for Accurate Monocular 3D Human Pose Estimation with Geometric Priors [0.0]
We propose a framework that combines real-time 2D keypoint detection with geometry-aware 2D-to-3D lifting. We discuss how these ingredients can enable fast, personalized, and accurate 3D pose estimation from monocular images without requiring specialized hardware.
arXiv Detail & Related papers (2025-07-21T08:18:23Z)
- E3D-Bench: A Benchmark for End-to-End 3D Geometric Foundation Models [78.1674905950243]
We present the first comprehensive benchmark for 3D geometric foundation models (GFMs). GFMs directly predict dense 3D representations in a single feed-forward pass, eliminating the need for slow or unavailable precomputed camera parameters. We evaluate 16 state-of-the-art GFMs, revealing their strengths and limitations across tasks and domains. All code, evaluation scripts, and processed data will be publicly released to accelerate research in 3D spatial intelligence.
arXiv Detail & Related papers (2025-06-02T17:53:09Z)
- Online Test-time Adaptation for 3D Human Pose Estimation: A Practical Perspective with Estimated 2D Poses [40.21976058922288]
Online test-time adaptation for 3D human pose estimation targets video streams that differ from the training data. Existing methods assume ground-truth 2D poses for adaptation, but only estimated 2D poses are available in practice. This paper addresses adapting models to streaming videos with estimated 2D poses.
arXiv Detail & Related papers (2025-03-14T08:41:55Z)
- CameraHMR: Aligning People with Perspective [54.05758012879385]
We address the challenge of accurate 3D human pose and shape estimation from monocular images.
Existing training datasets containing real images with pseudo ground truth (pGT) use SMPLify to fit SMPL to sparse 2D joint locations.
We make two contributions that improve pGT accuracy.
arXiv Detail & Related papers (2024-11-12T19:12:12Z)
- GST: Precise 3D Human Body from a Single Image with Gaussian Splatting Transformers [23.96688843662126]
Reconstructing posed 3D human models from monocular images has important applications in the sports industry. We combine 3D human pose and shape estimation with 3D Gaussian Splatting (3DGS), a representation of the scene composed of a mixture of Gaussians. We show that this combination can achieve near real-time inference of 3D human models from a single image without expensive diffusion models or 3D point supervision.
arXiv Detail & Related papers (2024-09-06T11:34:24Z)
- Scene-Aware 3D Multi-Human Motion Capture from a Single Camera [83.06768487435818]
We consider the problem of estimating the 3D position of multiple humans in a scene as well as their body shape and articulation from a single RGB video recorded with a static camera.
We leverage recent advances in computer vision using large-scale pre-trained models for a variety of modalities, including 2D body joints, joint angles, normalized disparity maps, and human segmentation masks.
In particular, we estimate the scene depth and unique person scale from normalized disparity predictions using the 2D body joints and joint angles.
arXiv Detail & Related papers (2023-01-12T18:01:28Z)
- MetaPose: Fast 3D Pose from Multiple Views without 3D Supervision [72.5863451123577]
We show how to train a neural model that can perform accurate 3D pose and camera estimation.
Our method outperforms both classical bundle adjustment and weakly-supervised monocular 3D baselines.
arXiv Detail & Related papers (2021-08-10T18:39:56Z)
- Synthetic Training for Monocular Human Mesh Recovery [100.38109761268639]
This paper aims to estimate the 3D mesh of multiple body parts with large differences in scale from a single RGB image.
The main challenge is lacking training data that have complete 3D annotations of all body parts in 2D images.
We propose a depth-to-scale (D2S) projection to incorporate the depth difference into the projection function to derive per-joint scale variants.
arXiv Detail & Related papers (2020-10-27T03:31:35Z)
- Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image Synthesis [72.34794624243281]
We propose a self-supervised learning framework to disentangle variations from unlabeled video frames.
Our differentiable formalization, bridging the representation gap between the 3D pose and spatial part maps, allows us to operate on videos with diverse camera movements.
arXiv Detail & Related papers (2020-04-09T07:55:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.