Related papers: DreamPose3D: Hallucinative Diffusion with Prompt Learning for 3D Human Pose Estimation

DreamPose3D: Hallucinative Diffusion with Prompt Learning for 3D Human Pose Estimation

URL: http://arxiv.org/abs/2511.09502v1
Date: Thu, 13 Nov 2025 01:58:37 GMT
Title: DreamPose3D: Hallucinative Diffusion with Prompt Learning for 3D Human Pose Estimation
Authors: Jerrin Bright, Yuhao Chen, John S. Zelek,
Abstract summary: We introduce DreamPose3D, a diffusion-based framework that combines action-aware reasoning with temporal imagination for 3D pose estimation.<n>DreamPose3D dynamically conditions the denoising process using task-relevant action prompts extracted from 2D pose sequences, capturing high-level intent.<n>To model the structural relationships between joints effectively, we introduce a representation encoder that incorporates kinematic joint affinity into the attention mechanism.<n>A hallucinative pose decoder predicts temporally coherent 3D pose during training, simulating how humans mentally reconstruct motion trajectories to resolve ambiguity in perception.
Score: 18.240580699213197
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Accurate 3D human pose estimation remains a critical yet unresolved challenge, requiring both temporal coherence across frames and fine-grained modeling of joint relationships. However, most existing methods rely solely on geometric cues and predict each 3D pose independently, which limits their ability to resolve ambiguous motions and generalize to real-world scenarios. Inspired by how humans understand and anticipate motion, we introduce DreamPose3D, a diffusion-based framework that combines action-aware reasoning with temporal imagination for 3D pose estimation. DreamPose3D dynamically conditions the denoising process using task-relevant action prompts extracted from 2D pose sequences, capturing high-level intent. To model the structural relationships between joints effectively, we introduce a representation encoder that incorporates kinematic joint affinity into the attention mechanism. Finally, a hallucinative pose decoder predicts temporally coherent 3D pose sequences during training, simulating how humans mentally reconstruct motion trajectories to resolve ambiguity in perception. Extensive experiments on benchmarked Human3.6M and MPI-3DHP datasets demonstrate state-of-the-art performance across all metrics. To further validate DreamPose3D's robustness, we tested it on a broadcast baseball dataset, where it demonstrated strong performance despite ambiguous and noisy 2D inputs, effectively handling temporal consistency and intent-driven motion variations.

Related papers

StarPose: 3D Human Pose Estimation via Spatial-Temporal Autoregressive Diffusion [29.682018018059043]
StarPose is an autoregressive diffusion framework for 3D human pose estimation.<n>It incorporates historical 3D pose predictions and spatial-temporal physical guidance.<n>It achieves superior accuracy and temporal consistency in 3D human pose estimation.
arXiv Detail & Related papers (2025-08-04T04:50:05Z)
PoseSyn: Synthesizing Diverse 3D Pose Data from In-the-Wild 2D Data [1.264462543503282]
PoseSyn is a novel data synthesis framework that transforms abundant in the wild 2D pose dataset into diverse 3D pose image pairs.<n>By generating realistic 3D training data via a human animation model aligned with challenging poses and appearances PoseSyn boosts the accuracy of various 3D pose estimators by up to 14%.
arXiv Detail & Related papers (2025-03-17T10:28:35Z)
Lifting Motion to the 3D World via 2D Diffusion [19.64801640086107]
We introduce MVLift, a novel approach to predict global 3D motion using only 2D pose sequences for training.<n> MVLift generalizes across various domains, including human poses, human-object interactions, and animal poses.
arXiv Detail & Related papers (2024-11-27T23:26:56Z)
UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues [55.69339788566899]
UPose3D is a novel approach for multi-view 3D human pose estimation. It improves robustness and flexibility without requiring direct 3D annotations.
arXiv Detail & Related papers (2024-04-23T00:18:00Z)
Unsupervised 3D Pose Estimation with Non-Rigid Structure-from-Motion Modeling [83.76377808476039]
We propose a new modeling method for human pose deformations and design an accompanying diffusion-based motion prior. Inspired by the field of non-rigid structure-from-motion, we divide the task of reconstructing 3D human skeletons in motion into the estimation of a 3D reference skeleton. A mixed spatial-temporal NRSfMformer is used to simultaneously estimate the 3D reference skeleton and the skeleton deformation of each frame from 2D observations sequence.
arXiv Detail & Related papers (2023-08-18T16:41:57Z)
HuMoR: 3D Human Motion Model for Robust Pose Estimation [100.55369985297797]
HuMoR is a 3D Human Motion Model for Robust Estimation of temporal pose and shape. We introduce a conditional variational autoencoder, which learns a distribution of the change in pose at each step of a motion sequence. We demonstrate that our model generalizes to diverse motions and body shapes after training on a large motion capture dataset.
arXiv Detail & Related papers (2021-05-10T21:04:55Z)
Multi-Scale Networks for 3D Human Pose Estimation with Inference Stage Optimization [33.02708860641971]
Estimating 3D human poses from a monocular video is still a challenging task. Many existing methods drop when the target person is cluded by other objects, or the motion is too fast/slow relative to the scale and speed of the training data. We introduce atemporal-temporal network for robust 3D human pose estimation.
arXiv Detail & Related papers (2020-10-13T15:24:28Z)
Unsupervised Cross-Modal Alignment for Multi-Person 3D Pose Estimation [52.94078950641959]
We present a deployment friendly, fast bottom-up framework for multi-person 3D human pose estimation. We adopt a novel neural representation of multi-person 3D pose which unifies the position of person instances with their corresponding 3D pose representation. We propose a practical deployment paradigm where paired 2D or 3D pose annotations are unavailable.
arXiv Detail & Related papers (2020-08-04T07:54:25Z)
Kinematic-Structure-Preserved Representation for Unsupervised 3D Human Pose Estimation [58.72192168935338]
Generalizability of human pose estimation models developed using supervision on large-scale in-studio datasets remains questionable. We propose a novel kinematic-structure-preserved unsupervised 3D pose estimation framework, which is not restrained by any paired or unpaired weak supervisions. Our proposed model employs three consecutive differentiable transformations named as forward-kinematics, camera-projection and spatial-map transformation.
arXiv Detail & Related papers (2020-06-24T23:56:33Z)
Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image Synthesis [72.34794624243281]
We propose a self-supervised learning framework to disentangle variations from unlabeled video frames. Our differentiable formalization, bridging the representation gap between the 3D pose and spatial part maps, allows us to operate on videos with diverse camera movements.
arXiv Detail & Related papers (2020-04-09T07:55:01Z)
A Graph Attention Spatio-temporal Convolutional Network for 3D Human Pose Estimation in Video [7.647599484103065]
We improve the learning of constraints in human skeleton by modeling local global spatial information via attention mechanisms. Our approach effectively mitigates depth ambiguity and self-occlusion, generalizes to half upper body estimation, and achieves competitive performance on 2D-to-3D video pose estimation.
arXiv Detail & Related papers (2020-03-11T14:54:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.