Related papers: FastDDHPose: Towards Unified, Efficient, and Disentangled 3D Human Pose Estimation

URL: http://arxiv.org/abs/2512.14162v1
Date: Tue, 16 Dec 2025 07:47:06 GMT
Title: FastDDHPose: Towards Unified, Efficient, and Disentangled 3D Human Pose Estimation
Authors: Qingyuan Cai, Linxin Zhang, Xuecai Hu, Saihui Hou, Yongzhen Huang
Abstract summary: We propose Fast3DHPE, a modular framework that facilitates rapid reproduction and flexible development of new methods. By standardizing training and evaluation protocols, Fast3DHPE enables fair comparison across 3D human pose estimation methods. Within this framework, we introduce FastDDHPose, a Disentangled Diffusion-based 3D Human Pose Estimation method.
Score: 32.94049816382114
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent approaches for monocular 3D human pose estimation (3D HPE) have achieved leading performance by directly regressing 3D poses from 2D keypoint sequences. Despite the rapid progress in 3D HPE, existing methods are typically trained and evaluated under disparate frameworks, lacking a unified framework for fair comparison. To address these limitations, we propose Fast3DHPE, a modular framework that facilitates rapid reproduction and flexible development of new methods. By standardizing training and evaluation protocols, Fast3DHPE enables fair comparison across 3D human pose estimation methods while significantly improving training efficiency. Within this framework, we introduce FastDDHPose, a Disentangled Diffusion-based 3D Human Pose Estimation method which leverages the strong latent distribution modeling capability of diffusion models to explicitly model the distributions of bone length and bone direction while avoiding further amplification of hierarchical error accumulation. Moreover, we design an efficient Kinematic-Hierarchical Spatial and Temporal Denoiser that encourages the model to focus on kinematic joint hierarchies while avoiding unnecessary modeling of overly complex joint topologies. Extensive experiments on Human3.6M and MPI-INF-3DHP show that the Fast3DHPE framework enables fair comparison of all methods while significantly improving training efficiency. Within this unified framework, FastDDHPose achieves state-of-the-art performance with strong generalization and robustness in in-the-wild scenarios. The framework and models will be released at: https://github.com/Andyen512/Fast3DHPE
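The bone-length and bone-direction disentanglement mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation and involves no diffusion model: the 5-joint toy skeleton, the `PARENTS` table, and the function names `disentangle` and `reconstruct` are all hypothetical, chosen only to show how a 3D pose splits into per-bone lengths and unit directions and how it is rebuilt from them.

```python
import math

# Hypothetical toy skeleton (an assumption, not from the paper):
# PARENTS[j] is the parent joint of joint j, and -1 marks the root.
# Parents must appear before their children in index order.
PARENTS = [-1, 0, 1, 0, 3]


def disentangle(joints, parents):
    """Split 3D joint positions into per-bone lengths and unit directions."""
    lengths, directions = [], []
    for j, p in enumerate(parents):
        if p < 0:
            continue  # the root joint has no incoming bone
        bone = [a - b for a, b in zip(joints[j], joints[p])]
        length = math.sqrt(sum(c * c for c in bone))
        lengths.append(length)
        # Guard against zero-length bones to avoid division by zero.
        directions.append([c / length for c in bone] if length > 0 else [0.0, 0.0, 0.0])
    return lengths, directions


def reconstruct(root, lengths, directions, parents):
    """Rebuild joint positions by walking the kinematic tree from the root."""
    joints = [None] * len(parents)
    bone = 0
    for j, p in enumerate(parents):
        if p < 0:
            joints[j] = list(root)
            continue
        # Child position = parent position + bone length * unit direction.
        joints[j] = [pc + lengths[bone] * dc
                     for pc, dc in zip(joints[p], directions[bone])]
        bone += 1
    return joints


pose = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 2.0, 0.0],
        [1.0, 0.0, 0.0], [1.0, 0.0, 2.0]]
bone_lengths, bone_directions = disentangle(pose, PARENTS)
rebuilt = reconstruct(pose[0], bone_lengths, bone_directions, PARENTS)
```

Because each child position in `reconstruct` depends on its parent's position, an error in one bone propagates to every joint further down the chain, which is one way to read the hierarchical error accumulation that the abstract refers to.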

Related papers

PnP-U3D: Plug-and-Play 3D Framework Bridging Autoregression and Diffusion for Unified Understanding and Generation [45.72473673810981]
We present the first unified framework for 3D understanding and generation that combines autoregression with diffusion. A lightweight transformer bridges the feature space of large language models and the conditional space of 3D diffusion models. Our framework achieves state-of-the-art performance across diverse 3D understanding and generation benchmarks, while also excelling in 3D editing tasks.
arXiv Detail & Related papers (2026-02-03T13:49:23Z)
Efficient Diffusion-Based 3D Human Pose Estimation with Hierarchical Temporal Pruning [34.116532190562815]
We propose an Efficient Diffusion-Based 3D Human Pose Estimation framework with a Hierarchical Temporal Pruning (HTP) strategy. HTP prunes redundant pose tokens across both frame and semantic levels while preserving critical motion dynamics. Experiments on Human3.6M and MPI-INF-3DHP show that HTP reduces training MACs by 38.5%, inference MACs by 56.8%, and improves inference speed by an average of 81.1% compared to prior diffusion-based methods.
arXiv Detail & Related papers (2025-08-29T07:08:07Z)
HyperDiff: Hypergraph Guided Diffusion Model for 3D Human Pose Estimation [15.321095223060768]
This paper introduces a novel 3D pose estimation method, HyperDiff, which integrates diffusion models with HyperGCN. Results demonstrate that HyperDiff achieves state-of-the-art performance on the Human3.6M and MPI-INF-3DHP datasets.
arXiv Detail & Related papers (2025-08-20T05:03:55Z)
UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues [55.69339788566899]
UPose3D is a novel approach for multi-view 3D human pose estimation. It improves robustness and flexibility without requiring direct 3D annotations.
arXiv Detail & Related papers (2024-04-23T00:18:00Z)
DiffHPE: Robust, Coherent 3D Human Pose Lifting with Diffusion [54.0238087499699]
We show that diffusion models enhance the accuracy, robustness, and coherence of human pose estimations. We introduce DiffHPE, a novel strategy for harnessing diffusion models in 3D-HPE. Our findings indicate that while standalone diffusion models provide commendable performance, their accuracy is even better in combination with supervised models.
arXiv Detail & Related papers (2023-09-04T12:54:10Z)
Unsupervised 3D Pose Estimation with Non-Rigid Structure-from-Motion Modeling [83.76377808476039]
We propose a new modeling method for human pose deformations and design an accompanying diffusion-based motion prior. Inspired by the field of non-rigid structure-from-motion, we divide the task of reconstructing 3D human skeletons in motion into the estimation of a 3D reference skeleton and a per-frame skeleton deformation. A mixed spatial-temporal NRSfMformer is used to simultaneously estimate the 3D reference skeleton and the skeleton deformation of each frame from a sequence of 2D observations.
arXiv Detail & Related papers (2023-08-18T16:41:57Z)
Gait Recognition in the Wild with Multi-hop Temporal Switch [81.35245014397759]
Gait recognition in the wild is a more practical problem that has attracted the attention of the multimedia and computer vision communities. This paper presents a novel multi-hop temporal switch method to achieve effective temporal modeling of gait patterns in real-world scenes.
arXiv Detail & Related papers (2022-09-01T10:46:09Z)
Learned Vertex Descent: A New Direction for 3D Human Model Fitting [64.04726230507258]
We propose a novel optimization-based paradigm for 3D human model fitting on images and scans. Our approach is able to capture the underlying body of clothed people with very different body shapes, achieving a significant improvement over the state of the art. LVD is also applicable to 3D model fitting of humans and hands, for which we show a significant improvement over the SOTA with a much simpler and faster method.
arXiv Detail & Related papers (2022-05-12T17:55:51Z)
Distribution-Aware Single-Stage Models for Multi-Person 3D Pose Estimation [29.430404703883084]
We present a novel Distribution-Aware Single-stage (DAS) model for tackling the challenging multi-person 3D pose estimation problem. The proposed DAS model simultaneously localizes person positions and their corresponding body joints in the 3D camera space in a one-pass manner. Comprehensive experiments on benchmarks CMU Panoptic and MuPoTS-3D demonstrate the superior efficiency of the proposed DAS model.
arXiv Detail & Related papers (2022-03-15T07:30:27Z)
A Real-time Action Representation with Temporal Encoding and Deep Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation. T-C3D learns video action representations in a hierarchical multi-granularity manner while achieving a high processing speed. On the UCF101 action recognition benchmark, our method improves accuracy by 5.4% over state-of-the-art real-time methods and runs 2 times faster at inference, with a model that requires less than 5 MB of storage.
arXiv Detail & Related papers (2020-06-17T06:30:43Z)

This list is automatically generated from the titles and abstracts of the papers on this site.