DuoMo: Dual Motion Diffusion for World-Space Human Reconstruction
- URL: http://arxiv.org/abs/2603.03265v1
- Date: Tue, 03 Mar 2026 18:54:17 GMT
- Title: DuoMo: Dual Motion Diffusion for World-Space Human Reconstruction
- Authors: Yufu Wang, Evonne Ng, Soyong Shin, Rawal Khirodkar, Yuan Dong, Zhaoen Su, Jinhyung Park, Kris Kitani, Alexander Richard, Fabian Prada, Michael Zollhofer
- Abstract summary: DuoMo is a generative method that recovers human motion in world-space coordinates from unconstrained videos with noisy or incomplete observations. Our approach addresses the problem by factorizing motion learning into two diffusion models. Together, the two models can reconstruct motion across diverse scenes and trajectories, even from highly noisy or incomplete observations.
- Score: 73.7305982336243
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present DuoMo, a generative method that recovers human motion in world-space coordinates from unconstrained videos with noisy or incomplete observations. Reconstructing such motion requires solving a fundamental trade-off: generalizing from diverse and noisy video inputs while maintaining global motion consistency. Our approach addresses this problem by factorizing motion learning into two diffusion models. The camera-space model first estimates motion from videos in camera coordinates. The world-space model then lifts this initial estimate into world coordinates and refines it to be globally consistent. Together, the two models can reconstruct motion across diverse scenes and trajectories, even from highly noisy or incomplete observations. Moreover, our formulation is general, generating the motion of mesh vertices directly and bypassing parametric models. DuoMo achieves state-of-the-art performance. On EMDB, our method obtains a 16% reduction in world-space reconstruction error while maintaining low foot skating. On RICH, it obtains a 30% reduction in world-space error. Project page: https://yufu-wang.github.io/duomo/
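The abstract describes a two-stage factorization: a camera-space model estimates motion from video, and a world-space model lifts and refines that estimate into globally consistent world coordinates. The following is a minimal structural sketch of that pipeline, not the authors' implementation; both model functions are hypothetical placeholders standing in for the actual diffusion models, and the shapes and transforms are illustrative assumptions only.

```python
import numpy as np


def camera_space_model(video_obs):
    # Placeholder for DuoMo's first diffusion model, which estimates
    # per-frame mesh-vertex motion in camera coordinates from video.
    # Here we simply fill missing (NaN) observations with zeros to
    # mimic handling noisy or incomplete input.
    return np.nan_to_num(video_obs)


def world_space_model(cam_motion, cam_to_world):
    # Placeholder for the second diffusion model, which lifts the
    # camera-space estimate into world coordinates and refines it for
    # global consistency. Here we only apply a rigid transform.
    rotation, translation = cam_to_world
    return cam_motion @ rotation.T + translation


# Toy input: 4 frames x 3 vertices x 3D, with one occluded vertex.
obs = np.random.default_rng(0).normal(size=(4, 3, 3))
obs[2, 1] = np.nan  # simulate a missing observation

cam_motion = camera_space_model(obs)
world_motion = world_space_model(
    cam_motion, (np.eye(3), np.array([0.0, 0.0, 1.0]))
)
print(world_motion.shape)  # (4, 3, 3)
```

The point of the sketch is the interface: the second stage consumes the first stage's output plus camera-to-world information, so each model can be trained on different data distributions while the composition still yields world-space motion.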
Related papers
- Mocap-2-to-3: Multi-view Lifting for Monocular Motion Recovery with 2D Pretraining [49.223455189395025]
Mocap-2-to-3 is a novel framework that performs multi-view lifting from monocular input. To leverage abundant 2D data, we decompose complex 3D motion into multi-view syntheses. Our method surpasses state-of-the-art approaches in both camera-space motion realism and world-grounded human positioning.
arXiv Detail & Related papers (2025-03-05T06:32:49Z)
- World-Grounded Human Motion Recovery via Gravity-View Coordinates [60.618543026949226]
We propose estimating human poses in a novel Gravity-View coordinate system.
The proposed GV system is naturally gravity-aligned and uniquely defined for each video frame.
Our method recovers more realistic motion in both the camera space and world-grounded settings, outperforming state-of-the-art methods in both accuracy and speed.
arXiv Detail & Related papers (2024-09-10T17:25:47Z)
- TRAM: Global Trajectory and Motion of 3D Humans from in-the-wild Videos [46.11545135199594]
TRAM robustifies SLAM to recover the camera motion in the presence of dynamic humans.
We introduce a video transformer model to regress the kinematic body motion of a human.
arXiv Detail & Related papers (2024-03-26T03:10:45Z)
- RoHM: Robust Human Motion Reconstruction via Diffusion [58.63706638272891]
RoHM is an approach for robust 3D human motion reconstruction from monocular RGB(-D) videos.
Conditioned on noisy and occluded input data, it reconstructs complete, plausible motions in consistent global coordinates.
Our method outperforms state-of-the-art approaches qualitatively and quantitatively, while being faster at test time.
arXiv Detail & Related papers (2024-01-16T18:57:50Z)
- GraMMaR: Ground-aware Motion Model for 3D Human Motion Reconstruction [61.833152949826946]
We propose a novel Ground-aware Motion Model for 3D Human Motion Reconstruction, named GraMMaR.
GraMMaR learns the distribution of transitions in both pose and interaction between every joint and ground plane at each time step of a motion sequence.
It is trained to explicitly promote consistency between the motion and distance change towards the ground.
arXiv Detail & Related papers (2023-06-29T07:22:20Z)
- Decoupling Human and Camera Motion from Videos in the Wild [67.39432972193929]
We propose a method to reconstruct global human trajectories from videos in the wild.
Our method decouples the camera and human motion, which allows us to place people in the same world coordinate frame.
arXiv Detail & Related papers (2023-02-24T18:59:15Z)