PhysHMR: Learning Humanoid Control Policies from Vision for Physically Plausible Human Motion Reconstruction
- URL: http://arxiv.org/abs/2510.02566v1
- Date: Thu, 02 Oct 2025 21:01:11 GMT
- Title: PhysHMR: Learning Humanoid Control Policies from Vision for Physically Plausible Human Motion Reconstruction
- Authors: Qiao Feng, Yiming Huang, Yufu Wang, Jiatao Gu, Lingjie Liu,
- Abstract summary: We present PhysHMR, a unified framework that learns a visual-to-action policy for humanoid control in a physics-based simulator.<n>A key component of our approach is the pixel-as-ray strategy, which lifts 2D keypoints into 3D spatial rays and transforms them into global space.<n>PhysHMR produces high-fidelity, physically plausible motion across diverse scenarios, outperforming prior approaches in both visual accuracy and physical realism.
- Score: 52.44375492811009
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Reconstructing physically plausible human motion from monocular videos remains a challenging problem in computer vision and graphics. Existing methods primarily focus on kinematics-based pose estimation, often leading to unrealistic results due to the lack of physical constraints. To address such artifacts, prior methods have typically relied on physics-based post-processing following the initial kinematics-based motion estimation. However, this two-stage design introduces error accumulation, ultimately limiting the overall reconstruction quality. In this paper, we present PhysHMR, a unified framework that directly learns a visual-to-action policy for humanoid control in a physics-based simulator, enabling motion reconstruction that is both physically grounded and visually aligned with the input video. A key component of our approach is the pixel-as-ray strategy, which lifts 2D keypoints into 3D spatial rays and transforms them into global space. These rays are incorporated as policy inputs, providing robust global pose guidance without depending on noisy 3D root predictions. This soft global grounding, combined with local visual features from a pretrained encoder, allows the policy to reason over both detailed pose and global positioning. To overcome the sample inefficiency of reinforcement learning, we further introduce a distillation scheme that transfers motion knowledge from a mocap-trained expert to the vision-conditioned policy, which is then refined using physically motivated reinforcement learning rewards. Extensive experiments demonstrate that PhysHMR produces high-fidelity, physically plausible motion across diverse scenarios, outperforming prior approaches in both visual accuracy and physical realism.
Related papers
- MeshMimic: Geometry-Aware Humanoid Motion Learning through 3D Scene Reconstruction [54.36564144414704]
MeshMimic is an innovative framework that bridges 3D scene reconstruction and embodied intelligence to enable humanoid robots to learn coupled "motion-terrain" interactions directly from video.<n>By leveraging state-of-the-art 3D vision models, our framework precisely segments and reconstructs both human trajectories and the underlying 3D geometry of terrains and objects.
arXiv Detail & Related papers (2026-02-17T17:09:45Z) - PhysRVG: Physics-Aware Unified Reinforcement Learning for Video Generative Models [100.65199317765608]
Physical principles are fundamental to realistic visual simulation, but remain a significant oversight in transformer-based video generation.<n>We introduce a physics-aware reinforcement learning paradigm for video generation models that enforces physical collision rules directly in high-dimensional spaces.<n>We extend this paradigm to a unified framework, termed Mimicry-Discovery Cycle (MDcycle), which allows substantial fine-tuning.
arXiv Detail & Related papers (2026-01-16T08:40:10Z) - PhysChoreo: Physics-Controllable Video Generation with Part-Aware Semantic Grounding [50.454084539837005]
PhysChoreo is a novel framework that can generate videos with diverse controllability and physical realism from a single image.<n>Our method consists of two stages: first, it estimates the static initial physical properties of all objects in the image through part-aware physical property reconstruction.<n>Then, through temporally instructed and physically editable simulation, it synthesizes high-quality videos with rich dynamic behaviors and physical realism.
arXiv Detail & Related papers (2025-11-25T17:59:04Z) - Half-Physics: Enabling Kinematic 3D Human Model with Physical Interactions [89.88331682333198]
We introduce a novel approach that embeds SMPL-X into a tangible entity capable of dynamic physical interactions with its surroundings.<n>Our approach maintains kinematic control over inherent SMPL-X poses while ensuring physically plausible interactions with scenes and objects.<n>Unlike reinforcement learning-based methods, which demand extensive and complex training, our half-physics method is learning-free and generalizes to any body shape and motion.
arXiv Detail & Related papers (2025-07-31T17:58:33Z) - Physics-based Human Pose Estimation from a Single Moving RGB Camera [47.50334809388003]
MoviCam is the first non-synthetic dataset containing ground-truth camera trajectories.<n> PhysDynPose is a physics-based method that incorporates scene geometry and physical constraints.<n>Our method robustly estimates both human and camera poses in world coordinates.
arXiv Detail & Related papers (2025-07-23T11:04:30Z) - Optimal-state Dynamics Estimation for Physics-based Human Motion Capture from Videos [6.093379844890164]
We propose a novel method to selectively incorporate the physics models with the kinematics observations in an online setting.<n>A recurrent neural network is introduced to realize a Kalman filter that attentively balances the kinematics input and simulated motion.<n>The proposed approach excels in the physics-based human pose estimation task and demonstrates the physical plausibility of the predictive dynamics.
arXiv Detail & Related papers (2024-10-10T10:24:59Z) - Trajectory Optimization for Physics-Based Reconstruction of 3d Human
Pose from Monocular Video [31.96672354594643]
We focus on the task of estimating a physically plausible articulated human motion from monocular video.
Existing approaches that do not consider physics often produce temporally inconsistent output with motion artifacts.
We show that our approach achieves competitive results with respect to existing physics-based methods on the Human3.6M benchmark.
arXiv Detail & Related papers (2022-05-24T18:02:49Z) - Deep Physics-aware Inference of Cloth Deformation for Monocular Human
Performance Capture [84.73946704272113]
We show how integrating physics into the training process improves the learned cloth deformations and allows modeling clothing as a separate piece of geometry.
Our approach leads to a significant improvement over current state-of-the-art methods and is thus a clear step towards realistic monocular capture of the entire deforming surface of a human clothed.
arXiv Detail & Related papers (2020-11-25T16:46:00Z) - Contact and Human Dynamics from Monocular Video [73.47466545178396]
Existing deep models predict 2D and 3D kinematic poses from video that are approximately accurate, but contain visible errors.
We present a physics-based method for inferring 3D human motion from video sequences that takes initial 2D and 3D pose estimates as input.
arXiv Detail & Related papers (2020-07-22T21:09:11Z) - Occlusion resistant learning of intuitive physics from videos [52.25308231683798]
Key ability for artificial systems is to understand physical interactions between objects, and predict future outcomes of a situation.
This ability, often referred to as intuitive physics, has recently received attention and several methods were proposed to learn these physical rules from video sequences.
arXiv Detail & Related papers (2020-04-30T19:35:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.