CRISP: Contact-Guided Real2Sim from Monocular Video with Planar Scene Primitives
- URL: http://arxiv.org/abs/2512.14696v2
- Date: Sun, 21 Dec 2025 20:38:23 GMT
- Title: CRISP: Contact-Guided Real2Sim from Monocular Video with Planar Scene Primitives
- Authors: Zihan Wang, Jiashun Wang, Jeff Tan, Yiwen Zhao, Jessica Hodgins, Shubham Tulsiani, Deva Ramanan
- Abstract summary: We introduce CRISP, a method that recovers simulatable human motion and scene geometry from monocular video. Our approach reduces motion tracking failure rates from 55.2% to 6.9% on human-centric video benchmarks. This demonstrates CRISP's ability to generate physically-valid human motion and interaction environments at scale.
- Score: 65.89192712575797
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce CRISP, a method that recovers simulatable human motion and scene geometry from monocular video. Prior work on joint human-scene reconstruction relies on data-driven priors and joint optimization with no physics in the loop, or recovers noisy geometry with artifacts that cause motion tracking policies with scene interactions to fail. In contrast, our key insight is to recover convex, clean, and simulation-ready geometry by fitting planar primitives to a point cloud reconstruction of the scene, via a simple clustering pipeline over depth, normals, and flow. To reconstruct scene geometry that might be occluded during interactions, we make use of human-scene contact modeling (e.g., we use human posture to reconstruct the occluded seat of a chair). Finally, we ensure that human and scene reconstructions are physically-plausible by using them to drive a humanoid controller via reinforcement learning. Our approach reduces motion tracking failure rates from 55.2% to 6.9% on human-centric video benchmarks (EMDB, PROX), while delivering a 43% faster RL simulation throughput. We further validate it on in-the-wild videos including casually-captured videos, Internet videos, and even Sora-generated videos. This demonstrates CRISP's ability to generate physically-valid human motion and interaction environments at scale, greatly advancing real-to-sim applications for robotics and AR/VR.
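The abstract's core recipe, clustering a scene point cloud and fitting a plane to each cluster, can be sketched compactly. The snippet below is a minimal illustration under assumptions of mine (normal-plus-position features, DBSCAN, SVD plane fits, placeholder thresholds); CRISP's actual pipeline also clusters over depth and flow cues, which this sketch omits.

```python
# Minimal sketch, not the authors' code: fit planar primitives to a
# point cloud by clustering on per-point normals (plus position), then
# fitting a plane to each cluster by SVD. Feature scaling, DBSCAN
# parameters, and the k-NN size are illustrative assumptions.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors


def estimate_normals(points: np.ndarray, k: int = 16) -> np.ndarray:
    """Per-point normals via PCA over each point's k nearest neighbors."""
    _, idx = NearestNeighbors(n_neighbors=k).fit(points).kneighbors(points)
    normals = np.empty_like(points)
    for i, neigh in enumerate(idx):
        local = points[neigh] - points[neigh].mean(axis=0)
        # Direction of least variance in the neighborhood = surface normal.
        n = np.linalg.svd(local, full_matrices=False)[2][-1]
        normals[i] = n if n[2] >= 0 else -n  # canonical sign for clustering
    return normals


def fit_planar_primitives(points: np.ndarray):
    """Cluster on (scaled position, normal); fit one plane per cluster."""
    feats = np.hstack([0.5 * points, estimate_normals(points)])
    labels = DBSCAN(eps=0.3, min_samples=30).fit_predict(feats)
    planes = []
    for lbl in np.unique(labels):
        if lbl == -1:                        # DBSCAN noise label
            continue
        cluster = points[labels == lbl]
        centroid = cluster.mean(axis=0)
        n = np.linalg.svd(cluster - centroid, full_matrices=False)[2][-1]
        planes.append((n, -n @ centroid))    # plane: n @ x + d = 0
    return planes
```

Each recovered plane can then be extruded into a convex box primitive, which is the property the abstract emphasizes: clean, convex, simulation-ready geometry rather than noisy meshes.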
Related papers
- ArtHOI: Articulated Human-Object Interaction Synthesis by 4D Reconstruction from Video Priors [51.06020148149403]
We introduce ArtHOI, the first zero-shot framework for articulated human-object interaction synthesis via 4D reconstruction from video priors. ArtHOI bridges video-based generation and geometry-aware reconstruction, producing interactions that are both semantically aligned and physically grounded.
arXiv Detail & Related papers (2026-03-04T17:58:04Z) - EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents [85.77432303199176]
We propose EmbodMocap, a portable and affordable data collection pipeline using two moving iPhones. Our key idea is to jointly calibrate dual RGB-D sequences to reconstruct both humans and scenes. Based on the collected data, we empower three embodied AI tasks: monocular human-scene reconstruction, where we fine-tune feedforward models that output metric-scale, world-space aligned humans and scenes; physics-based character animation, where we show our data can be used to scale human-object interaction skills and scene-aware motion tracking; and robot motion control, where we train a humanoid robot via …
arXiv Detail & Related papers (2026-02-26T16:53:41Z) - MeshMimic: Geometry-Aware Humanoid Motion Learning through 3D Scene Reconstruction [54.36564144414704]
MeshMimic is an innovative framework that bridges 3D scene reconstruction and embodied intelligence to enable humanoid robots to learn coupled "motion-terrain" interactions directly from video. By leveraging state-of-the-art 3D vision models, our framework precisely segments and reconstructs both human trajectories and the underlying 3D geometry of terrains and objects.
arXiv Detail & Related papers (2026-02-17T17:09:45Z) - From Generated Human Videos to Physically Plausible Robot Trajectories [103.28274349461607]
Video generation models are rapidly improving in their ability to synthesize human actions in novel contexts. To realize this potential, how can a humanoid execute the human actions from generated videos in a zero-shot manner? This challenge arises because generated videos are often noisy and exhibit morphological distortions that make direct imitation difficult compared to real video. We propose GenMimic, a physics-aware reinforcement learning policy conditioned on 3D keypoints and trained with symmetry regularization and keypoint-weighted tracking rewards.
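As a concrete illustration of one ingredient named above, a keypoint-weighted tracking reward can be written in a few lines. This is a hedged sketch: the functional form, weights, and temperature below are my assumptions, not values from the GenMimic paper.

```python
# Sketch of a keypoint-weighted tracking reward: exponentiated,
# importance-weighted mean squared error between simulated and
# reference 3D keypoints. Weights and alpha are placeholder choices.
import numpy as np


def keypoint_tracking_reward(sim_kps: np.ndarray,
                             ref_kps: np.ndarray,
                             weights: np.ndarray,
                             alpha: float = 10.0) -> float:
    """sim_kps, ref_kps: (K, 3) keypoints; weights: (K,) importance,
    e.g. upweighting end-effectors that contact the scene."""
    err = np.sum((sim_kps - ref_kps) ** 2, axis=-1)        # (K,) sq. errors
    return float(np.exp(-alpha * np.average(err, weights=weights)))
```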
arXiv Detail & Related papers (2025-12-04T18:56:03Z) - SHARE: Scene-Human Aligned Reconstruction [10.764401463569442]
We introduce Scene-Human Aligned REconstruction (SHARE), a technique that leverages the scene geometry's inherent spatial cues to accurately ground human motion reconstruction. It iteratively refines the human's position by comparing the human mesh against a human point map extracted from the scene using the mask. Our approach enables more accurate 3D human placement while reconstructing the surrounding scene, facilitating use cases on both curated datasets and in-the-wild web videos.
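One plausible reading of this refinement step (an assumption of mine, not SHARE's published method) is a translation-only alignment between the human mesh and the masked human point map, iterated until the two agree:

```python
# Hypothetical sketch of mesh-to-pointmap position refinement: shift the
# whole human mesh by the damped mean offset between each masked scene
# point and its nearest mesh vertex. Iteration count and step size are
# illustrative.
import numpy as np


def refine_human_position(mesh_verts: np.ndarray,
                          human_pointmap: np.ndarray,
                          iters: int = 10,
                          step: float = 0.5) -> np.ndarray:
    """mesh_verts: (N, 3); human_pointmap: (M, 3) scene points under the
    person mask. Returns translated mesh vertices."""
    verts = mesh_verts.copy()
    for _ in range(iters):
        dists = np.linalg.norm(human_pointmap[:, None] - verts[None], axis=-1)
        nearest = verts[dists.argmin(axis=1)]              # (M, 3)
        verts += step * (human_pointmap - nearest).mean(axis=0)
    return verts
```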
arXiv Detail & Related papers (2025-10-17T06:12:10Z) - HumanRAM: Feed-forward Human Reconstruction and Animation Model using Transformers [60.86393841247567]
HumanRAM is a novel feed-forward approach for generalizable human reconstruction and animation from monocular or sparse human images. Our approach integrates human reconstruction and animation into a unified framework by introducing explicit pose conditions. Experiments show that HumanRAM significantly surpasses previous methods in terms of reconstruction accuracy, animation fidelity, and generalization performance on real-world datasets.
arXiv Detail & Related papers (2025-06-03T17:50:05Z) - Joint Optimization for 4D Human-Scene Reconstruction in the Wild [59.322951972876716]
We propose JOSH, a novel optimization-based method for 4D human-scene reconstruction in the wild from monocular videos. Experimental results show that JOSH achieves better results on both global human motion estimation and dense scene reconstruction. We further design a more efficient model, JOSH3R, and directly train it with pseudo-labels from web videos.
arXiv Detail & Related papers (2025-01-04T01:53:51Z) - Physics-based Scene Layout Generation from Human Motion [21.939444709132395]
We present a physics-based approach that simultaneously optimizes a scene layout generator and simulates a moving human in a physics simulator.
We use reinforcement learning to perform a dual optimization of both the character motion imitation controller and the scene layout generator; a toy sketch of this loop follows below.
We evaluate our method using motions from SAMP and PROX, and demonstrate physically plausible scene layout reconstruction compared with the previous kinematics-based method.
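The dual optimization mentioned above can be illustrated with a runnable toy (my construction, not the paper's system): a one-parameter "layout" is sampled from a learnable Gaussian, a stand-in imitation reward peaks when the layout fits the motion, and a REINFORCE-style update moves the generator toward layouts the controller can succeed in.

```python
# Toy dual-optimization loop: the "layout" is a chair's x-position, the
# "imitation reward" is a synthetic stand-in for RL tracking success,
# and the layout generator is a Gaussian updated by REINFORCE. All
# numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 0.5          # layout generator: Gaussian over chair x
target = 1.2                  # layout the (fixed) sit-down motion fits


def imitation_reward(x: float) -> float:
    # Stand-in for "the imitation controller succeeded": peaks at target.
    return float(np.exp(-(x - target) ** 2))


for _ in range(2000):
    x = rng.normal(mu, sigma)                 # propose a layout
    r = imitation_reward(x)                   # run the (toy) imitation
    mu += 0.05 * r * (x - mu) / sigma ** 2    # REINFORCE update on the mean

print(f"learned chair position: {mu:.2f} (target {target})")
```

In the real system both sides learn: the imitation controller is itself an RL policy whose reward depends on the current layout, so the two optimizations are interleaved rather than one-sided as in this toy.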
arXiv Detail & Related papers (2024-05-21T02:36:37Z) - SimEndoGS: Efficient Data-driven Scene Simulation using Robotic Surgery Videos via Physics-embedded 3D Gaussians [19.590481146949685]
We introduce 3D Gaussians as a learnable representation for surgical scenes, learned from stereo endoscopic video.
We apply the Material Point Method, which is integrated with physical properties, to the 3D Gaussians to achieve realistic scene deformations.
Results show that it can reconstruct and simulate surgical scenes from endoscopic videos efficiently, taking only a few minutes to reconstruct the surgical scene.
arXiv Detail & Related papers (2024-05-02T02:34:19Z) - Learning Motion Priors for 4D Human Body Capture in 3D Scenes [81.54377747405812]
We propose LEMO: LEarning human MOtion priors for 4D human body capture.
We introduce a novel motion prior, which reduces the jitter exhibited by poses recovered over a sequence (see the sketch below).
We also design a contact friction term and a contact-aware motion infiller obtained via per-instance self-supervised training.
With our pipeline, we demonstrate high-quality 4D human body capture, reconstructing smooth motions and physically plausible body-scene interactions.
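LEMO's prior is learned from data, but the jitter it targets can be made concrete with a simple fixed penalty (a sketch under my assumptions, not LEMO's learned prior): the mean squared second temporal difference of the pose sequence.

```python
# Sketch of a jitter measure: mean squared second temporal difference
# (discrete acceleration) of a pose sequence. A learned prior like
# LEMO's plays a similar smoothing role, but is trained from data.
import numpy as np


def jitter_energy(poses: np.ndarray) -> float:
    """poses: (T, D), one D-dim pose vector per frame. Minimizing this
    term penalizes high-frequency jitter in the recovered motion."""
    accel = poses[2:] - 2.0 * poses[1:-1] + poses[:-2]     # (T-2, D)
    return float(np.mean(accel ** 2))
```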
arXiv Detail & Related papers (2021-08-23T20:47:09Z)