EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents
- URL: http://arxiv.org/abs/2602.23205v1
- Date: Thu, 26 Feb 2026 16:53:41 GMT
- Title: EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents
- Authors: Wenjia Wang, Liang Pan, Huaijin Pi, Yuke Lou, Xuqian Ren, Yifan Wu, Zhouyingcheng Liao, Lei Yang, Rishabh Dabral, Christian Theobalt, Taku Komura
- Abstract summary: We propose EmbodMocap, a portable and affordable data collection pipeline using two moving iPhones. Our key idea is to jointly calibrate dual RGB-D sequences to reconstruct both humans and scenes. Based on the collected data, we empower three embodied AI tasks: monocular human-scene reconstruction, where we fine-tune feedforward models that output metric-scale, world-space-aligned humans and scenes; physics-based character animation, where we show that our data can be used to scale human-object interaction skills and scene-aware motion tracking; and robot motion control, where we train a humanoid robot via sim-to-real RL to replicate human motions depicted in videos.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human behaviors in the real world naturally encode rich, long-term contextual information that can be leveraged to train embodied agents for perception, understanding, and acting. However, existing capture systems typically rely on costly studio setups and wearable devices, limiting the large-scale collection of scene-conditioned human motion data in the wild. To address this, we propose EmbodMocap, a portable and affordable data collection pipeline using two moving iPhones. Our key idea is to jointly calibrate dual RGB-D sequences to reconstruct both humans and scenes within a unified metric world coordinate frame. The proposed method allows metric-scale, scene-consistent capture in everyday environments without static cameras or markers, bridging human motion and scene geometry seamlessly. Compared with optical-capture ground truth, we demonstrate that the dual-view setting markedly mitigates depth ambiguity, achieving superior alignment and reconstruction performance over single-iPhone or monocular models. Based on the collected data, we empower three embodied AI tasks: monocular human-scene reconstruction, where we fine-tune feedforward models that output metric-scale, world-space-aligned humans and scenes; physics-based character animation, where we show that our data can be used to scale human-object interaction skills and scene-aware motion tracking; and robot motion control, where we train a humanoid robot via sim-to-real RL to replicate human motions depicted in videos. Experimental results validate the effectiveness of our pipeline and its contributions towards advancing embodied AI research.
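To make the joint-calibration idea concrete, below is a minimal sketch of the basic geometric step a dual-view pipeline of this kind relies on: rigidly registering metric 3D points back-projected from one phone's depth map into the other phone's coordinate frame. This is an illustration only, assuming known point correspondences; it is not the paper's actual calibration procedure, which operates jointly over two moving RGB-D sequences. The function name `rigid_align` and the synthetic data are hypothetical.

```python
# Minimal sketch (not the authors' code) of the geometric step behind joint
# dual RGB-D calibration: estimate the rigid transform that maps 3D points
# back-projected from one iPhone's depth map into the other iPhone's frame,
# so humans and scene end up in a single metric world coordinate system.
# Correspondences between the two views are assumed to be given here.
import numpy as np

def rigid_align(src: np.ndarray, dst: np.ndarray):
    """Kabsch/Procrustes: find R, t such that dst ~ src @ R.T + t (N x 3 points)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)            # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

# Hypothetical usage: matched metric 3D points from the two phones at one frame.
rng = np.random.default_rng(0)
pts_a = rng.random((200, 3))                       # points in camera A's frame
R_true = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
pts_b = pts_a @ R_true.T + np.array([0.5, 0.0, 1.2])
R, t = rigid_align(pts_a, pts_b)
print(np.allclose(pts_a @ R.T + t, pts_b))         # True: both views share one frame
```

Even in this toy setting the value of a second depth view is visible: once both views are registered in one metric frame, scale and relative pose are fixed by geometry rather than inferred from a single ambiguous view.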
Related papers
- MeshMimic: Geometry-Aware Humanoid Motion Learning through 3D Scene Reconstruction [54.36564144414704]
MeshMimic is an innovative framework that bridges 3D scene reconstruction and embodied intelligence to enable humanoid robots to learn coupled "motion-terrain" interactions directly from video. By leveraging state-of-the-art 3D vision models, our framework precisely segments and reconstructs both human trajectories and the underlying 3D geometry of terrains and objects.
arXiv Detail & Related papers (2026-02-17T17:09:45Z)
- CRISP: Contact-Guided Real2Sim from Monocular Video with Planar Scene Primitives [65.89192712575797]
We introduce CRISP, a method that recovers simulatable human motion and scene geometry from monocular video. Our approach reduces motion tracking failure rates from 55.2% to 6.9% on human-centric video benchmarks. This demonstrates CRISP's ability to generate physically valid human motion and interaction environments at scale.
arXiv Detail & Related papers (2025-12-16T18:59:50Z)
- From Generated Human Videos to Physically Plausible Robot Trajectories [103.28274349461607]
Video generation models are rapidly improving in their ability to synthesize human actions in novel contexts. To realize this potential, how can a humanoid execute the human actions from generated videos in a zero-shot manner? This challenge arises because generated videos are often noisy and exhibit morphological distortions that make direct imitation difficult compared to real video. We propose GenMimic, a physics-aware reinforcement learning policy conditioned on 3D keypoints and trained with symmetry regularization and keypoint-weighted tracking rewards.
arXiv Detail & Related papers (2025-12-04T18:56:03Z)
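As a side note on the "keypoint-weighted tracking rewards" mentioned above: in motion-imitation RL such rewards are commonly an exponentiated, weighted keypoint error between the simulated character and the reference motion. The sketch below shows that general shape only; the weights, scale parameter, and keypoint indices are illustrative assumptions, not GenMimic's actual reward.

```python
# Toy illustration (not GenMimic's implementation) of a keypoint-weighted
# tracking reward as commonly used in motion-imitation RL: penalize the
# distance between the simulated character's 3D keypoints and the reference
# keypoints, with per-keypoint weights emphasizing e.g. hands and feet.
import numpy as np

def keypoint_tracking_reward(sim_kps, ref_kps, weights, sigma=0.25):
    """sim_kps, ref_kps: (K, 3) keypoints in metres; weights: (K,) importance."""
    err = np.linalg.norm(sim_kps - ref_kps, axis=-1)        # per-keypoint error
    weighted = np.sum(weights * err**2) / np.sum(weights)   # weighted mean sq. error
    return float(np.exp(-weighted / sigma**2))              # in (0, 1], 1 = perfect

# Hypothetical usage with 17 COCO-style keypoints, upweighting wrists and ankles.
K = 17
weights = np.ones(K)
weights[[9, 10, 15, 16]] = 3.0                              # wrist/ankle indices (assumed)
ref = np.random.rand(K, 3)
sim = ref + 0.02 * np.random.randn(K, 3)                    # small tracking error
print(keypoint_tracking_reward(sim, ref, weights))          # close to 1.0
```

A common design choice is to upweight end-effectors, since hand and foot placement strongly affects how plausible the imitation looks and how well contacts are reproduced.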
- Revisit Human-Scene Interaction via Space Occupancy [55.67657438543008]
Human-Scene Interaction (HSI) generation is a challenging task that is crucial for various downstream applications.
In this work, we argue that interaction with a scene is essentially interacting with the space occupancy of the scene from an abstract physical perspective.
By treating pure motion sequences as records of humans interacting with invisible scene occupancy, we can aggregate motion-only data into a large-scale paired human-occupancy interaction database.
arXiv Detail & Related papers (2023-12-05T12:03:00Z)
- Synthesizing Diverse Human Motions in 3D Indoor Scenes [16.948649870341782]
We present a novel method for populating 3D indoor scenes with virtual humans that can navigate in the environment and interact with objects in a realistic manner.
Existing approaches rely on training sequences that contain captured human motions and the 3D scenes they interact with.
We propose a reinforcement learning-based approach that enables virtual humans to navigate in 3D scenes and interact with objects realistically and autonomously.
arXiv Detail & Related papers (2023-05-21T09:22:24Z)
- CIRCLE: Capture In Rich Contextual Environments [69.97976304918149]
We propose a novel motion acquisition system in which the actor perceives and operates in a highly contextual virtual world.
We present CIRCLE, a dataset containing 10 hours of full-body reaching motion from five subjects across nine scenes.
We use this dataset to train a model that generates human motion conditioned on scene information.
arXiv Detail & Related papers (2023-03-31T09:18:12Z)