RoboMirror: Understand Before You Imitate for Video to Humanoid Locomotion
- URL: http://arxiv.org/abs/2512.23649v3
- Date: Sun, 04 Jan 2026 07:45:19 GMT
- Title: RoboMirror: Understand Before You Imitate for Video to Humanoid Locomotion
- Authors: Zhe Li, Cheng Chi, Boan Zhu, Yangyang Wei, Shuanghao Bai, Yuheng Ji, Yibo Peng, Tao Huang, Pengwei Wang, Zhongyuan Wang, S. -H. Gary Chan, Chang Xu, Shanghang Zhang
- Abstract summary: State-of-the-art humanoid locomotion systems rely on either curated motion capture trajectories or text commands. We propose RoboMirror, the first retargeting-free video-to-locomotion framework embodying "understand before you imitate".
- Score: 59.51253426975907
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans learn locomotion through visual observation, interpreting visual content first before imitating actions. However, state-of-the-art humanoid locomotion systems rely on either curated motion capture trajectories or sparse text commands, leaving a critical gap between visual understanding and control. Text-to-motion methods suffer from semantic sparsity and staged pipeline errors, while video-based approaches perform only mechanical pose mimicry without genuine visual understanding. We propose RoboMirror, the first retargeting-free video-to-locomotion framework embodying "understand before you imitate". Leveraging VLMs, it distills raw egocentric/third-person videos into visual motion intents, which directly condition a diffusion-based policy to generate physically plausible, semantically aligned locomotion without explicit pose reconstruction or retargeting. Extensive experiments validate the effectiveness of RoboMirror: it enables telepresence via egocentric videos, reduces third-person control latency by 80%, and achieves a 3.7% higher task success rate than baselines. By reframing humanoid control around video understanding, we bridge the gap between visual understanding and action.
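The abstract's core mechanism is a diffusion-based policy conditioned directly on VLM-derived visual motion intents. The paper's code is not reproduced here; the minimal sketch below shows one plausible shape of such an intent-conditioned denoiser. All names and dimensions (`IntentConditionedDenoiser`, `intent_dim=512`, a 23-DoF action space) are illustrative assumptions, not RoboMirror's actual architecture.

```python
# Minimal sketch (not RoboMirror's released code): a denoiser for a
# diffusion policy whose noise prediction is conditioned on a pooled
# "visual motion intent" embedding, e.g. VLM features of the video.
import torch
import torch.nn as nn

class IntentConditionedDenoiser(nn.Module):
    def __init__(self, action_dim=23, horizon=16, intent_dim=512, hidden=256):
        super().__init__()
        self.intent_proj = nn.Linear(intent_dim, hidden)  # intent conditioning
        self.time_emb = nn.Embedding(1000, hidden)        # diffusion timestep
        self.net = nn.Sequential(
            nn.Linear(action_dim * horizon + 2 * hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, action_dim * horizon),
        )
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, noisy_actions, t, intent_emb):
        # noisy_actions: (B, horizon, action_dim); intent_emb: (B, intent_dim)
        b = noisy_actions.shape[0]
        h = torch.cat([noisy_actions.reshape(b, -1),
                       self.intent_proj(intent_emb),
                       self.time_emb(t)], dim=-1)
        return self.net(h).reshape(b, self.horizon, self.action_dim)

# Usage: predict the noise added to a chunk of joint-space actions.
denoiser = IntentConditionedDenoiser()
eps_hat = denoiser(torch.randn(4, 16, 23),
                   torch.randint(0, 1000, (4,)),
                   torch.randn(4, 512))
print(eps_hat.shape)  # torch.Size([4, 16, 23])
```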
Related papers
- Mitty: Diffusion-based Human-to-Robot Video Generation [57.494785199352975]
We present Mitty, a Diffusion Transformer that enables video in-context learning for end-to-end Human2Robot video generation. Built on a pretrained video diffusion model, Mitty leverages strong visual-temporal priors to translate human demonstrations into robot-execution videos without action labels or intermediate abstractions. Experiments on Human2Robot and EPIC-Kitchens show that Mitty delivers state-of-the-art results, strong generalization to unseen environments, and new insights for scalable robot learning from human observations.
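As a rough illustration of the video in-context learning idea, the toy sketch below prepends encoded human-demo tokens to robot-scene tokens and lets a Transformer predict the robot-side tokens. This is a hypothetical stand-in, not Mitty's Diffusion Transformer; all names and dimensions are assumptions.

```python
# Hypothetical stand-in for video in-context learning (not Mitty's
# model): human-demo tokens are prepended to robot-scene tokens and a
# Transformer predicts features for the robot-execution frames.
import torch
import torch.nn as nn

dim, n_demo, n_robot = 256, 64, 64
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=2)
head = nn.Linear(dim, dim)                   # maps to assumed video latents

demo_tokens = torch.randn(2, n_demo, dim)    # encoded human demonstration
robot_tokens = torch.randn(2, n_robot, dim)  # encoded robot scene context
ctx = torch.cat([demo_tokens, robot_tokens], dim=1)
pred = head(encoder(ctx))[:, n_demo:]        # outputs for robot positions
print(pred.shape)                            # torch.Size([2, 64, 256])
```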
arXiv Detail & Related papers (2025-12-19T05:52:15Z) - MimicDreamer: Aligning Human and Robot Demonstrations for Scalable VLA Training [40.45924128424013]
We propose MimicDreamer, a framework that turns low-cost human demonstrations into robot-usable supervision. For visual alignment, we propose H2R Aligner, a video diffusion model that generates high-fidelity robot demonstration videos. For viewpoint stabilization, we propose EgoStabilizer, which canonicalizes egocentric videos via homography. For action alignment, we map human hand trajectories to the robot frame and apply a constrained inverse kinematics solver.
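The homography step is the most self-contained of the three components. Below is a generic sketch of homography-based viewpoint canonicalization using standard OpenCV calls (ORB matching, RANSAC `findHomography`, `warpPerspective`); the paper's EgoStabilizer may estimate and apply the homography differently.

```python
# Generic homography-based stabilization (illustrative; the paper's
# EgoStabilizer may differ): match ORB features between a frame and a
# reference view, fit a homography with RANSAC, and warp the frame.
import cv2
import numpy as np

def stabilize(frame, reference):
    """Warp a grayscale frame into the reference camera view."""
    orb = cv2.ORB_create(1000)
    k1, d1 = orb.detectAndCompute(frame, None)
    k2, d2 = orb.detectAndCompute(reference, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(d1, d2), key=lambda m: m.distance)[:200]
    src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    h, w = reference.shape[:2]
    return cv2.warpPerspective(frame, H, (w, h))
```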
arXiv Detail & Related papers (2025-09-26T11:05:10Z) - Object-centric 3D Motion Field for Robot Learning from Human Videos [56.9436352861611]
We propose to use an object-centric 3D motion field to represent actions for robot learning from human videos. We present a novel framework for extracting this representation from videos for zero-shot control. Experiments show that our method reduces 3D motion estimation error by over 50% compared to the latest method.
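To make the representation concrete: if the same object points are tracked in 3D across consecutive frames, the motion field is simply their per-point displacement, and a least-squares rigid fit (Kabsch) converts it into an SE(3) object motion. The sketch below is a generic illustration of that idea, not the paper's extraction pipeline.

```python
# Illustrative only (not the paper's pipeline): with the same object
# points tracked at t and t+1, the 3D motion field is the per-point
# displacement, and a Kabsch fit recovers the rigid object motion.
import numpy as np

def rigid_fit(p, q):
    """Least-squares R, t such that q ~ p @ R.T + t (rows are points)."""
    cp, cq = p.mean(0), q.mean(0)
    U, _, Vt = np.linalg.svd((q - cq).T @ (p - cp))
    R = U @ Vt
    if np.linalg.det(R) < 0:   # guard against a reflection solution
        U[:, -1] *= -1
        R = U @ Vt
    return R, cq - R @ cp

theta = 0.3                    # synthetic ground-truth object motion
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.05, 0.0, 0.02])
pts_t = np.random.rand(200, 3)
pts_t1 = pts_t @ R_true.T + t_true
motion_field = pts_t1 - pts_t  # object-centric 3D motion field
R, t = rigid_fit(pts_t, pts_t1)
print(np.allclose(R, R_true), np.allclose(t, t_true))  # True True
```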
arXiv Detail & Related papers (2025-06-04T17:59:06Z) - Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos [101.26467307473638]
We introduce Moto, which converts video content into latent Motion Token sequences via a Latent Motion Tokenizer. We pre-train Moto-GPT through motion token autoregression, enabling it to capture diverse visual motion knowledge. To transfer learned motion priors to real robot actions, we implement a co-fine-tuning strategy that seamlessly bridges latent motion token prediction and real robot control.
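A toy sketch of the latent-motion-token idea follows: quantize per-step motion latents against a codebook, then train a predictor on the resulting token indices. The stand-in "LM" here sees only the current token (no causal attention), so it is a deliberately minimal proxy for Moto-GPT; codebook size and dimensions are assumptions.

```python
# Toy stand-in (not Moto's code): quantize per-step motion latents
# against a codebook, then train a next-token predictor on the indices.
# This "LM" sees only the current token -- no causal attention.
import torch
import torch.nn as nn

codebook = nn.Embedding(512, 64)                 # assumed 512 motion tokens

def tokenize(latents):                           # latents: (B, T, 64)
    dists = (latents.unsqueeze(-2) - codebook.weight).pow(2).sum(-1)
    return dists.argmin(-1)                      # (B, T) nearest-code ids

lm = nn.Sequential(nn.Embedding(512, 64), nn.Linear(64, 512))
tokens = tokenize(torch.randn(2, 10, 64))
logits = lm(tokens[:, :-1])                      # predict the next token
loss = nn.functional.cross_entropy(logits.reshape(-1, 512),
                                   tokens[:, 1:].reshape(-1))
print(tokens.shape, float(loss) > 0)             # torch.Size([2, 10]) True
```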
arXiv Detail & Related papers (2024-12-05T18:57:04Z) - Universal Humanoid Motion Representations for Physics-Based Control [71.46142106079292]
We present a universal motion representation that encompasses a comprehensive range of motor skills for physics-based humanoid control.
We first learn a motion imitator that can imitate all human motion from a large, unstructured motion dataset.
We then create our motion representation by distilling skills directly from the imitator.
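A minimal sketch of the distillation step, under the assumption that it resembles standard policy distillation: an encoder maps the observation to a skill latent z, and a latent-conditioned decoder is trained to reproduce the imitator's actions. All modules below are illustrative stand-ins, not the paper's architecture.

```python
# Illustrative distillation loop (assumed to resemble standard policy
# distillation; not the paper's architecture): an encoder produces a
# skill latent z, and a z-conditioned decoder matches the imitator.
import torch
import torch.nn as nn

obs_dim, act_dim, z_dim = 64, 23, 32
teacher = nn.Linear(obs_dim, act_dim)            # stand-in motion imitator
encoder = nn.Linear(obs_dim, z_dim)              # observation -> skill latent
decoder = nn.Linear(obs_dim + z_dim, act_dim)    # latent-conditioned policy
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], 1e-3)

for _ in range(100):
    obs = torch.randn(256, obs_dim)
    with torch.no_grad():
        target = teacher(obs)                    # teacher's action
    z = encoder(obs)
    loss = ((decoder(torch.cat([obs, z], -1)) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```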
arXiv Detail & Related papers (2023-10-06T20:48:43Z) - Learning Object Manipulation Skills from Video via Approximate Differentiable Physics [27.923004421974156]
We teach robots to perform simple object manipulation tasks by watching a single video demonstration.
A differentiable scene ensures perceptual fidelity between the 3D scene and the 2D video.
We evaluate our approach on a 3D reconstruction task that consists of 54 video demonstrations.
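To illustrate why differentiability matters here: if the physics rollout is written in an autodiff framework, gradients flow from an observed outcome back to physical parameters. The toy example below (not the paper's simulator) fits an initial velocity so that an Euler-integrated point mass reaches a target position.

```python
# Toy differentiable-physics example (not the paper's simulator): an
# Euler-integrated point mass is differentiable end to end, so we can
# fit the initial velocity that lands the object at an observed target.
import torch

target = torch.tensor([1.0, 0.0, 0.3])       # observed final position
gravity = torch.tensor([0.0, 0.0, -9.8])
v0 = torch.zeros(3, requires_grad=True)      # unknown initial velocity
opt = torch.optim.Adam([v0], lr=0.1)

for _ in range(200):
    pos, vel = torch.zeros(3), v0
    for _ in range(20):                      # 20 Euler steps, dt = 0.05 s
        vel = vel + gravity * 0.05
        pos = pos + vel * 0.05
    loss = ((pos - target) ** 2).sum()       # fit simulation to observation
    opt.zero_grad(); loss.backward(); opt.step()

print(pos.detach())                          # close to target after fitting
```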
arXiv Detail & Related papers (2022-08-03T10:21:47Z) - Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation [15.632809977544907]
Learning to solve precision-based manipulation tasks from visual feedback could drastically reduce the engineering efforts required by traditional robot systems.
We propose a setting for robotic manipulation in which the agent receives visual feedback from both a third-person camera and an egocentric camera mounted on the robot's wrist.
To fuse visual information from both cameras effectively, we additionally propose to use Transformers with a cross-view attention mechanism.
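A minimal version of cross-view attention can be written directly with `torch.nn.MultiheadAttention`: tokens from one camera act as queries over the other camera's tokens. Dimensions and token counts below are illustrative, not the paper's configuration.

```python
# Minimal cross-view attention (dimensions illustrative, not the
# paper's exact design): third-person tokens query egocentric tokens,
# so the fused features mix information from both cameras.
import torch
import torch.nn as nn

dim = 128
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

third_person = torch.randn(8, 49, dim)  # e.g. 7x7 feature map as tokens
egocentric = torch.randn(8, 49, dim)    # wrist-camera feature tokens

# Queries come from one view; keys/values come from the other.
fused, attn_weights = attn(query=third_person,
                           key=egocentric, value=egocentric)
print(fused.shape)                      # torch.Size([8, 49, 128])
```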
arXiv Detail & Related papers (2022-01-19T18:39:03Z) - High-Fidelity Neural Human Motion Transfer from Monocular Video [71.75576402562247]
Video-based human motion transfer creates video animations of humans following a source motion.
We present a new framework which performs high-fidelity and temporally-consistent human motion transfer with natural pose-dependent non-rigid deformations.
In the experimental results, we significantly outperform the state-of-the-art in terms of video realism.
arXiv Detail & Related papers (2020-12-20T16:54:38Z) - Understanding Action Sequences based on Video Captioning for Learning-from-Observation [14.467714234267307]
We propose a Learning-from-Observation framework that splits and understands a video of a human demonstration with verbal instructions to extract accurate action sequences.
The splitting is done based on local minimum points of the hand velocity, which align human daily-life actions with object-centered face contact transitions.
We match the motion descriptions with the verbal instructions to understand the correct human intent and ignore the unintended actions inside the video.
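The splitting rule itself is easy to illustrate: detect local minima of hand speed and cut the video there. The sketch below does this with `scipy.signal.find_peaks` on the negated speed; the synthetic trajectory and the `distance` threshold are placeholders, not values from the paper.

```python
# Sketch of the stated splitting rule: cut the demonstration at local
# minima of hand speed. The synthetic trajectory and the `distance`
# threshold below are placeholders, not values from the paper.
import numpy as np
from scipy.signal import find_peaks

hand_pos = np.cumsum(np.random.randn(300, 3) * 0.01, axis=0)   # fake track
speed = np.linalg.norm(np.diff(hand_pos, axis=0), axis=1)      # per-frame
minima, _ = find_peaks(-speed, distance=20)  # local minima of the speed
segments = np.split(np.arange(len(speed)), minima)
print(f"{len(segments)} candidate action segments, cuts at {minima}")
```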
arXiv Detail & Related papers (2020-12-09T05:22:01Z)