ZEST: Zero-shot Embodied Skill Transfer for Athletic Robot Control
- URL: http://arxiv.org/abs/2602.00401v1
- Date: Fri, 30 Jan 2026 23:35:02 GMT
- Title: ZEST: Zero-shot Embodied Skill Transfer for Athletic Robot Control
- Authors: Jean Pierre Sleiman, He Li, Alphonsus Adu-Bredu, Robin Deits, Arun Kumar, Kevin Bergamin, Mohak Bhardwaj, Scott Biddlestone, Nicola Burger, Matthew A. Estrada, Francesco Iacobelli, Twan Koolen, Alexander Lambert, Erica Lin, M. Eva Mungai, Zach Nobles, Shane Rozen-Levy, Yuyao Shi, Jiashun Wang, Jakob Welner, Fangzhou Yu, Mike Zhang, Alfred Rizzi, Jessica Hodgins, Sylvain Bertrand, Yeuhi Abe, Scott Kuindersma, Farbod Farshidian
- Abstract summary: We introduce ZEST, a streamlined motion-imitation framework that trains policies via reinforcement learning from diverse sources. ZEST generalizes across behaviors and platforms while avoiding contact labels, reference or observation windows, state estimators, and extensive reward shaping. On Boston Dynamics' Atlas humanoid, ZEST learns dynamic, multi-contact skills (e.g., army crawl, breakdancing) from motion capture.
- Score: 37.4764082674475
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Achieving robust, human-like whole-body control on humanoid robots for agile, contact-rich behaviors remains a central challenge, demanding heavy per-skill engineering and a brittle process of tuning controllers. We introduce ZEST (Zero-shot Embodied Skill Transfer), a streamlined motion-imitation framework that trains policies via reinforcement learning from diverse sources -- high-fidelity motion capture, noisy monocular video, and non-physics-constrained animation -- and deploys them to hardware zero-shot. ZEST generalizes across behaviors and platforms while avoiding contact labels, reference or observation windows, state estimators, and extensive reward shaping. Its training pipeline combines adaptive sampling, which focuses training on difficult motion segments, and an automatic curriculum using a model-based assistive wrench, together enabling dynamic, long-horizon maneuvers. We further provide a procedure for selecting joint-level gains from approximate analytical armature values for closed-chain actuators, along with a refined model of actuators. Trained entirely in simulation with moderate domain randomization, ZEST demonstrates remarkable generality. On Boston Dynamics' Atlas humanoid, ZEST learns dynamic, multi-contact skills (e.g., army crawl, breakdancing) from motion capture. It transfers expressive dance and scene-interaction skills, such as box-climbing, directly from videos to Atlas and the Unitree G1. Furthermore, it extends across morphologies to the Spot quadruped, enabling acrobatics, such as a continuous backflip, through animation. Together, these results demonstrate robust zero-shot deployment across heterogeneous data sources and embodiments, establishing ZEST as a scalable interface between biological movements and their robotic counterparts.
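The abstract describes adaptive sampling only at a high level. As a rough, non-authoritative sketch of what error-weighted selection of reference-motion segments could look like (the class name, the exponential-moving-average error statistic, and the softmax weighting are assumptions, not the paper's implementation):

```python
import numpy as np

class AdaptiveSegmentSampler:
    """Hypothetical sketch: focus RL training on reference-motion segments
    that are currently tracked poorly, in the spirit of ZEST's adaptive
    sampling. The error statistic, smoothing, and temperature are assumptions."""

    def __init__(self, num_segments: int, smoothing: float = 0.9, temperature: float = 1.0):
        self.errors = np.ones(num_segments)   # running tracking error per segment
        self.smoothing = smoothing
        self.temperature = temperature

    def sample(self, rng: np.random.Generator) -> int:
        # Segments with larger recent tracking error are sampled more often.
        logits = self.errors / self.temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return int(rng.choice(len(self.errors), p=probs))

    def update(self, segment: int, episode_tracking_error: float) -> None:
        # Exponential moving average keeps the sampling distribution from oscillating.
        self.errors[segment] = (self.smoothing * self.errors[segment]
                                + (1.0 - self.smoothing) * episode_tracking_error)

# Usage: pick a segment for the next rollout, then report its tracking error.
rng = np.random.default_rng(0)
sampler = AdaptiveSegmentSampler(num_segments=32)
seg = sampler.sample(rng)
# ... run an episode starting at `seg`, measure tracking error, then:
sampler.update(seg, episode_tracking_error=0.17)
```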
Related papers
- UniAct: Unified Motion Generation and Action Streaming for Humanoid Robots [27.794309591475326]
A long-standing objective in humanoid robotics is the realization of versatile agents capable of following diverse multimodal instructions with human-level flexibility. Here we show that UniAct, a two-stage framework integrating a fine-tuned MLLM with a causal streaming pipeline, enables humanoid robots to execute multimodal instructions with sub-500 ms latency. This approach yields a 19% improvement in the success rate of zero-shot tracking of imperfect reference motions.
arXiv Detail & Related papers (2025-12-30T16:20:13Z) - FreeAction: Training-Free Techniques for Enhanced Fidelity of Trajectory-to-Video Generation [50.39748673817223]
We introduce two training-free, inference-time techniques that fully exploit explicit action parameters in robot video generation. First, action-scaled classifier-free guidance dynamically modulates guidance strength in proportion to action magnitude, enhancing controllability over motion intensity. Second, action-scaled noise truncation adjusts the distribution of initially sampled noise to better align with the desired motion dynamics.
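The action-scaled guidance idea lends itself to a short sketch. The following assumes a standard classifier-free guidance update whose weight grows linearly with the commanded action magnitude; the denoiser interface, the linear scaling rule, and the constants are placeholders rather than the paper's code.

```python
import torch

def action_scaled_cfg(denoiser, x_t, t, action, base_scale=2.0, gain=1.0):
    """Hypothetical sketch of action-scaled classifier-free guidance:
    larger commanded actions receive a stronger guidance weight."""
    eps_cond = denoiser(x_t, t, action)    # action-conditioned noise prediction
    eps_uncond = denoiser(x_t, t, None)    # unconditional noise prediction
    act_mag = action.flatten(1).norm(dim=-1)                  # (B,)
    scale = base_scale + gain * act_mag                       # grows with |action|
    scale = scale.view(-1, *([1] * (x_t.dim() - 1)))          # broadcast over video dims
    return eps_uncond + scale * (eps_cond - eps_uncond)
```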
arXiv Detail & Related papers (2025-09-29T03:30:40Z) - RobotDancing: Residual-Action Reinforcement Learning Enables Robust Long-Horizon Humanoid Motion Tracking [50.200035833530876]
RobotDancing is a simple, scalable framework that predicts residual joint targets to explicitly correct dynamics discrepancies. It can track multi-minute, high-energy behaviors (jumps, spins, cartwheels) and deploys zero-shot to hardware with high motion tracking quality.
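A minimal sketch of residual-action tracking, under the assumption that the policy outputs a bounded correction added to retargeted reference joint positions before low-level PD control (the scale, bounds, and joint count are illustrative, not the paper's values):

```python
import numpy as np

def residual_joint_targets(q_ref, policy_residual, residual_scale=0.2):
    """Hypothetical sketch: add a small, bounded learned correction to the
    reference joint positions; the corrected targets go to PD control."""
    correction = residual_scale * np.tanh(policy_residual)   # keep corrections small and bounded
    return q_ref + correction

# Example: reference pose from a retargeted clip plus a raw policy output.
q_ref = np.zeros(29)             # reference joint positions for this timestep
residual = np.random.randn(29)   # raw policy output
q_target = residual_joint_targets(q_ref, residual)
```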
arXiv Detail & Related papers (2025-09-25T03:30:34Z) - KungfuBot2: Learning Versatile Motion Skills for Humanoid Whole-Body Control [30.738592041595933]
We present VMS, a unified whole-body controller that enables humanoid robots to learn diverse and dynamic behaviors within a single policy. Our framework integrates a hybrid tracking objective that balances local motion fidelity with global trajectory consistency. We validate VMS extensively in both simulation and real-world experiments, demonstrating accurate imitation of dynamic skills, stable performance over minute-long sequences, and strong generalization to unseen motions.
arXiv Detail & Related papers (2025-09-20T11:31:14Z) - Physical Autoregressive Model for Robotic Manipulation without Action Pretraining [65.8971623698511]
We build upon autoregressive video generation models to propose a Physical Autoregressive Model (PAR). PAR leverages the world knowledge embedded in video pretraining to understand physical dynamics without requiring action pretraining. Experiments on the ManiSkill benchmark show that PAR achieves a 100% success rate on the PushCube task.
arXiv Detail & Related papers (2025-08-13T13:54:51Z) - Learning to Move in Rhythm: Task-Conditioned Motion Policies with Orbital Stability Guarantees [45.137864140049814]
We introduce Orbitally Stable Motion Primitives (OSMPs) - a framework that combines a learned diffeomorphic encoder with a supercritical Hopf bifurcation in latent space. We validate the proposed approach through extensive simulation and real-world experiments across a diverse range of robotic platforms.
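The latent dynamics can be illustrated with the standard supercritical Hopf normal form, which for mu > 0 converges to a stable limit cycle of radius sqrt(mu); this is the orbital-stability property the framework relies on. The sketch below shows only the 2-D latent oscillator and omits the learned encoder that maps the orbit to joint space (step size and parameters are illustrative).

```python
import numpy as np

def hopf_step(z, mu=1.0, omega=2.0 * np.pi, dt=0.01):
    """One Euler step of the supercritical Hopf normal form in Cartesian
    coordinates; every nonzero state converges to the same periodic orbit."""
    x, y = z
    r2 = x * x + y * y
    dx = (mu - r2) * x - omega * y
    dy = (mu - r2) * y + omega * x
    return np.array([x + dt * dx, y + dt * dy])

z = np.array([0.1, 0.0])
for _ in range(2000):
    z = hopf_step(z)
print(np.linalg.norm(z))   # approaches sqrt(mu) = 1.0
```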
arXiv Detail & Related papers (2025-07-12T17:10:03Z) - Learning Humanoid Standing-up Control across Diverse Postures [27.79222176982376]
Standing-up control is crucial for humanoid robots, with the potential for integration into current locomotion and loco-manipulation systems. We present HoST (Humanoid Standing-up Control), a reinforcement learning framework that learns standing-up control from scratch. Our experimental results demonstrate that the controllers achieve smooth, stable, and robust standing-up motions across a wide range of laboratory and outdoor environments.
arXiv Detail & Related papers (2025-02-12T13:10:09Z) - Pre-Trained Video Generative Models as World Simulators [59.546627730477454]
We propose Dynamic World Simulation (DWS) to transform pre-trained video generative models into controllable world simulators. To achieve precise alignment between conditioned actions and generated visual changes, we introduce a lightweight, universal action-conditioned module. Experiments demonstrate that DWS can be applied flexibly to both diffusion and autoregressive transformer models.
arXiv Detail & Related papers (2025-02-10T14:49:09Z) - An Adaptable Approach to Learn Realistic Legged Locomotion without Examples [38.81854337592694]
This work proposes a generic approach for ensuring realism in locomotion by guiding the learning process with the spring-loaded inverted pendulum model as a reference.
We present experimental results showing that even in a model-free setup, the learned policies can generate realistic and energy-efficient locomotion gaits for a bipedal and a quadrupedal robot.
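The spring-loaded inverted pendulum (SLIP) reference mentioned above has simple stance-phase dynamics: a point mass on a massless spring leg with the foot pinned at the origin. A minimal sketch follows; the parameter values, integrator, and interface are placeholders, not the paper's setup.

```python
import numpy as np

def slip_stance_step(pos, vel, m=80.0, k=20000.0, r0=1.0, g=9.81, dt=1e-3):
    """Hypothetical sketch of SLIP stance dynamics used as a locomotion
    reference: spring force along the leg plus gravity, semi-implicit Euler."""
    r = np.linalg.norm(pos)
    leg_dir = pos / r
    spring_force = k * (r0 - r) * leg_dir          # force along the leg
    acc = spring_force / m + np.array([0.0, -g])   # add gravity
    vel = vel + dt * acc
    pos = pos + dt * vel
    return pos, vel

# Example: center of mass slightly behind the foot, moving forward at touchdown.
pos, vel = np.array([-0.1, 0.97]), np.array([1.2, 0.0])
for _ in range(300):
    pos, vel = slip_stance_step(pos, vel)
```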
arXiv Detail & Related papers (2021-10-28T10:14:47Z) - UniCon: Universal Neural Controller For Physics-based Character Motion [70.45421551688332]
We propose a physics-based universal neural controller (UniCon) that learns to master thousands of motions with different styles by learning on large-scale motion datasets.
UniCon can support keyboard-driven control, compose motion sequences drawn from a large pool of locomotion and acrobatics skills and teleport a person captured on video to a physics-based virtual avatar.
arXiv Detail & Related papers (2020-11-30T18:51:16Z)