From Experts to a Generalist: Toward General Whole-Body Control for Humanoid Robots
- URL: http://arxiv.org/abs/2506.12779v3
- Date: Tue, 02 Sep 2025 12:06:20 GMT
- Title: From Experts to a Generalist: Toward General Whole-Body Control for Humanoid Robots
- Authors: Yuxuan Wang, Ming Yang, Ziluo Ding, Yu Zhang, Weishuai Zeng, Xinrun Xu, Haobin Jiang, Zongqing Lu
- Abstract summary: BumbleBee is an expert-generalist learning framework that combines motion clustering and sim-to-real adaptation. Experiments in two simulation environments and on a real humanoid robot demonstrate that BB achieves state-of-the-art general whole-body control.
- Score: 35.26305396688982
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Achieving general agile whole-body control on humanoid robots remains a major challenge due to diverse motion demands and data conflicts. While existing frameworks excel in training single motion-specific policies, they struggle to generalize across highly varied behaviors due to conflicting control requirements and mismatched data distributions. In this work, we propose BumbleBee (BB), an expert-generalist learning framework that combines motion clustering and sim-to-real adaptation to overcome these challenges. BB first leverages an autoencoder-based clustering method to group behaviorally similar motions using motion features and motion descriptions. Expert policies are then trained within each cluster and refined with real-world data through iterative delta action modeling to bridge the sim-to-real gap. Finally, these experts are distilled into a unified generalist controller that preserves agility and robustness across all motion types. Experiments in two simulation environments and on a real humanoid robot demonstrate that BB achieves state-of-the-art general whole-body control, setting a new benchmark for agile, robust, and generalizable humanoid performance in the real world. The project webpage is available at https://beingbeyond.github.io/BumbleBee/.
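The pipeline the abstract outlines (cluster motions, train per-cluster experts, refine with delta actions, distill a generalist) can be summarized in a short sketch. This is not the authors' released code: every name below is hypothetical, and plain k-means over precomputed motion embeddings stands in for the paper's autoencoder-based clustering.

```python
# Hypothetical sketch of the BumbleBee-style expert-generalist pipeline.
# All names are illustrative; the paper's exact interfaces differ.
import numpy as np
from sklearn.cluster import KMeans

def cluster_motions(latents: np.ndarray, k: int = 5) -> np.ndarray:
    """Group behaviorally similar motions. The paper clusters autoencoder
    features of motions and descriptions; k-means over precomputed
    embeddings stands in for that step here."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(latents)

def train_expert(cluster_id: int, motions: list):
    """Placeholder: train one tracking policy per cluster (e.g., with RL),
    then refine it on real-world data via iterative delta-action modeling
    (see the ASAP entry below for that idea)."""
    raise NotImplementedError

def distill_generalist(experts: dict, motions: list):
    """Placeholder: supervised distillation of the per-cluster experts
    into one generalist whole-body controller."""
    raise NotImplementedError

latents = np.random.randn(200, 32)  # stand-in motion embeddings
labels = cluster_motions(latents)   # one expert would be trained per label
```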
Related papers
- ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation [55.467742403416175]
We introduce a physics-driven neural algorithm that translates large-scale motion capture to humanoid embodiments. We learn a unified multimodal controller that supports both dense references and sparse task specifications. Results show that ULTRA generalizes to autonomous, goal-conditioned whole-body loco-manipulation from egocentric perception.
arXiv Detail & Related papers (2026-03-03T18:59:29Z)
- Embodiment-Aware Generalist Specialist Distillation for Unified Humanoid Whole-Body Control [34.056581843277904]
We introduce an iterative generalist-specialist distillation framework that produces a single unified policy that controls multiple humanoids. We conducted experiments on five different robots in simulation and four in real-world settings.
arXiv Detail & Related papers (2026-02-03T00:58:29Z)
- FRoM-W1: Towards General Humanoid Whole-Body Control with Language Instructions [147.04372611893032]
We present FRoM-W1, an open-source framework designed to achieve general humanoid whole-body motion control using natural language. We extensively evaluate FRoM-W1 on Unitree H1 and G1 robots. Results demonstrate superior performance on the HumanML3D-X benchmark for human whole-body motion generation.
arXiv Detail & Related papers (2026-01-19T07:59:32Z)
- HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies [83.41714103649751]
Development of embodied intelligence models depends on access to high-quality robot demonstration data. We present HiMoE-VLA, a novel vision-language-action framework tailored to handle diverse, heterogeneous robotic data. HiMoE-VLA demonstrates a consistent performance boost over existing VLA baselines, achieving higher accuracy and robust generalization.
arXiv Detail & Related papers (2025-12-05T13:21:05Z)
- DemoHLM: From One Demonstration to Generalizable Humanoid Loco-Manipulation [29.519071338337685]
We present DemoHLM, a framework for humanoid loco-manipulation on a real humanoid robot from a single demonstration in simulation. Its whole-body controller maps whole-body motion commands to joint torques and provides omnidirectional mobility for the humanoid robot. Experiments show a positive correlation between the amount of synthetic data and policy performance.
arXiv Detail & Related papers (2025-10-13T10:49:40Z)
- KungfuBot2: Learning Versatile Motion Skills for Humanoid Whole-Body Control [30.738592041595933]
We present VMS, a unified whole-body controller that enables humanoid robots to learn diverse and dynamic behaviors within a single policy. Our framework integrates a hybrid tracking objective that balances local motion fidelity with global trajectory consistency. We validate VMS extensively in both simulation and real-world experiments, demonstrating accurate imitation of dynamic skills, stable performance over minute-long sequences, and strong generalization to unseen motions.
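One way to read the "hybrid tracking objective" is as a weighted blend of a local pose-fidelity term and a global root-trajectory term. The sketch below illustrates that idea only; the weights, scales, and distance functions are assumptions, not the paper's formulation.

```python
import numpy as np

# Hedged sketch of a hybrid tracking objective: blend local motion
# fidelity (per-joint pose error) with global trajectory consistency
# (root position error). All constants here are assumptions.
def hybrid_tracking_reward(q, q_ref, root, root_ref, w_local=0.7):
    local = np.exp(-2.0 * np.sum((q - q_ref) ** 2))        # local fidelity
    glob = np.exp(-0.5 * np.sum((root - root_ref) ** 2))   # global consistency
    return w_local * local + (1.0 - w_local) * glob
```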
arXiv Detail & Related papers (2025-09-20T11:31:14Z)
- GBC: Generalized Behavior-Cloning Framework for Whole-Body Humanoid Imitation [5.426712963311386]
Generalized Behavior Cloning (GBC) is a comprehensive and unified solution to the end-to-end challenge of whole-body humanoid imitation. First, an adaptive data pipeline leverages a differentiable IK network to automatically retarget any human MoCap data to any humanoid. Second, our novel DAgger-MMPPO algorithm with its MMTransformer architecture learns robust, high-fidelity imitation policies.
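DAgger-style imitation, which DAgger-MMPPO builds on, alternates student-driven rollouts with expert relabeling. A generic sketch follows, with the MMPPO/MMTransformer specifics omitted; the gymnasium-style `env` and the callable `expert`/`student` interfaces are assumptions.

```python
# Generic DAgger loop (Ross et al., 2011); not the paper's DAgger-MMPPO.
# env is assumed gymnasium-style; student.fit is an assumed sklearn-like
# supervised-regression interface.
def dagger(env, expert, student, n_iters=10, horizon=1000):
    states, actions = [], []
    for _ in range(n_iters):
        obs, _ = env.reset()
        for _ in range(horizon):
            states.append(obs)
            actions.append(expert(obs))          # expert relabels every state
            obs, _, term, trunc, _ = env.step(student(obs))  # student drives
            if term or trunc:
                obs, _ = env.reset()
        student.fit(states, actions)             # regress onto expert actions
    return student
```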
arXiv Detail & Related papers (2025-08-13T17:28:39Z)
- Modular Recurrence in Contextual MDPs for Universal Morphology Control [0.0]
Generalization to new, unseen robots remains a challenge. We implement a modular recurrent architecture and evaluate its generalization performance on a large set of MuJoCo robots.
arXiv Detail & Related papers (2025-06-10T09:44:30Z)
- GENMO: A GENeralist Model for Human MOtion [64.16188966024542]
We present GENMO, a unified Generalist Model for Human Motion that bridges motion estimation and generation in a single framework. Our key insight is to reformulate motion estimation as constrained motion generation, where the output motion must precisely satisfy observed conditioning signals. Our novel architecture handles variable-length motions and mixed multimodal conditions (text, audio, video) at different time intervals, offering flexible control.
arXiv Detail & Related papers (2025-05-02T17:59:55Z)
- ModSkill: Physical Character Skill Modularization [21.33764810227885]
We introduce a novel skill learning framework, ModSkill, that decouples complex full-body skills into compositional, modular skills for independent body parts. Our results show that this modularized skill learning framework, enhanced by generative sampling, outperforms existing methods in precise full-body motion tracking.
arXiv Detail & Related papers (2025-02-19T22:55:49Z)
- ASAP: Aligning Simulation and Real-World Physics for Learning Agile Humanoid Whole-Body Skills [46.16771391136412]
ASAP is a two-stage framework designed to tackle the dynamics mismatch and enable agile humanoid whole-body skills. In the first stage, we pre-train motion tracking policies in simulation using retargeted human motion data. In the second stage, we deploy the policies in the real world and collect real-world data to train a delta (residual) action model.
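The delta (residual) action model can be read as a learned correction delta(s, a) such that simulating under a' = a + delta(s, a) better matches real-world trajectories. A minimal PyTorch sketch, where the network size and training signal are assumptions rather than the paper's code:

```python
import torch
import torch.nn as nn

# Minimal sketch of a delta (residual) action model in the spirit of
# ASAP's second stage. Dimensions and architecture are assumptions.
class DeltaActionModel(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # Corrected action replayed in simulation: a' = a + delta(s, a)
        return act + self.net(torch.cat([obs, act], dim=-1))
```

Training would minimize the gap between simulated rollouts under the corrected actions and logged real-world trajectories, after which policies can be fine-tuned in the corrected simulator.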
arXiv Detail & Related papers (2025-02-03T08:22:46Z)
- Universal Actions for Enhanced Embodied Foundation Models [25.755178700280933]
We introduce UniAct, a new embodied foundation modeling framework operating in a Universal Action Space. Our learned universal actions capture the generic atomic behaviors across diverse robots by exploiting their shared structural features. Our 0.5B instantiation of UniAct outperforms 14X larger SOTA embodied foundation models in extensive evaluations on various real-world and simulation robots.
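One common way to realize a shared action space across embodiments is a discrete codebook of atomic behaviors with per-robot decoders. The sketch below illustrates that general pattern under assumed dimensions; it is not necessarily UniAct's actual design.

```python
import torch
import torch.nn as nn

# Assumption-level illustration of a universal action space: a codebook
# shared across robots plus per-robot decoders to joint commands.
class UniversalActions(nn.Module):
    def __init__(self, n_codes=256, code_dim=64, robot_act_dims=None):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, code_dim)   # shared behaviors
        self.decoders = nn.ModuleDict({
            name: nn.Linear(code_dim, d)                  # embodiment-specific
            for name, d in (robot_act_dims or {}).items()
        })

    def forward(self, code_ids: torch.Tensor, robot: str) -> torch.Tensor:
        return self.decoders[robot](self.codebook(code_ids))
```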
arXiv Detail & Related papers (2025-01-17T10:45:22Z)
- The One RING: a Robotic Indoor Navigation Generalist [58.30694487843546]
RING (Robotic Indoor Navigation Generalist) is an embodiment-agnostic policy that turns any mobile robot into an effective indoor semantic navigator. Trained entirely in simulation, RING leverages large-scale randomization over robot embodiments to enable robust generalization to many real-world platforms.
arXiv Detail & Related papers (2024-12-18T23:15:41Z)
- CrowdMoGen: Zero-Shot Text-Driven Collective Motion Generation [43.12717215650305]
We present CrowdMoGen, the first zero-shot framework for collective motion generation. CrowdMoGen effectively groups individuals and generates event-aligned motion sequences from text prompts. As the first framework of collective motion generation, CrowdMoGen has the potential to advance applications in urban simulation, crowd planning, and other large-scale interactive environments.
arXiv Detail & Related papers (2024-07-08T17:59:36Z)
- RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis [102.1876259853457]
We propose a tree-structured multimodal code generation framework for generalized robotic behavior synthesis, termed RoboCodeX.
RoboCodeX decomposes high-level human instructions into multiple object-centric manipulation units consisting of physical preferences such as affordance and safety constraints.
To further enhance the capability to map conceptual and perceptual understanding into control commands, a specialized multimodal reasoning dataset is collected for pre-training and an iterative self-updating methodology is introduced for supervised fine-tuning.
arXiv Detail & Related papers (2024-02-25T15:31:43Z)
- Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models. Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning. Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z)
- Model Predictive Control for Fluid Human-to-Robot Handovers [50.72520769938633]
Motion planning that takes human comfort into account is typically not part of the human-robot handover process.
We propose to generate smooth motions via an efficient model-predictive control framework.
We conduct human-to-robot handover experiments on a diverse set of objects with several users.
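Model-predictive control in general works by optimizing a short trajectory, applying only the first control, and re-planning. A generic receding-horizon sketch follows; the cost terms and dynamics are placeholders, not the paper's handover formulation.

```python
import numpy as np

# Generic receding-horizon MPC loop. solve(x, horizon) is assumed to
# return an optimized control sequence (e.g., minimizing tracking plus
# smoothness costs); dynamics(x, u) returns the next state.
def mpc_control(x0, dynamics, solve, horizon=20, n_steps=100):
    x, traj = x0, [x0]
    for _ in range(n_steps):
        u_seq = solve(x, horizon)    # optimize over the full horizon
        x = dynamics(x, u_seq[0])    # apply only the first control, re-plan
        traj.append(x)
    return np.array(traj)
```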
arXiv Detail & Related papers (2022-03-31T23:08:20Z)
- On the Emergence of Whole-body Strategies from Humanoid Robot Push-recovery Learning [32.070068456106895]
We apply model-free Deep Reinforcement Learning for training a general and robust humanoid push-recovery policy in a simulation environment.
Our method targets high-dimensional whole-body humanoid control and is validated on the iCub humanoid.
arXiv Detail & Related papers (2021-04-29T17:49:20Z)
- Deep Imitation Learning for Bimanual Robotic Manipulation [70.56142804957187]
We present a deep imitation learning framework for robotic bimanual manipulation.
A core challenge is to generalize the manipulation skills to objects in different locations.
We propose to (i) decompose the multi-modal dynamics into elemental movement primitives, (ii) parameterize each primitive using a recurrent graph neural network to capture interactions, and (iii) integrate a high-level planner that composes primitives sequentially and a low-level controller to combine primitive dynamics and inverse kinematics control.
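The (i)-(iii) decomposition reads as a high-level planner that sequences primitives and a low-level controller that tracks each one. A toy sketch with illustrative names follows; the paper's recurrent graph-neural-network primitive dynamics and inverse kinematics are stubbed out as callables.

```python
# Toy sketch of the (i)-(iii) decomposition: a planner emits a primitive
# sequence; a low-level controller tracks each primitive's targets.
# Names are illustrative; GNN dynamics and IK are stubbed as callables.
from typing import Callable, Dict, List

def execute_plan(
    plan: List[str],                  # (iii) planner output: primitive names
    primitives: Dict[str, Callable],  # (i)/(ii) per-primitive target generators
    track: Callable,                  # low-level IK/controller step
    state,
    steps_per_primitive: int = 50,
):
    for name in plan:
        target_fn = primitives[name]
        for _ in range(steps_per_primitive):
            state = track(state, target_fn(state))  # follow primitive target
    return state
```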
arXiv Detail & Related papers (2020-10-11T01:40:03Z)