WholeBodyVLA: Towards Unified Latent VLA for Whole-Body Loco-Manipulation Control
- URL: http://arxiv.org/abs/2512.11047v2
- Date: Mon, 15 Dec 2025 07:46:35 GMT
- Title: WholeBodyVLA: Towards Unified Latent VLA for Whole-Body Loco-Manipulation Control
- Authors: Haoran Jiang, Jin Chen, Qingwen Bu, Li Chen, Modi Shi, Yanjie Zhang, Delong Li, Chuanzhe Suo, Chuang Wang, Zhihui Peng, Hongyang Li
- Abstract summary: WholeBodyVLA is a unified framework for humanoid loco-manipulation. It is verified via comprehensive experiments on the AgiBot X2 humanoid, outperforming the prior baseline by 21.3%.
- Score: 22.617383494100253
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humanoid robots require precise locomotion and dexterous manipulation to perform challenging loco-manipulation tasks. Yet existing approaches, whether modular or end-to-end, are deficient in manipulation-aware locomotion. This confines the robot to a limited workspace and prevents it from performing large-space loco-manipulation. We attribute this to: (1) the challenge of acquiring loco-manipulation knowledge, due to the scarcity of humanoid teleoperation data, and (2) the difficulty of faithfully and reliably executing locomotion commands, stemming from the limited precision and stability of existing RL controllers. To acquire richer loco-manipulation knowledge, we propose a unified latent learning framework that enables a Vision-Language-Action (VLA) system to learn from low-cost, action-free egocentric videos. Moreover, an efficient human data collection pipeline is devised to augment the dataset and scale up these benefits. To execute the desired locomotion commands more precisely, we present a loco-manipulation-oriented (LMO) RL policy specifically tailored for accurate and stable core loco-manipulation movements such as advancing, turning, and squatting. Building on these components, we introduce WholeBodyVLA, a unified framework for humanoid loco-manipulation. To the best of our knowledge, WholeBodyVLA is among the first of its kind to enable large-space humanoid loco-manipulation. It is verified via comprehensive experiments on the AgiBot X2 humanoid, outperforming the prior baseline by 21.3%. It also demonstrates strong generalization and high extensibility across a broad range of tasks.
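The abstract does not detail the latent learning objective, but a common way to let a VLA learn from action-free video is a latent action model: an inverse-dynamics-style encoder compresses each frame transition into a latent pseudo-action, and a forward decoder must use that latent to reconstruct the next frame. Below is a minimal, hypothetical sketch of that idea; every module name, size, and the MSE objective are assumptions, not WholeBodyVLA's actual design.

```python
# Hypothetical latent action model (LAM) sketch: action-free video pairs
# (o_t, o_{t+1}) are compressed into a latent "action" z_t, which can later
# serve as a pseudo-action label for VLA pretraining. Not the paper's code.
import torch
import torch.nn as nn

class LatentActionModel(nn.Module):
    def __init__(self, obs_dim=64, latent_dim=8):
        super().__init__()
        # Inverse-dynamics-style encoder: (o_t, o_{t+1}) -> z_t
        self.encoder = nn.Sequential(
            nn.Linear(2 * obs_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        # Forward decoder: (o_t, z_t) -> predicted o_{t+1}
        self.decoder = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, 128), nn.ReLU(), nn.Linear(128, obs_dim))

    def forward(self, o_t, o_next):
        z = self.encoder(torch.cat([o_t, o_next], dim=-1))
        return self.decoder(torch.cat([o_t, z], dim=-1)), z

model = LatentActionModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
o_t, o_next = torch.randn(32, 64), torch.randn(32, 64)  # stand-ins for frame features
pred, z = model(o_t, o_next)
loss = nn.functional.mse_loss(pred, o_next)  # reconstruction forces z to carry the transition
opt.zero_grad(); loss.backward(); opt.step()
```

Under this reading, the bottleneck latents would stand in for teleoperation labels when pretraining the VLA's action head, which is what makes low-cost egocentric video usable at all.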
Related papers
- ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation [55.467742403416175]
We introduce a physics-driven neural algorithm that translates large-scale motion capture to humanoid embodiments. We learn a unified multimodal controller that supports both dense references and sparse task specifications. Results show that ULTRA generalizes to autonomous, goal-conditioned whole-body loco-manipulation from egocentric perception.
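The summary's "dense references and sparse task specifications" suggest a single conditioning interface covering both input modes. A toy sketch of such an interface follows; the shapes, mask convention, and function name are assumptions for illustration only.

```python
# Minimal sketch (not ULTRA's actual interface) of mapping either a dense
# motion reference or a sparse goal into one conditioning array plus a
# per-step validity mask, so a single controller can consume both.
import numpy as np

def build_condition(dense_ref=None, sparse_goal=None, horizon=16, dim=6):
    """Return a (horizon, dim) conditioning array and a validity mask."""
    cond = np.zeros((horizon, dim))
    mask = np.zeros(horizon, dtype=bool)
    if dense_ref is not None:          # dense reference: every step specified
        cond[: len(dense_ref)] = dense_ref[:horizon]
        mask[: len(dense_ref)] = True
    elif sparse_goal is not None:      # sparse spec: only the final step pinned
        cond[-1] = sparse_goal
        mask[-1] = True
    return cond, mask

cond, mask = build_condition(sparse_goal=np.array([1.0, 0, 0, 0, 0, 0.5]))
print(cond.shape, mask.sum())  # (16, 6) 1
```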
arXiv Detail & Related papers (2026-03-03T18:59:29Z)
- Imitating What Works: Simulation-Filtered Modular Policy Learning from Human Videos [56.510263910611684]
We tackle prehensile manipulation, in which tasks involve grasping an object before performing various post-grasp motions. Human videos offer strong signals for learning the post-grasp motions, but they are less useful for learning the prerequisite grasping behaviors. We present Perceive-Simulate-Imitate (PSI), a framework for training a modular manipulation policy using human video motion data.
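A rough reading of the pipeline: candidate grasps are vetted in simulation, and only the ones that succeed are kept for policy training, while post-grasp motions are imitated from human video. A toy filtering loop under that reading (the simulate_grasp stub and quality threshold are invented placeholders, not PSI's pipeline):

```python
# Toy "simulation-filtered" data curation: score candidate grasps in a
# physics rollout and keep only the successful ones as training data.
import random

def simulate_grasp(grasp):
    """Stand-in for a physics rollout; returns True if the grasp holds."""
    return grasp["quality"] > 0.6  # placeholder success criterion

candidates = [{"id": i, "quality": random.random()} for i in range(100)]
kept = [g for g in candidates if simulate_grasp(g)]
print(f"kept {len(kept)}/{len(candidates)} grasps for imitation")
```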
arXiv Detail & Related papers (2026-02-13T18:59:10Z)
- CHIP: Adaptive Compliance for Humanoid Control through Hindsight Perturbation [70.5382178207975]
Adaptive Compliance for Humanoid control through hIndsight Perturbation (CHIP) is a plug-and-play module that enables controllable end-effector stiffness. CHIP is easy to implement and requires neither data augmentation nor additional reward tuning. We show that a generalist motion-tracking controller trained with CHIP can perform a diverse set of forceful manipulation tasks.
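One plausible reading of hindsight perturbation is that the tracking reference is relabeled after the fact with the displacement the end-effector actually experienced, scaled by a commanded stiffness, so one controller spans both stiff and compliant behavior. A toy sketch under that assumption (gains and names are illustrative, not CHIP's formulation):

```python
# Hedged sketch of hindsight relabeling for compliance: low commanded
# stiffness lets the tracking target yield to the observed perturbation.
import numpy as np

def hindsight_relabel(ee_ref, perturb, stiffness):
    """Soften the reference in proportion to commanded compliance."""
    compliance = 1.0 - stiffness          # stiffness in [0, 1]
    return ee_ref + compliance * perturb  # relabeled tracking target

ee_ref = np.array([0.4, 0.0, 1.1])        # nominal end-effector target (m)
perturb = np.array([0.05, -0.02, 0.0])    # displacement measured in hindsight
for k in (0.1, 0.9):
    print(k, hindsight_relabel(ee_ref, perturb, k))
```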
arXiv Detail & Related papers (2025-12-16T18:56:04Z)
- METIS: Multi-Source Egocentric Training for Integrated Dexterous Vision-Language-Action Model [36.82365894983052]
A major bottleneck lies in the scarcity of large-scale, action-annotated data for dexterous skills. We propose METIS, a vision-language-action model for dexterous manipulation pretrained on egocentric datasets. Our method demonstrates exceptional dexterous manipulation capabilities, achieving the highest average success rate in six real-world tasks.
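Pooling multiple egocentric sources typically requires a shared sample format in which action-labeled data is used directly and action-free data receives pseudo-actions, e.g. from an inverse-dynamics model. The sketch below illustrates that pattern; the to_sample helper and the stub labeler are hypothetical, not METIS's interface.

```python
# Illustrative multi-source pooling: real action labels pass through,
# action-free frame pairs get pseudo-labels from an inverse-dynamics model.
def to_sample(frame_pair, action=None, idm=None):
    if action is None and idm is not None:
        action = idm(frame_pair)          # pseudo-label action-free data
    return {"frames": frame_pair, "action": action}

idm = lambda pair: [0.0, 0.1]             # stand-in inverse-dynamics model
labeled = to_sample(("f0", "f1"), action=[0.2, -0.1])
pseudo = to_sample(("g0", "g1"), idm=idm)
print(labeled["action"], pseudo["action"])
```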
arXiv Detail & Related papers (2025-11-21T16:32:36Z)
- DemoHLM: From One Demonstration to Generalizable Humanoid Loco-Manipulation [29.519071338337685]
We present DemoHLM, a framework for humanoid loco-manipulation on a real humanoid robot from a single demonstration in simulation. A whole-body controller maps whole-body motion commands to joint torques and provides omnidirectional mobility for the humanoid robot. Experiments show a positive correlation between the amount of synthetic data and policy performance.
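Since the summary reports performance scaling with synthetic data, a minimal picture of single-demo data generation is randomizing initial conditions around the demonstration and shifting the plan accordingly. A toy version follows; the dictionary schema and noise model are assumptions, not DemoHLM's generator.

```python
# Toy synthesis of many training episodes from one simulated demonstration
# by perturbing the object pose and shifting the demo's waypoints to match.
import numpy as np

rng = np.random.default_rng(0)
demo = {"object_xy": np.array([0.5, 0.0]), "waypoints": np.zeros((10, 3))}

def synthesize(demo, n=1000, noise=0.05):
    episodes = []
    for _ in range(n):
        delta = rng.normal(0.0, noise, size=2)        # perturb object pose
        ep = {"object_xy": demo["object_xy"] + delta,
              "waypoints": demo["waypoints"] + np.r_[delta, 0.0]}  # shift plan
        episodes.append(ep)
    return episodes

print(len(synthesize(demo)))  # 1000 synthetic episodes from one demo
```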
arXiv Detail & Related papers (2025-10-13T10:49:40Z)
- ResMimic: From General Motion Tracking to Humanoid Whole-body Loco-Manipulation via Residual Learning [59.64325421657381]
Humanoid whole-body loco-manipulation promises transformative capabilities for daily service and warehouse tasks. We introduce ResMimic, a two-stage residual learning framework for precise and expressive humanoid control from human motion data. Results show substantial gains in task success, training efficiency, and robustness over strong baselines.
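The two-stage residual recipe is straightforward to sketch: a pretrained motion-tracking policy supplies a base action and a task-specific residual policy adds a correction on top. The stubs below illustrate the composition only; dimensions and scales are assumptions.

```python
# Minimal residual-control composition: frozen base tracker + learned
# task-specific residual, summed into the final whole-body action.
import numpy as np

def base_policy(obs):                 # stage 1: pretrained motion tracker (stub)
    return np.tanh(obs[:12])

def residual_policy(obs):             # stage 2: learned residual correction (stub)
    return 0.1 * np.tanh(obs[12:24])

def act(obs, residual_scale=1.0):
    return base_policy(obs) + residual_scale * residual_policy(obs)

obs = np.random.randn(48)
print(act(obs).shape)  # (12,) combined whole-body action
```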
arXiv Detail & Related papers (2025-10-06T17:47:02Z)
- Humanoid Agent via Embodied Chain-of-Action Reasoning with Multimodal Foundation Models for Zero-Shot Loco-Manipulation [23.43820490179566]
Humanoid-COA integrates foundation model reasoning with an Embodied Chain-of-Action mechanism for zero-shot loco-manipulation. Our framework substantially outperforms prior baselines across manipulation, locomotion, and loco-manipulation tasks.
arXiv Detail & Related papers (2025-04-13T11:37:32Z)
- HOMIE: Humanoid Loco-Manipulation with Isomorphic Exoskeleton Cockpit [52.12750762494588]
This paper introduces HOMIE, a semi-autonomous teleoperation system. It combines a reinforcement learning policy for body control mapped to a pedal, an isomorphic exoskeleton arm for arm control, and motion-sensing gloves for hand control. The system is fully open-source; demos and code can be found at https://homietele.org/.
arXiv Detail & Related papers (2025-02-18T16:33:38Z)
- DexterityGen: Foundation Controller for Unprecedented Dexterity [67.15251368211361]
Teaching robots dexterous manipulation skills, such as tool use, presents a significant challenge. Current approaches can be broadly categorized into two strategies: human teleoperation (for imitation learning) and sim-to-real reinforcement learning. We introduce DexterityGen, which uses RL to pretrain large-scale dexterous motion primitives, such as in-hand rotation or translation. In the real world, we use human teleoperation as a prompt to the controller to produce highly dexterous behavior.
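The described division of labor, RL-pretrained primitives steered by human teleoperation prompts, can be caricatured as a dispatch from coarse commands to primitive policies. The primitive table below is an invented stand-in, not DexterityGen's controller.

```python
# Toy prompt-to-primitive dispatch: a coarse teleop command selects a
# pretrained motion primitive, which emits low-level joint targets.
import numpy as np

PRIMITIVES = {
    "rotate": lambda q: q + 0.02 * np.sin(np.arange(q.size)),  # stub policies
    "translate": lambda q: q + 0.01,
}

def controller_step(joint_pos, teleop_cmd):
    """Map a coarse teleop prompt to a primitive's joint targets."""
    primitive = PRIMITIVES.get(teleop_cmd, lambda q: q)  # default: hold pose
    return primitive(joint_pos)

q = np.zeros(16)                      # 16-DoF hand joint positions
print(controller_step(q, "rotate")[:4])
```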
arXiv Detail & Related papers (2025-02-06T18:49:35Z)
- Mobile-TeleVision: Predictive Motion Priors for Humanoid Whole-Body Control [18.269588421166503]
Humanoid robots require robust lower-body locomotion and precise upper-body manipulation. Recent reinforcement learning approaches provide whole-body loco-manipulation policies, but lack precise manipulation. We introduce upper-body kinematic control using inverse kinematics (IK) and motion priors for precise manipulation. We show that CVAE features are crucial for stability and robustness, and that our approach significantly outperforms RL-based whole-body control in precise manipulation.
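To make the CVAE motion-prior idea concrete: a latent sampled from the prior is decoded into a short lower-body pose rollout conditioned on the velocity command, while the upper body is handled separately via IK. The untrained decoder below only illustrates the data flow; all sizes are assumptions.

```python
# Sketch of decoding a CVAE prior sample into a lower-body pose rollout
# conditioned on a velocity command. Untrained stand-in, not the paper's model.
import torch
import torch.nn as nn

latent_dim, cmd_dim, pose_dim, horizon = 16, 3, 12, 8
decoder = nn.Sequential(
    nn.Linear(latent_dim + cmd_dim, 128), nn.ReLU(),
    nn.Linear(128, horizon * pose_dim))

z = torch.randn(1, latent_dim)                 # sample from the CVAE prior
cmd = torch.tensor([[0.5, 0.0, 0.1]])          # vx, vy, yaw-rate command
poses = decoder(torch.cat([z, cmd], dim=-1)).view(1, horizon, pose_dim)
print(poses.shape)  # torch.Size([1, 8, 12]) lower-body pose rollout
```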
arXiv Detail & Related papers (2024-12-10T18:59:50Z)
- Reinforcement Learning for Versatile, Dynamic, and Robust Bipedal Locomotion Control [106.32794844077534]
This paper presents a study on using deep reinforcement learning to create dynamic locomotion controllers for bipedal robots.
We develop a general control solution that can be used for a range of dynamic bipedal skills, from periodic walking and running to aperiodic jumping and standing.
This work pushes the limits of agility for bipedal robots through extensive real-world experiments.
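A common recipe for covering both periodic skills (walking, running) and aperiodic ones (standing, jumping) with one policy is a gait-clock observation that can be pulsed or frozen; the sketch below shows only that conditioning signal, and is not necessarily this paper's exact formulation.

```python
# Gait-clock conditioning: sinusoidal phase features drive periodic gaits,
# while a frozen clock encodes an aperiodic skill such as standing.
import numpy as np

def gait_clock(t, period=0.8, active=True):
    """Return (sin, cos) phase features; freeze the phase for aperiodic skills."""
    phase = (t % period) / period if active else 0.0
    return np.array([np.sin(2 * np.pi * phase), np.cos(2 * np.pi * phase)])

obs_walk = gait_clock(0.3)                 # periodic walking phase
obs_stand = gait_clock(0.3, active=False)  # frozen clock -> standing
print(obs_walk, obs_stand)
```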
arXiv Detail & Related papers (2024-01-30T10:48:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.