TrajBooster: Boosting Humanoid Whole-Body Manipulation via Trajectory-Centric Learning
- URL: http://arxiv.org/abs/2509.11839v2
- Date: Wed, 17 Sep 2025 03:11:49 GMT
- Title: TrajBooster: Boosting Humanoid Whole-Body Manipulation via Trajectory-Centric Learning
- Authors: Jiacheng Liu, Pengxiang Ding, Qihang Zhou, Yuxuan Wu, Da Huang, Zimian Peng, Wei Xiao, Weinan Zhang, Lixin Yang, Cewu Lu, Donglin Wang
- Abstract summary: We present TrajBooster, a cross-embodiment framework that leverages abundant wheeled-humanoid data to boost bipedal VLA. Our key idea is to use end-effector trajectories as a morphology-agnostic interface. Results show that TrajBooster allows existing wheeled-humanoid data to efficiently strengthen bipedal humanoid VLA performance.
- Score: 79.59753528758361
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent Vision-Language-Action models show potential to generalize across embodiments but struggle to quickly align with a new robot's action space when high-quality demonstrations are scarce, especially for bipedal humanoids. We present TrajBooster, a cross-embodiment framework that leverages abundant wheeled-humanoid data to boost bipedal VLA. Our key idea is to use end-effector trajectories as a morphology-agnostic interface. TrajBooster (i) extracts 6D dual-arm end-effector trajectories from real-world wheeled humanoids, (ii) retargets them in simulation to Unitree G1 with a whole-body controller trained via a heuristic-enhanced harmonized online DAgger to lift low-dimensional trajectory references into feasible high-dimensional whole-body actions, and (iii) forms heterogeneous triplets that couple source vision/language with target humanoid-compatible actions to post-pre-train a VLA, followed by only 10 minutes of teleoperation data collection on the target humanoid domain. Deployed on Unitree G1, our policy achieves beyond-tabletop household tasks, enabling squatting, cross-height manipulation, and coordinated whole-body motion with markedly improved robustness and generalization. Results show that TrajBooster allows existing wheeled-humanoid data to efficiently strengthen bipedal humanoid VLA performance, reducing reliance on costly same-embodiment data while enhancing action space understanding and zero-shot skill transfer capabilities. For more details, please refer to our project page: https://jiachengliu3.github.io/TrajBooster/.
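To make step (ii) more concrete, below is a minimal sketch of a plain online-DAgger loop of the kind the whole-body controller training builds on. The `sim`, `expert`, and `policy` interfaces are hypothetical placeholders, and the paper's heuristic-enhanced harmonized variant adds machinery not shown here.

```python
# Minimal online-DAgger-style loop for lifting low-dimensional end-effector
# trajectory references into whole-body actions. All interfaces (sim, expert,
# policy) are assumed placeholders, not the paper's actual code.
import numpy as np

def online_dagger(policy, expert, sim, n_iters=100, horizon=500, beta0=0.9):
    dataset = []  # aggregated (observation, expert_action) pairs
    for it in range(n_iters):
        beta = beta0 ** it  # anneal the expert mixing rate over iterations
        obs = sim.reset()
        for t in range(horizon):
            a_expert = expert.act(obs)       # heuristic/expert label
            a_policy = policy.act(obs)       # current learner action
            dataset.append((obs, a_expert))  # always label with the expert
            a = a_expert if np.random.rand() < beta else a_policy
            obs = sim.step(a)                # roll out the mixed action
        policy.fit(dataset)                  # supervised fit on the aggregate
    return policy
```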
Related papers
- ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation [55.467742403416175]
We introduce a physics-driven neural algorithm that translates large-scale motion capture to humanoid embodiments. We learn a unified multimodal controller that supports both dense references and sparse task specifications. Results show that ULTRA generalizes to autonomous, goal-conditioned whole-body loco-manipulation from egocentric perception.
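A rough sketch of what a unified controller accepting either dense motion references or sparse task specifications could look like; the module names and zero-filled conditioning are assumptions for illustration, not ULTRA's actual architecture.

```python
# Illustrative controller conditioned on either a dense motion reference or a
# sparse task specification, with absent modalities contributing nothing.
import torch
import torch.nn as nn

class MultimodalController(nn.Module):
    def __init__(self, obs_dim, ref_dim, goal_dim, act_dim, hidden=256):
        super().__init__()
        self.ref_enc = nn.Linear(ref_dim, hidden)
        self.goal_enc = nn.Linear(goal_dim, hidden)
        self.policy = nn.Sequential(
            nn.Linear(obs_dim + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim))

    def forward(self, obs, ref=None, goal=None):
        # Encode whichever conditioning signal is present; zeros if missing.
        cond = torch.zeros(obs.shape[0], self.ref_enc.out_features,
                           device=obs.device)
        if ref is not None:
            cond = cond + self.ref_enc(ref)    # dense reference branch
        if goal is not None:
            cond = cond + self.goal_enc(goal)  # sparse task-spec branch
        return self.policy(torch.cat([obs, cond], dim=-1))
```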
arXiv Detail & Related papers (2026-03-03T18:59:29Z)
- ResMimic: From General Motion Tracking to Humanoid Whole-body Loco-Manipulation via Residual Learning [59.64325421657381]
Humanoid whole-body loco-manipulation promises transformative capabilities for daily service and warehouse tasks. We introduce ResMimic, a two-stage residual learning framework for precise and expressive humanoid control from human motion data. Results show substantial gains in task success, training efficiency, and robustness over strong baselines.
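The two-stage residual idea reads as a frozen general tracking policy plus a learned corrective term. A minimal sketch, with all interfaces assumed:

```python
# Residual-learning composition: a frozen general motion tracker provides a
# base action, and a task-specific residual policy refines it. Names are
# illustrative assumptions, not ResMimic's actual code.
import torch

@torch.no_grad()
def base_action(base_policy, obs):
    return base_policy(obs)  # stage 1: frozen general tracking policy

def act(base_policy, residual_policy, obs):
    a_base = base_action(base_policy, obs)
    a_res = residual_policy(obs, a_base)  # stage 2: learned correction
    return a_base + a_res                 # residual composition
```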
arXiv Detail & Related papers (2025-10-06T17:47:02Z)
- OmniRetarget: Interaction-Preserving Data Generation for Humanoid Whole-Body Loco-Manipulation and Scene Interaction [76.44108003274955]
A dominant paradigm for teaching humanoid robots complex skills is to retarget human motions as kinematic references to train reinforcement learning policies. We introduce OmniRetarget, an interaction-preserving data generation engine based on an interaction mesh. By minimizing the Laplacian deformation between the human and robot meshes, OmniRetarget generates kinematically feasible trajectories.
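The Laplacian-deformation objective can be sketched with a uniform-weight mesh Laplacian: each vertex's offset from its neighbors' centroid should be preserved when mapping human-mesh vertices to robot-mesh vertices. Function and argument names here are illustrative assumptions, not the engine's actual code.

```python
# Uniform-weight Laplacian deformation energy between two interaction meshes.
import numpy as np

def laplacian_coords(V, neighbors):
    """V: (n, 3) vertex positions; neighbors: list of neighbor-index lists."""
    L = np.zeros_like(V)
    for i, nbrs in enumerate(neighbors):
        L[i] = V[i] - V[nbrs].mean(axis=0)  # offset from neighbor centroid
    return L

def deformation_energy(V_human, V_robot, neighbors):
    d = laplacian_coords(V_human, neighbors) - laplacian_coords(V_robot, neighbors)
    return float((d ** 2).sum())  # quantity a retargeting optimizer would minimize
```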
arXiv Detail & Related papers (2025-09-30T17:59:02Z)
- RobotDancing: Residual-Action Reinforcement Learning Enables Robust Long-Horizon Humanoid Motion Tracking [50.200035833530876]
RobotDancing is a simple, scalable framework that predicts residual joint targets to explicitly correct dynamics discrepancies. It can track multi-minute, high-energy behaviors (jumps, spins, cartwheels) and deploys zero-shot to hardware with high motion tracking quality.
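The residual joint-target idea fits in one function: a bounded correction added to the reference targets. The `policy` interface and scaling are hypothetical:

```python
# Residual-action sketch: the policy outputs small corrections to reference
# joint targets rather than absolute targets. Illustrative interface only.
import numpy as np

def corrected_targets(q_ref, policy, obs, scale=0.1):
    delta = policy.act(obs)  # residual in joint space
    return q_ref + scale * np.clip(delta, -1.0, 1.0)  # bounded correction
```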
arXiv Detail & Related papers (2025-09-25T03:30:34Z)
- Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos [66.62109400603394]
We introduce Being-H0, a dexterous Vision-Language-Action model trained on large-scale human videos. Our approach centers on physical instruction tuning, a novel training paradigm that combines large-scale VLA pretraining from human videos, physical space alignment for 3D reasoning, and post-training adaptation for robotic tasks. We empirically demonstrate the strength of Being-H0 in hand motion generation and instruction following, and show that it scales well with model and data sizes.
arXiv Detail & Related papers (2025-07-21T13:19:09Z)
- Conditioning Matters: Training Diffusion Policies is Faster Than You Think [69.31534053485711]
Diffusion policies have emerged as a mainstream paradigm for building vision-language-action (VLA) models. We identify a fundamental challenge in conditional diffusion policy training: when generative conditions are hard to distinguish, the training objective degenerates into modeling the marginal action distribution. We propose Cocos, a solution that modifies the source distribution in conditional flow matching to be condition-dependent.
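A minimal sketch of one flow-matching training step with a condition-dependent source distribution, which is the modification the summary describes. Here `cond_embed` and `velocity_net` are assumed modules, and a Gaussian centered on the condition embedding is one plausible instantiation rather than necessarily the paper's exact choice.

```python
# Conditional flow-matching step where the source sample x0 depends on the
# condition c instead of being drawn from a standard Gaussian.
import torch

def cocos_style_loss(velocity_net, cond_embed, x1, c, sigma=0.5):
    x0 = cond_embed(c) + sigma * torch.randn_like(x1)  # condition-dependent source
    t = torch.rand(x1.shape[0], 1)            # random interpolation times
    x_t = (1 - t) * x0 + t * x1               # linear probability path
    v_target = x1 - x0                        # target velocity field
    v_pred = velocity_net(x_t, t, c)
    return ((v_pred - v_target) ** 2).mean()  # flow-matching regression loss
```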
arXiv Detail & Related papers (2025-05-16T11:14:22Z)
- Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method that efficiently fine-tunes pretrained weights while enhancing robustness and generalization.
A self-regularization strategy is further employed to maintain the stability of the zero-shot generalization of VLMs; the method is dubbed OrthSR.
For the first time, we revisit CLIP and CoOp with our method to effectively improve the models in the few-shot image classification scenario.
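A generic orthogonal fine-tuning sketch via the Cayley transform, which keeps the adapted weight an isometric rotation of the pretrained one (preserving norms and pairwise angles); this illustrates the general technique, not OrthSR's exact parameterization:

```python
# Orthogonal fine-tuning: rotate a frozen pretrained weight W with a learned
# orthogonal matrix Q parameterized by the Cayley transform.
import torch
import torch.nn as nn

class OrthogonalAdapter(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.A = nn.Parameter(torch.zeros(dim, dim))  # Q is the identity at init

    def forward(self, W):
        S = self.A - self.A.T                 # skew-symmetric matrix
        I = torch.eye(W.shape[0], device=W.device)
        Q = torch.linalg.solve(I + S, I - S)  # Cayley transform: Q is orthogonal
        return Q @ W                          # rotated pretrained weight
```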
arXiv Detail & Related papers (2024-07-11T10:35:53Z)
- H-GAP: Humanoid Control with a Generalist Planner [45.50995825122686]
Humanoid Generalist Autoencoding Planner (H-GAP) is a generative model trained on humanoid trajectories derived from human motion-capture data.
For a 56-degrees-of-freedom humanoid, we empirically demonstrate that H-GAP learns to represent and generate a wide range of motor behaviours.
We also conduct a series of empirical studies on the scaling properties of H-GAP, showing the potential for performance gains from additional data, but not from additional compute.
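A planning-by-sampling sketch over a generative trajectory model of this kind: decode candidate latent sequences into trajectories, score them, and execute the first step of the best one. The `model.decode`, `latent_dim`, and `score_fn` interfaces are assumed placeholders, not H-GAP's actual API.

```python
# Receding-horizon planning with a generative trajectory model.
import numpy as np

def plan(model, score_fn, state, n_samples=64, horizon=32):
    best_traj, best_score = None, -np.inf
    for _ in range(n_samples):
        z = np.random.randn(horizon, model.latent_dim)  # sample a latent plan
        traj = model.decode(state, z)                   # decode into a trajectory
        s = score_fn(traj)                              # task-specific return
        if s > best_score:
            best_traj, best_score = traj, s
    return best_traj[0]  # execute only the first step, then replan
```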
arXiv Detail & Related papers (2023-12-05T11:40:24Z)
- DTC: Deep Tracking Control [16.2850135844455]
We propose a hybrid control architecture that combines the advantages of model-based optimization and learning-based control to achieve greater robustness, foot-placement accuracy, and terrain generalization.
A deep neural network policy is trained in simulation, aiming to track the optimized footholds.
We demonstrate superior robustness in the presence of slippery or deformable ground when compared to model-based counterparts.
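The hybrid architecture in a few lines: a model-based optimizer proposes footholds, and a learned policy outputs actions that track them. All names are illustrative assumptions about the architecture described above.

```python
# Hybrid control step: model-based foothold planning feeds a learned tracker.
def control_step(foothold_optimizer, tracking_policy, obs, terrain_map):
    footholds = foothold_optimizer.solve(obs, terrain_map)  # model-based plan
    joint_targets = tracking_policy.act(obs, footholds)     # learned tracking
    return joint_targets
```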
arXiv Detail & Related papers (2023-09-27T07:57:37Z)
- Provable Guarantees for Generative Behavior Cloning: Bridging Low-Level Stability and High-Level Behavior [51.60683890503293]
We propose a theoretical framework for studying behavior cloning of complex expert demonstrations using generative modeling.
We show that pure supervised cloning can generate trajectories matching the per-time-step distribution of arbitrary expert trajectories.
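For reference, the supervised-cloning objective such an analysis starts from is standard maximum-likelihood regression onto expert state-action pairs (a generic statement, not the paper's theorem):

```latex
\[
\hat{\pi} \;=\; \arg\min_{\pi \in \Pi} \;
\mathbb{E}_{(s_t,\, a_t) \sim \mathcal{D}_{\text{expert}}}
\left[ -\log \pi(a_t \mid s_t) \right]
\]
```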
arXiv Detail & Related papers (2023-07-27T04:27:26Z)
- Jointformer: Single-Frame Lifting Transformer with Error Prediction and Refinement for 3D Human Pose Estimation [11.592567773739407]
3D human pose estimation technologies have the potential to greatly increase the availability of human movement data.
The best-performing models for single-image 2D-3D lifting use graph convolutional networks (GCNs) that typically require some manual input to define the relationships between different body joints.
We propose a novel transformer-based approach that uses the more generalised self-attention mechanism to learn these relationships.
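A minimal self-attention lifting module in the spirit described: per-joint 2D tokens attend to one another, so joint relationships are learned rather than fixed by a hand-designed graph. Hyperparameters and structure are illustrative assumptions, not the Jointformer code.

```python
# Single-frame 2D-to-3D lifting with self-attention over joint tokens.
import torch
import torch.nn as nn

class LiftingTransformer(nn.Module):
    def __init__(self, n_joints=17, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(2, d_model)  # each 2D joint becomes a token
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 3)   # per-joint 3D coordinates

    def forward(self, joints_2d):           # (B, n_joints, 2)
        x = self.embed(joints_2d)
        x = self.encoder(x)                 # joint relations learned by attention
        return self.head(x)                 # (B, n_joints, 3)
```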
arXiv Detail & Related papers (2022-08-07T12:07:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.