TrajBooster: Boosting Humanoid Whole-Body Manipulation via Trajectory-Centric Learning
- URL: http://arxiv.org/abs/2509.11839v2
- Date: Wed, 17 Sep 2025 03:11:49 GMT
- Title: TrajBooster: Boosting Humanoid Whole-Body Manipulation via Trajectory-Centric Learning
- Authors: Jiacheng Liu, Pengxiang Ding, Qihang Zhou, Yuxuan Wu, Da Huang, Zimian Peng, Wei Xiao, Weinan Zhang, Lixin Yang, Cewu Lu, Donglin Wang
- Abstract summary: We present TrajBooster, a cross-embodiment framework that leverages abundant wheeled-humanoid data to boost bipedal VLA. Our key idea is to use end-effector trajectories as a morphology-agnostic interface. Results show that TrajBooster allows existing wheeled-humanoid data to efficiently strengthen bipedal humanoid VLA performance.
- Score: 79.59753528758361
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent Vision-Language-Action models show potential to generalize across embodiments but struggle to quickly align with a new robot's action space when high-quality demonstrations are scarce, especially for bipedal humanoids. We present TrajBooster, a cross-embodiment framework that leverages abundant wheeled-humanoid data to boost bipedal VLA. Our key idea is to use end-effector trajectories as a morphology-agnostic interface. TrajBooster (i) extracts 6D dual-arm end-effector trajectories from real-world wheeled humanoids, (ii) retargets them in simulation to Unitree G1 with a whole-body controller trained via a heuristic-enhanced harmonized online DAgger to lift low-dimensional trajectory references into feasible high-dimensional whole-body actions, and (iii) forms heterogeneous triplets that couple source vision/language with target humanoid-compatible actions to post-pre-train a VLA, followed by only 10 minutes of teleoperation data collection on the target humanoid domain. Deployed on Unitree G1, our policy achieves beyond-tabletop household tasks, enabling squatting, cross-height manipulation, and coordinated whole-body motion with markedly improved robustness and generalization. Results show that TrajBooster allows existing wheeled-humanoid data to efficiently strengthen bipedal humanoid VLA performance, reducing reliance on costly same-embodiment data while enhancing action space understanding and zero-shot skill transfer capabilities. For more details, please refer to our project page: https://jiachengliu3.github.io/TrajBooster/.
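To make step (ii) more concrete, below is a minimal sketch of a plain online-DAgger loop of the kind the whole-body controller training builds on. The `sim`, `expert`, and `policy` interfaces are hypothetical placeholders, and the paper's heuristic-enhanced harmonized variant adds machinery not shown here.

```python
# Minimal online-DAgger-style loop for lifting low-dimensional end-effector
# trajectory references into whole-body actions. All interfaces (sim, expert,
# policy) are assumed placeholders, not the paper's actual code.
import numpy as np

def online_dagger(policy, expert, sim, n_iters=100, horizon=500, beta0=0.9):
    dataset = []  # aggregated (observation, expert_action) pairs
    for it in range(n_iters):
        beta = beta0 ** it  # anneal the expert mixing rate over iterations
        obs = sim.reset()
        for t in range(horizon):
            a_expert = expert.act(obs)       # heuristic/expert label
            a_policy = policy.act(obs)       # current learner action
            dataset.append((obs, a_expert))  # always label with the expert
            a = a_expert if np.random.rand() < beta else a_policy
            obs = sim.step(a)                # roll out the mixed action
        policy.fit(dataset)                  # supervised fit on the aggregate
    return policy
```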
Related papers
- ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation [55.467742403416175]
We introduce a physics-driven neural algorithm that translates large-scale motion capture to humanoid embodiments. We learn a unified multimodal controller that supports both dense references and sparse task specifications. Results show that ULTRA generalizes to autonomous, goal-conditioned whole-body loco-manipulation from egocentric perception.
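A rough sketch of what a unified controller accepting either dense motion references or sparse task specifications could look like; the module names and zero-filled conditioning are assumptions for illustration, not ULTRA's actual architecture.

```python
# Illustrative controller conditioned on either a dense motion reference or a
# sparse task specification, with absent modalities contributing nothing.
import torch
import torch.nn as nn

class MultimodalController(nn.Module):
    def __init__(self, obs_dim, ref_dim, goal_dim, act_dim, hidden=256):
        super().__init__()
        self.ref_enc = nn.Linear(ref_dim, hidden)
        self.goal_enc = nn.Linear(goal_dim, hidden)
        self.policy = nn.Sequential(
            nn.Linear(obs_dim + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim))

    def forward(self, obs, ref=None, goal=None):
        # Encode whichever conditioning signal is present; zeros if missing.
        cond = torch.zeros(obs.shape[0], self.ref_enc.out_features,
                           device=obs.device)
        if ref is not None:
            cond = cond + self.ref_enc(ref)    # dense reference branch
        if goal is not None:
            cond = cond + self.goal_enc(goal)  # sparse task-spec branch
        return self.policy(torch.cat([obs, cond], dim=-1))
```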
arXiv Detail & Related papers (2026-03-03T18:59:29Z)
- ResMimic: From General Motion Tracking to Humanoid Whole-body Loco-Manipulation via Residual Learning [59.64325421657381]
Humanoid whole-body loco-manipulation promises transformative capabilities for daily service and warehouse tasks. We introduce ResMimic, a two-stage residual learning framework for precise and expressive humanoid control from human motion data. Results show substantial gains in task success, training efficiency, and robustness over strong baselines.
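The two-stage residual idea reads as a frozen general tracking policy plus a learned corrective term. A minimal sketch, with all interfaces assumed:

```python
# Residual-learning composition: a frozen general motion tracker provides a
# base action, and a task-specific residual policy refines it. Names are
# illustrative assumptions, not ResMimic's actual code.
import torch

@torch.no_grad()
def base_action(base_policy, obs):
    return base_policy(obs)  # stage 1: frozen general tracking policy

def act(base_policy, residual_policy, obs):
    a_base = base_action(base_policy, obs)
    a_res = residual_policy(obs, a_base)  # stage 2: learned correction
    return a_base + a_res                 # residual composition
```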
arXiv Detail & Related papers (2025-10-06T17:47:02Z)
- OmniRetarget: Interaction-Preserving Data Generation for Humanoid Whole-Body Loco-Manipulation and Scene Interaction [76.44108003274955]
A dominant paradigm for teaching humanoid robots complex skills is to retarget human motions as kinematic references to train reinforcement learning policies. We introduce OmniRetarget, an interaction-preserving data generation engine based on an interaction mesh. By minimizing the Laplacian deformation between the human and robot meshes, OmniRetarget generates kinematically feasible trajectories.
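The Laplacian-deformation objective can be sketched with a uniform-weight mesh Laplacian: each vertex's offset from its neighbors' centroid should be preserved when mapping human-mesh vertices to robot-mesh vertices. Function and argument names here are illustrative assumptions, not the engine's actual code.

```python
# Uniform-weight Laplacian deformation energy between two interaction meshes.
import numpy as np

def laplacian_coords(V, neighbors):
    """V: (n, 3) vertex positions; neighbors: list of neighbor-index lists."""
    L = np.zeros_like(V)
    for i, nbrs in enumerate(neighbors):
        L[i] = V[i] - V[nbrs].mean(axis=0)  # offset from neighbor centroid
    return L

def deformation_energy(V_human, V_robot, neighbors):
    d = laplacian_coords(V_human, neighbors) - laplacian_coords(V_robot, neighbors)
    return float((d ** 2).sum())  # quantity a retargeting optimizer would minimize
```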
arXiv Detail & Related papers (2025-09-30T17:59:02Z)
- RobotDancing: Residual-Action Reinforcement Learning Enables Robust Long-Horizon Humanoid Motion Tracking [50.200035833530876]
RobotDancing is a simple, scalable framework that predicts residual joint targets to explicitly correct dynamics discrepancies. It can track multi-minute, high-energy behaviors (jumps, spins, cartwheels) and deploys zero-shot to hardware with high motion tracking quality.
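The residual joint-target idea fits in one function: a bounded correction added to the reference targets. The `policy` interface and scaling are hypothetical:

```python
# Residual-action sketch: the policy outputs small corrections to reference
# joint targets rather than absolute targets. Illustrative interface only.
import numpy as np

def corrected_targets(q_ref, policy, obs, scale=0.1):
    delta = policy.act(obs)  # residual in joint space
    return q_ref + scale * np.clip(delta, -1.0, 1.0)  # bounded correction
```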
arXiv Detail & Related papers (2025-09-25T03:30:34Z)
- Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos [66.62109400603394]
We introduce Being-H0, a dexterous Vision-Language-Action model trained on large-scale human videos. Our approach centers on physical instruction tuning, a novel training paradigm that combines large-scale VLA pretraining from human videos, physical space alignment for 3D reasoning, and post-training adaptation for robotic tasks. We empirically demonstrate the strength of Being-H0 in hand motion generation and instruction following, and show that it scales well with model and data sizes.
arXiv Detail & Related papers (2025-07-21T13:19:09Z)
- Conditioning Matters: Training Diffusion Policies is Faster Than You Think [69.31534053485711]
Diffusion policies have emerged as a mainstream paradigm for building vision-language-action (VLA) models. We identify a fundamental challenge in conditional diffusion policy training: when generative conditions are hard to distinguish, the training objective degenerates into modeling the marginal action distribution. We propose Cocos, a solution that modifies the source distribution in conditional flow matching to be condition-dependent.
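A minimal sketch of one flow-matching training step with a condition-dependent source distribution, which is the modification the summary describes. Here `cond_embed` and `velocity_net` are assumed modules, and a Gaussian centered on the condition embedding is one plausible instantiation rather than necessarily the paper's exact choice.

```python
# Conditional flow-matching step where the source sample x0 depends on the
# condition c instead of being drawn from a standard Gaussian.
import torch

def cocos_style_loss(velocity_net, cond_embed, x1, c, sigma=0.5):
    x0 = cond_embed(c) + sigma * torch.randn_like(x1)  # condition-dependent source
    t = torch.rand(x1.shape[0], 1)            # random interpolation times
    x_t = (1 - t) * x0 + t * x1               # linear probability path
    v_target = x1 - x0                        # target velocity field
    v_pred = velocity_net(x_t, t, c)
    return ((v_pred - v_target) ** 2).mean()  # flow-matching regression loss
```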
arXiv Detail & Related papers (2025-05-16T11:14:22Z)
- Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method that efficiently fine-tunes pretrained weights while enhancing robustness and generalization.
A self-regularization strategy is further employed to maintain the stability of the zero-shot generalization of VLMs; the method is dubbed OrthSR.
For the first time, we revisit CLIP and CoOp with our method to effectively improve the models in the few-shot image classification scenario.
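A generic orthogonal fine-tuning sketch via the Cayley transform, which keeps the adapted weight an isometric rotation of the pretrained one (preserving norms and pairwise angles); this illustrates the general technique, not OrthSR's exact parameterization:

```python
# Orthogonal fine-tuning: rotate a frozen pretrained weight W with a learned
# orthogonal matrix Q parameterized by the Cayley transform.
import torch
import torch.nn as nn

class OrthogonalAdapter(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.A = nn.Parameter(torch.zeros(dim, dim))  # Q is the identity at init

    def forward(self, W):
        S = self.A - self.A.T                 # skew-symmetric matrix
        I = torch.eye(W.shape[0], device=W.device)
        Q = torch.linalg.solve(I + S, I - S)  # Cayley transform: Q is orthogonal
        return Q @ W                          # rotated pretrained weight
```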
arXiv Detail & Related papers (2024-07-11T10:35:53Z)
- H-GAP: Humanoid Control with a Generalist Planner [45.50995825122686]
Humanoid Generalist Autoencoding Planner (H-GAP) is a generative model trained on humanoid trajectories derived from human motion-capture data.
For a 56-degrees-of-freedom humanoid, we empirically demonstrate that H-GAP learns to represent and generate a wide range of motor behaviours.
We also conduct a series of empirical studies on the scaling properties of H-GAP, showing the potential for performance gains from additional data, but not from additional compute.
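A planning-by-sampling sketch over a generative trajectory model of this kind: decode candidate latent sequences into trajectories, score them, and execute the first step of the best one. The `model.decode`, `latent_dim`, and `score_fn` interfaces are assumed placeholders, not H-GAP's actual API.

```python
# Receding-horizon planning with a generative trajectory model.
import numpy as np

def plan(model, score_fn, state, n_samples=64, horizon=32):
    best_traj, best_score = None, -np.inf
    for _ in range(n_samples):
        z = np.random.randn(horizon, model.latent_dim)  # sample a latent plan
        traj = model.decode(state, z)                   # decode into a trajectory
        s = score_fn(traj)                              # task-specific return
        if s > best_score:
            best_traj, best_score = traj, s
    return best_traj[0]  # execute only the first step, then replan
```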
arXiv Detail & Related papers (2023-12-05T11:40:24Z)
- DTC: Deep Tracking Control [16.2850135844455]
We propose a hybrid control architecture that combines the advantages of model-based optimization and learning-based control to achieve greater robustness, foot-placement accuracy, and terrain generalization.
A deep neural network policy is trained in simulation, aiming to track the optimized footholds.
We demonstrate superior robustness in the presence of slippery or deformable ground when compared to model-based counterparts.
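The hybrid architecture in a few lines: a model-based optimizer proposes footholds, and a learned policy outputs actions that track them. All names are illustrative assumptions about the architecture described above.

```python
# Hybrid control step: model-based foothold planning feeds a learned tracker.
def control_step(foothold_optimizer, tracking_policy, obs, terrain_map):
    footholds = foothold_optimizer.solve(obs, terrain_map)  # model-based plan
    joint_targets = tracking_policy.act(obs, footholds)     # learned tracking
    return joint_targets
```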
arXiv Detail & Related papers (2023-09-27T07:57:37Z)
- Provable Guarantees for Generative Behavior Cloning: Bridging Low-Level Stability and High-Level Behavior [51.60683890503293]
We propose a theoretical framework for studying behavior cloning of complex expert demonstrations using generative modeling.
We show that pure supervised cloning can generate trajectories matching the per-time-step distribution of arbitrary expert trajectories.
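For reference, the supervised-cloning objective such an analysis starts from is standard maximum-likelihood regression onto expert state-action pairs (a generic statement, not the paper's theorem):

```latex
\[
\hat{\pi} \;=\; \arg\min_{\pi \in \Pi} \;
\mathbb{E}_{(s_t,\, a_t) \sim \mathcal{D}_{\text{expert}}}
\left[ -\log \pi(a_t \mid s_t) \right]
\]
```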
arXiv Detail & Related papers (2023-07-27T04:27:26Z)
- Jointformer: Single-Frame Lifting Transformer with Error Prediction and Refinement for 3D Human Pose Estimation [11.592567773739407]
3D human pose estimation technologies have the potential to greatly increase the availability of human movement data.
The best-performing models for single-image 2D-3D lifting use graph convolutional networks (GCNs) that typically require some manual input to define the relationships between different body joints.
We propose a novel transformer-based approach that uses the more generalised self-attention mechanism to learn these relationships.
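A minimal self-attention lifting module in the spirit described: per-joint 2D tokens attend to one another, so joint relationships are learned rather than fixed by a hand-designed graph. Hyperparameters and structure are illustrative assumptions, not the Jointformer code.

```python
# Single-frame 2D-to-3D lifting with self-attention over joint tokens.
import torch
import torch.nn as nn

class LiftingTransformer(nn.Module):
    def __init__(self, n_joints=17, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(2, d_model)  # each 2D joint becomes a token
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 3)   # per-joint 3D coordinates

    def forward(self, joints_2d):           # (B, n_joints, 2)
        x = self.embed(joints_2d)
        x = self.encoder(x)                 # joint relations learned by attention
        return self.head(x)                 # (B, n_joints, 3)
```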
arXiv Detail & Related papers (2022-08-07T12:07:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.