H-RDT: Human Manipulation Enhanced Bimanual Robotic Manipulation
- URL: http://arxiv.org/abs/2507.23523v2
- Date: Fri, 01 Aug 2025 06:30:43 GMT
- Title: H-RDT: Human Manipulation Enhanced Bimanual Robotic Manipulation
- Authors: Hongzhe Bi, Lingxuan Wu, Tianwei Lin, Hengkai Tan, Zhizhong Su, Hang Su, Jun Zhu
- Abstract summary: H-RDT (Human to Robotics Diffusion Transformer) is a novel approach that leverages human manipulation data to enhance robot manipulation capabilities. Our key insight is that large-scale egocentric human manipulation videos with paired 3D hand pose annotations provide rich behavioral priors that capture natural manipulation strategies. We introduce a two-stage training paradigm: (1) pre-training on large-scale egocentric human manipulation data, and (2) cross-embodiment fine-tuning on robot-specific data with modular action encoders and decoders.
- Score: 27.585828712261232
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Imitation learning for robotic manipulation faces a fundamental challenge: the scarcity of large-scale, high-quality robot demonstration data. Recent robotic foundation models often pre-train on cross-embodiment robot datasets to increase data scale, but they face significant limitations because the diverse morphologies and action spaces across different robot embodiments make unified training challenging. In this paper, we present H-RDT (Human to Robotics Diffusion Transformer), a novel approach that leverages human manipulation data to enhance robot manipulation capabilities. Our key insight is that large-scale egocentric human manipulation videos with paired 3D hand pose annotations provide rich behavioral priors that capture natural manipulation strategies and can benefit robotic policy learning. We introduce a two-stage training paradigm: (1) pre-training on large-scale egocentric human manipulation data, and (2) cross-embodiment fine-tuning on robot-specific data with modular action encoders and decoders. Built on a diffusion transformer architecture with 2B parameters, H-RDT uses flow matching to model complex action distributions. Extensive evaluations encompassing both simulation and real-world experiments, single-task and multitask scenarios, as well as few-shot learning and robustness assessments, demonstrate that H-RDT outperforms training from scratch and existing state-of-the-art methods, including Pi0 and RDT, achieving significant improvements of 13.9% and 40.5% over training from scratch in simulation and real-world experiments, respectively. The results validate our core hypothesis that human manipulation data can serve as a powerful foundation for learning bimanual robotic manipulation policies.
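The abstract's description of the action head (a diffusion transformer trained with flow matching over action chunks) can be pictured with a short sketch. The snippet below is a minimal, hedged example of a flow-matching objective and Euler sampler; `policy_net`, the tensor shapes, and the sampler settings are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of a flow-matching objective for an action-chunk policy.
# `policy_net(x_t, t, obs_emb)` is an assumed interface returning a velocity
# with the same shape as the action chunk; it is not H-RDT's actual model.
import torch

def flow_matching_loss(policy_net, obs_emb, actions):
    """actions: (B, horizon, action_dim) ground-truth action chunk."""
    b = actions.shape[0]
    noise = torch.randn_like(actions)               # a_0 ~ N(0, I)
    t = torch.rand(b, 1, 1, device=actions.device)  # t ~ U(0, 1)
    x_t = (1 - t) * noise + t * actions             # point on the straight path a_0 -> a_1
    target_vel = actions - noise                    # constant velocity of that path
    pred_vel = policy_net(x_t, t.reshape(b), obs_emb)
    return ((pred_vel - target_vel) ** 2).mean()

@torch.no_grad()
def sample_actions(policy_net, obs_emb, horizon, action_dim, steps=10):
    """Euler integration of the learned velocity field from noise to an action chunk."""
    b = obs_emb.shape[0]
    x = torch.randn(b, horizon, action_dim, device=obs_emb.device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((b,), i * dt, device=obs_emb.device)
        x = x + dt * policy_net(x, t, obs_emb)
    return x
```

Under the two-stage paradigm described in the abstract, the same objective would apply in both stages, with the modular action encoder/decoder swapped between the human hand-pose action space and each robot's action space during fine-tuning.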
Related papers
- Is Diversity All You Need for Scalable Robotic Manipulation? [50.747150672933316]
We investigate the nuanced role of data diversity in robot learning by examining three critical dimensions-task (what to do), embodiment (which robot to use), and expert (who demonstrates)-challenging the conventional intuition of "more diverse is better". We show that task diversity proves more critical than per-task demonstration quantity, benefiting transfer from diverse pre-training tasks to novel downstream scenarios. We propose a distribution debiasing method to mitigate velocity ambiguity; the resulting GO-1-Pro achieves substantial performance gains of 15%, equivalent to using 2.5 times the pre-training data.
arXiv Detail & Related papers (2025-07-08T17:52:44Z)
- Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics [55.05920313034645]
We introduce Robot-R1, a novel framework that leverages reinforcement learning to enhance embodied reasoning specifically for robot control. Inspired by the DeepSeek-R1 learning approach, Robot-R1 samples reasoning-based responses and reinforces those that lead to more accurate predictions. Our experiments show that models trained with Robot-R1 outperform SFT methods on embodied reasoning tasks.
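The sample-and-reinforce loop summarized above can be realized in several ways; one simple, illustrative form is best-of-N selection followed by a supervised update on the retained responses. The sketch below uses hypothetical interfaces (`model.sample`, `model.finetune`, `reward_fn` are placeholders) and is not the Robot-R1 implementation, which may instead use policy-gradient-style updates.

```python
# Hedged sketch: sample several reasoning-based responses per prompt, keep the
# one whose downstream prediction scores best, and fine-tune on it.
def reinforce_accurate_responses(model, prompts, reward_fn, num_samples=8):
    kept = []
    for prompt in prompts:
        responses = [model.sample(prompt) for _ in range(num_samples)]
        scored = [(reward_fn(prompt, r), r) for r in responses]  # e.g. action-prediction accuracy
        best_reward, best_response = max(scored, key=lambda s: s[0])
        if best_reward > 0:  # only reinforce responses that actually improved predictions
            kept.append((prompt, best_response))
    model.finetune(kept)  # supervised update on the retained (prompt, response) pairs
    return kept
```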
arXiv Detail & Related papers (2025-05-29T16:41:12Z)
- VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation [53.63540587160549]
VidBot is a framework enabling zero-shot robotic manipulation using 3D affordances learned from in-the-wild, monocular, RGB-only human videos. VidBot paves the way for leveraging everyday human videos to make robot learning more scalable.
arXiv Detail & Related papers (2025-03-10T10:04:58Z)
- Human-Humanoid Robots Cross-Embodiment Behavior-Skill Transfer Using Decomposed Adversarial Learning from Demonstration [9.42179962375058]
We propose a transferable framework that reduces the data bottleneck by using a unified digital human model as a common prototype. The model learns behavior primitives from human demonstrations through adversarial imitation, and complex robot structures are decomposed into functional components. Our framework is validated on five humanoid robots with diverse configurations.
arXiv Detail & Related papers (2024-12-19T18:41:45Z)
- VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation [79.00294932026266]
VidMan is a novel framework that employs a two-stage training mechanism to enhance stability and improve data utilization efficiency.
Our framework outperforms the state-of-the-art baseline model GR-1 on the CALVIN benchmark, achieving an 11.7% relative improvement, and demonstrates over 9% precision gains on the OXE small-scale dataset.
arXiv Detail & Related papers (2024-11-14T03:13:26Z)
- Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Datasets [24.77850617214567]
We propose a foundation representation learning framework that captures both visual features and dynamics information, such as the actions and proprioceptive states of manipulation tasks.
Specifically, we pre-train a visual encoder on the DROID robotic dataset and leverage motion-relevant data such as robot proprioceptive states and actions.
We introduce a novel contrastive loss that aligns visual observations with the robot's proprioceptive state-action dynamics, combined with a behavior cloning (BC)-like actor loss to predict actions during pre-training, along with a time contrastive loss.
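The alignment between visual observations and proprioceptive state-action dynamics described above is commonly implemented as an InfoNCE-style contrastive loss; the sketch below is a generic, assumed form (encoder outputs and the temperature are placeholders), not the paper's released code.

```python
# Hedged sketch of an InfoNCE-style loss that aligns visual embeddings with
# proprioceptive state-action embeddings from the same timesteps.
import torch
import torch.nn.functional as F

def dynamics_alignment_loss(visual_emb, state_action_emb, temperature=0.07):
    """visual_emb, state_action_emb: (B, D) embeddings of matching timesteps."""
    v = F.normalize(visual_emb, dim=-1)
    s = F.normalize(state_action_emb, dim=-1)
    logits = v @ s.t() / temperature              # (B, B) cosine-similarity matrix
    labels = torch.arange(v.shape[0], device=v.device)
    # Symmetric InfoNCE: the matching (visual, state-action) pair is the positive.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```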
arXiv Detail & Related papers (2024-10-29T17:58:13Z)
- RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation [23.554917579133576]
We present Robotics Diffusion Transformer (RDT), a pioneering diffusion foundation model for bimanual manipulation. RDT builds on diffusion models to effectively represent multi-modality, with innovative designs of a scalable Transformer. We further introduce a Physically Interpretable Unified Action Space, which can unify the action representations of various robots.
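One common way to realize such a unified action space is to map each embodiment's action dimensions into fixed, physically indexed slots of a shared vector and mask the unused entries; the slot layout and dimensionality below are illustrative assumptions, not RDT's published specification.

```python
# Hedged sketch: pack embodiment-specific actions into a fixed-size vector
# with named physical slots plus a validity mask. Layout is assumed.
import numpy as np

UNIFIED_DIM = 128  # assumed size of the shared action vector

SLOTS = {
    "right_arm_joint_pos": slice(0, 7),
    "right_gripper": slice(7, 8),
    "left_arm_joint_pos": slice(8, 15),
    "left_gripper": slice(15, 16),
}

def to_unified(action_parts):
    """action_parts: dict mapping slot name -> 1-D array of values for that slot."""
    vec = np.zeros(UNIFIED_DIM, dtype=np.float32)
    mask = np.zeros(UNIFIED_DIM, dtype=bool)
    for name, values in action_parts.items():
        vec[SLOTS[name]] = values
        mask[SLOTS[name]] = True
    return vec, mask  # the mask tells the policy which dimensions are real

# A single-arm robot fills only the right-arm slots; the rest stay masked out.
vec, mask = to_unified({
    "right_arm_joint_pos": np.zeros(7, dtype=np.float32),
    "right_gripper": np.array([1.0], dtype=np.float32),
})
```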
arXiv Detail & Related papers (2024-10-10T12:33:46Z)
- Mitigating the Human-Robot Domain Discrepancy in Visual Pre-training for Robotic Manipulation [16.809190349155525]
We propose a novel adaptation paradigm that leverages readily available paired human-robot video data to bridge the domain gap. Our method employs a human-robot contrastive alignment loss to align the semantics of human and robot videos, adapting pre-trained models to the robot domain in a parameter-efficient manner.
arXiv Detail & Related papers (2024-06-20T11:57:46Z)
- Human-Robot Skill Transfer with Enhanced Compliance via Dynamic Movement Primitives [1.7901837062462316]
We introduce a systematic method to extract dynamic features from human demonstrations and auto-tune the parameters of the Dynamic Movement Primitives framework.
Our method was implemented on an actual human-robot setup to extract human dynamic features and used to regenerate robot trajectories following both LfD and RL.
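For context, a discrete Dynamic Movement Primitive models each trajectory dimension as a critically damped spring-damper system driven by a learned forcing term. The rollout below is a generic, textbook-style sketch with standard gain defaults; the paper's contribution, auto-tuning such parameters from human dynamic features, is not reproduced here.

```python
# Generic 1-D discrete DMP rollout (Euler integration); the gains are standard
# defaults and are the kind of parameters the paper proposes to auto-tune.
import numpy as np

def dmp_rollout(y0, goal, weights, centers, widths, tau=1.0, dt=0.01,
                alpha_z=25.0, beta_z=6.25, alpha_x=1.0):
    y, z, x = float(y0), 0.0, 1.0
    traj = [y]
    for _ in range(int(tau / dt)):
        psi = np.exp(-widths * (x - centers) ** 2)                   # RBF basis activations
        forcing = x * (goal - y0) * (psi @ weights) / (psi.sum() + 1e-10)
        z += dt / tau * (alpha_z * (beta_z * (goal - y) - z) + forcing)
        y += dt / tau * z
        x += dt / tau * (-alpha_x * x)                               # canonical system decay
        traj.append(y)
    return np.array(traj)

# Example: with zero forcing weights this reduces to a smooth reach from 0 to 1.
traj = dmp_rollout(0.0, 1.0, np.zeros(10), np.linspace(1.0, 0.0, 10), np.full(10, 25.0))
```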
arXiv Detail & Related papers (2023-04-12T08:48:28Z)
- HERD: Continuous Human-to-Robot Evolution for Learning from Human Demonstration [57.045140028275036]
We show that manipulation skills can be transferred from a human to a robot through the use of micro-evolutionary reinforcement learning.
We propose an algorithm for multi-dimensional evolution path searching that allows joint optimization of both the robot evolution path and the policy.
arXiv Detail & Related papers (2022-12-08T15:56:13Z)
- Learning Predictive Models From Observation and Interaction [137.77887825854768]
Learning predictive models from interaction with the world allows an agent, such as a robot, to learn about how the world works.
However, learning a model that captures the dynamics of complex skills represents a major challenge.
We propose a method to augment the training set with observational data of other agents, such as humans.
arXiv Detail & Related papers (2019-12-30T01:10:41Z)