X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations
- URL: http://arxiv.org/abs/2511.04671v1
- Date: Thu, 06 Nov 2025 18:56:30 GMT
- Title: X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations
- Authors: Maximus A. Pace, Prithwish Dan, Chuanruo Ning, Atiksh Bhardwaj, Audrey Du, Edward W. Duan, Wei-Chiu Ma, Kushal Kedia,
- Abstract summary: X-Diffusion is a principled framework for training diffusion policies. It maximally leverages human data without learning dynamically infeasible motions. X-Diffusion achieves a 16% higher average success rate than the best baseline.
- Score: 12.375737659812344
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human videos can be recorded quickly and at scale, making them an appealing source of training data for robot learning. However, humans and robots differ fundamentally in embodiment, resulting in mismatched action execution. Direct kinematic retargeting of human hand motion can therefore produce actions that are physically infeasible for robots. Despite these low-level differences, human demonstrations provide valuable motion cues about how to manipulate and interact with objects. Our key idea is to exploit the forward diffusion process: as noise is added to actions, low-level execution differences fade while high-level task guidance is preserved. We present X-Diffusion, a principled framework for training diffusion policies that maximally leverages human data without learning dynamically infeasible motions. X-Diffusion first trains a classifier to predict whether a noisy action is executed by a human or robot. Then, a human action is incorporated into policy training only after adding sufficient noise such that the classifier cannot discern its embodiment. Actions consistent with robot execution supervise fine-grained denoising at low noise levels, while mismatched human actions provide only coarse guidance at higher noise levels. Our experiments show that naive co-training under execution mismatches degrades policy performance, while X-Diffusion consistently improves it. Across five manipulation tasks, X-Diffusion achieves a 16% higher average success rate than the best baseline. The project website is available at https://portal-cornell.github.io/X-Diffusion/.
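The abstract describes the core mechanism: an embodiment classifier trained to distinguish noisy human actions from noisy robot actions determines, per human action, the minimum noise level at which it may be used to supervise the diffusion policy. The sketch below illustrates that gating idea only; the class and function names (`EmbodimentClassifier`, `min_indistinguishable_step`), the MLP architecture, the DDPM-style noise schedule, and the near-chance tolerance `tol` are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' code): gate a human action by the noise level
# at which a human-vs-robot classifier can no longer tell its embodiment apart.
import torch
import torch.nn as nn

class EmbodimentClassifier(nn.Module):
    """Predicts a logit for "this noisy action came from a robot" given the noise level."""
    def __init__(self, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, noisy_action, t_frac):
        # t_frac: diffusion step normalized to [0, 1], shape (B,)
        x = torch.cat([noisy_action, t_frac[:, None]], dim=-1)
        return self.net(x).squeeze(-1)

def add_noise(action, t, alphas_cumprod):
    """Standard DDPM forward process q(a_t | a_0) (assumed; the paper may differ)."""
    noise = torch.randn_like(action)
    a_bar = alphas_cumprod[t][:, None]
    return a_bar.sqrt() * action + (1.0 - a_bar).sqrt() * noise

@torch.no_grad()
def min_indistinguishable_step(classifier, human_action, alphas_cumprod,
                               num_steps=100, tol=0.05):
    """Smallest noise step at which the classifier is near chance (|p - 0.5| < tol)
    on this human action; the action then only supervises denoising at t >= this step."""
    a = human_action[None]  # (1, action_dim)
    for t in range(num_steps):
        t_idx = torch.tensor([t])
        noisy = add_noise(a, t_idx, alphas_cumprod)
        p_robot = torch.sigmoid(classifier(noisy, t_idx.float() / num_steps))
        if (p_robot - 0.5).abs().item() < tol:
            return t
    return num_steps - 1  # never fooled: restrict to the coarsest noise level

# Illustrative usage; the classifier would first be trained with a binary
# cross-entropy loss on noisy actions labeled human vs. robot.
num_steps = 100
betas = torch.linspace(1e-4, 0.02, num_steps)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
clf = EmbodimentClassifier(action_dim=7)
t_min = min_indistinguishable_step(clf, torch.randn(7), alphas_cumprod, num_steps)
```

Under this reading, robot actions supervise all noise levels, while each human action contributes to the denoising loss only at steps at or above its per-action threshold, which matches the abstract's "coarse guidance at higher noise levels."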
Related papers
- Flow Policy Gradients for Robot Control [67.61978635211048]
Flow matching policy gradients can be made effective for training and fine-tuning more expressive policies. We show how policies can exploit the flow representation for exploration when training from scratch, as well as improved fine-tuning robustness over baselines.
arXiv Detail & Related papers (2026-02-02T18:56:49Z) - Mitty: Diffusion-based Human-to-Robot Video Generation [57.494785199352975]
We present Mitty, a Diffusion Transformer that enables video In-Context Learning for end-to-end Human2Robot video generation. Built on a pretrained video diffusion model, Mitty leverages strong visual-temporal priors to translate human demonstrations into robot-execution videos without action labels or intermediate abstractions. Experiments on Human2Robot and EPIC-Kitchens show that Mitty delivers state-of-the-art results, strong generalization to unseen environments, and new insights for scalable robot learning from human observations.
arXiv Detail & Related papers (2025-12-19T05:52:15Z) - ViPRA: Video Prediction for Robot Actions [33.310474967770894]
We present Video Prediction for Robot Actions (ViPRA), a framework that learns continuous robot control from actionless videos. Instead of directly predicting actions, we train a video-language model to predict both future visual observations and motion-centric latent actions. For downstream control, we introduce a chunked flow matching decoder that maps latent actions to robot-specific continuous action sequences.
arXiv Detail & Related papers (2025-11-11T01:33:03Z) - MimicDreamer: Aligning Human and Robot Demonstrations for Scalable VLA Training [40.45924128424013]
We propose MimicDreamer, a framework that turns low-cost human demonstrations into robot-usable supervision. For visual alignment, we propose H2R Aligner, a video diffusion model that generates high-fidelity robot demonstration videos. For viewpoint stabilization, EgoStabilizer is proposed, which canonicalizes egocentric videos via homography. For action alignment, we map human hand trajectories to the robot frame and apply a constrained inverse kinematics solver.
arXiv Detail & Related papers (2025-09-26T11:05:10Z) - DemoDiffusion: One-Shot Human Imitation using pre-trained Diffusion Policy [33.18108154271181]
We propose DemoDiffusion, a simple and scalable method for enabling robots to perform manipulation tasks in natural environments. Our approach is based on two key insights. First, the hand motion in a human demonstration provides a useful prior for the robot's end-effector trajectory. Second, while this retargeted motion captures the overall structure of the task, it may not align well with plausible robot actions in-context.
arXiv Detail & Related papers (2025-06-25T17:59:01Z) - One-Shot Imitation under Mismatched Execution [7.060120660671016]
Human demonstrations are a powerful way to program robots to do long-horizon manipulation tasks. However, translating these demonstrations into robot-executable actions presents significant challenges due to execution mismatches in movement styles and physical capabilities. We propose RHyME, a novel framework that automatically pairs human and robot trajectories using sequence-level optimal transport cost functions.
arXiv Detail & Related papers (2024-09-10T16:11:57Z) - Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training [69.54948297520612]
Learning a generalist embodied agent poses challenges, primarily stemming from the scarcity of action-labeled robotic datasets.
We introduce a novel framework to tackle these challenges, which leverages a unified discrete diffusion to combine generative pre-training on human videos and policy fine-tuning on a small number of action-labeled robot videos.
Our method generates high-fidelity future videos for planning and enhances the fine-tuned policies compared to previous state-of-the-art approaches.
arXiv Detail & Related papers (2024-02-22T09:48:47Z) - Learning Video-Conditioned Policies for Unseen Manipulation Tasks [83.2240629060453]
Video-conditioned Policy learning maps human demonstrations of previously unseen tasks to robot manipulation skills.
We learn our policy to generate appropriate actions given current scene observations and a video of the target task.
We validate our approach on a set of challenging multi-task robot manipulation environments and outperform the state of the art.
arXiv Detail & Related papers (2023-05-10T16:25:42Z) - Self-Improving Robots: End-to-End Autonomous Visuomotor Reinforcement Learning [54.636562516974884]
In imitation and reinforcement learning, the cost of human supervision limits the amount of data that robots can be trained on.
In this work, we propose MEDAL++, a novel design for self-improving robotic systems.
The robot autonomously practices the task by learning both to do and to undo it, while simultaneously inferring the reward function from the demonstrations.
arXiv Detail & Related papers (2023-03-02T18:51:38Z) - Zero-Shot Robot Manipulation from Passive Human Videos [59.193076151832145]
We develop a framework for extracting agent-agnostic action representations from human videos.
Our framework is based on predicting plausible human hand trajectories.
We deploy the trained model zero-shot for physical robot manipulation tasks.
arXiv Detail & Related papers (2023-02-03T21:39:52Z)