Motion Tracks: A Unified Representation for Human-Robot Transfer in Few-Shot Imitation Learning
- URL: http://arxiv.org/abs/2501.06994v1
- Date: Mon, 13 Jan 2025 01:01:44 GMT
- Title: Motion Tracks: A Unified Representation for Human-Robot Transfer in Few-Shot Imitation Learning
- Authors: Juntao Ren, Priya Sundaresan, Dorsa Sadigh, Sanjiban Choudhury, Jeannette Bohg
- Abstract summary: We propose to represent actions as short-horizon 2D trajectories on an image.
These actions, or motion tracks, capture the predicted direction of motion for either human hands or robot end-effectors.
We instantiate an IL policy called Motion Track Policy (MT-pi) which receives image observations and outputs motion tracks as actions.
- Score: 40.43176821917154
- Abstract: Teaching robots to autonomously complete everyday tasks remains a challenge. Imitation Learning (IL) is a powerful approach that imbues robots with skills via demonstrations, but is limited by the labor-intensive process of collecting teleoperated robot data. Human videos offer a scalable alternative, but it remains difficult to directly train IL policies from them due to the lack of robot action labels. To address this, we propose to represent actions as short-horizon 2D trajectories on an image. These actions, or motion tracks, capture the predicted direction of motion for either human hands or robot end-effectors. We instantiate an IL policy called Motion Track Policy (MT-pi) which receives image observations and outputs motion tracks as actions. By leveraging this unified, cross-embodiment action space, MT-pi completes tasks with high success given just minutes of human video and limited additional robot demonstrations. At test time, we predict motion tracks from two camera views, recovering 6DoF trajectories via multi-view synthesis. MT-pi achieves an average success rate of 86.5% across 4 real-world tasks, outperforming state-of-the-art IL baselines which do not leverage human data or our action space by 40%, and generalizes to scenarios seen only in human videos. Code and videos are available on our website https://portal-cornell.github.io/motion_track_policy/.
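The abstract's last step, turning motion tracks predicted in two camera views into a 3D trajectory, can be illustrated with standard two-view linear triangulation (DLT). This is only a sketch under stated assumptions, not the authors' multi-view synthesis procedure: the camera matrices `P1` and `P2`, the function names, and the synthetic track are all hypothetical, and the code recovers 3D positions only (the orientation part of a 6DoF trajectory is not covered).

```python
import numpy as np

def project(P, X):
    """Project a 3D point X with a 3x4 camera matrix P to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def triangulate_point(P1, P2, uv1, uv2):
    """Linear (DLT) triangulation of one 3D point from two pixel observations."""
    # Each observation contributes two linear constraints on the homogeneous point.
    A = np.stack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def triangulate_track(P1, P2, track1, track2):
    """Lift two matching 2D tracks (one per view) to a 3D waypoint sequence."""
    return np.array([triangulate_point(P1, P2, a, b)
                     for a, b in zip(track1, track2)])

# Hypothetical calibrated cameras (assumption: intrinsics and extrinsics are known).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.3], [0.0], [0.0]])])

# Synthetic short-horizon 3D motion, projected into both views to stand in
# for the 2D tracks a policy would predict.
track_3d = np.linspace([0.0, 0.0, 1.0], [0.1, 0.05, 1.2], num=5)
track_cam1 = np.array([project(P1, X) for X in track_3d])
track_cam2 = np.array([project(P2, X) for X in track_3d])

recovered = triangulate_track(P1, P2, track_cam1, track_cam2)
```

With noise-free tracks, `recovered` matches `track_3d` up to numerical precision. With real predicted tracks the two views will not agree exactly, so the result is a least-squares estimate.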
Related papers
- Robot See Robot Do: Imitating Articulated Object Manipulation with Monocular 4D Reconstruction [51.49400490437258]
This work develops a method for imitating articulated object manipulation from a single monocular RGB human demonstration.
We first propose 4D Differentiable Part Models (4D-DPM), a method for recovering 3D part motion from a monocular video.
Given this 4D reconstruction, the robot replicates object trajectories by planning bimanual arm motions that induce the demonstrated object part motion.
We evaluate 4D-DPM's 3D tracking accuracy on ground truth annotated 3D part trajectories and RSRD's physical execution performance on 9 objects across 10 trials each on a bimanual YuMi robot.
arXiv Detail & Related papers (2024-09-26T17:57:16Z)
- Whole-Body Teleoperation for Mobile Manipulation at Zero Added Cost [8.71539730969424]
MoMa-Teleop is a novel teleoperation method that infers end-effector motions from existing interfaces.
We demonstrate that our approach results in a significant reduction in task completion time across a variety of robots and tasks.
arXiv Detail & Related papers (2024-09-23T15:09:45Z)
- One-Shot Imitation under Mismatched Execution [7.060120660671016]
Human demonstrations as prompts are a powerful way to program robots to do long-horizon manipulation tasks.
However, translating these demonstrations into robot-executable actions presents significant challenges due to execution mismatches in movement styles and physical capabilities.
We propose RHyME, a novel framework that automatically aligns human and robot task executions using optimal transport costs.
arXiv Detail & Related papers (2024-09-10T16:11:57Z)
- Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation [65.46610405509338]
We seek to learn a generalizable goal-conditioned policy that enables zero-shot robot manipulation.
Our framework, Track2Act, predicts tracks of how points in an image should move in future time-steps based on a goal.
We show that this approach of combining scalably learned track prediction with a residual policy enables diverse generalizable robot manipulation.
arXiv Detail & Related papers (2024-05-02T17:56:55Z)
- Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers [36.497624484863785]
We introduce Vid2Robot, an end-to-end video-conditioned policy that takes human videos demonstrating manipulation tasks as input and produces robot actions.
Our model is trained with a large dataset of prompt video-robot trajectory pairs to learn unified representations of human and robot actions from videos.
We evaluate Vid2Robot on real-world robots and observe over 20% improvement over BC-Z when using human prompt videos.
arXiv Detail & Related papers (2024-03-19T17:47:37Z)
- Learning Human-to-Humanoid Real-Time Whole-Body Teleoperation [34.65637397405485]
We present Human to Humanoid (H2O), a framework that enables real-time whole-body teleoperation of a humanoid robot with only an RGB camera.
We train a robust real-time humanoid motion imitator in simulation using these refined motions and transfer it to the real humanoid robot in a zero-shot manner.
To the best of our knowledge, this is the first demonstration to achieve learning-based real-time whole-body humanoid teleoperation.
arXiv Detail & Related papers (2024-03-07T12:10:41Z)
- Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training [69.54948297520612]
Learning a generalist embodied agent poses challenges, primarily stemming from the scarcity of action-labeled robotic datasets.
We introduce a novel framework to tackle these challenges, which leverages a unified discrete diffusion to combine generative pre-training on human videos and policy fine-tuning on a small number of action-labeled robot videos.
Our method generates high-fidelity future videos for planning and enhances the fine-tuned policies compared to previous state-of-the-art approaches.
arXiv Detail & Related papers (2024-02-22T09:48:47Z)
- Learning Video-Conditioned Policies for Unseen Manipulation Tasks [83.2240629060453]
Video-conditioned Policy learning maps human demonstrations of previously unseen tasks to robot manipulation skills.
We train our policy to generate appropriate actions given current scene observations and a video of the target task.
We validate our approach on a set of challenging multi-task robot manipulation environments and outperform the state of the art.
arXiv Detail & Related papers (2023-05-10T16:25:42Z)
- Zero-Shot Robot Manipulation from Passive Human Videos [59.193076151832145]
We develop a framework for extracting agent-agnostic action representations from human videos.
Our framework is based on predicting plausible human hand trajectories.
We deploy the trained model zero-shot for physical robot manipulation tasks.
arXiv Detail & Related papers (2023-02-03T21:39:52Z)