Large-Scale Actionless Video Pre-Training via Discrete Diffusion for
Efficient Policy Learning
- URL: http://arxiv.org/abs/2402.14407v1
- Date: Thu, 22 Feb 2024 09:48:47 GMT
- Title: Large-Scale Actionless Video Pre-Training via Discrete Diffusion for
Efficient Policy Learning
- Authors: Haoran He, Chenjia Bai, Ling Pan, Weinan Zhang, Bin Zhao, Xuelong Li
- Abstract summary: We introduce a novel framework that combines generative pre-training on human videos and policy fine-tuning on action-labeled robot videos.
Our method generates high-fidelity future videos for planning and enhances the fine-tuned policies compared to previous state-of-the-art approaches.
- Score: 73.69573252516761
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning a generalist embodied agent capable of completing multiple tasks
poses challenges, primarily stemming from the scarcity of action-labeled
robotic datasets. In contrast, a vast amount of human videos exist, capturing
intricate tasks and interactions with the physical world. Promising prospects
arise for utilizing actionless human videos for pre-training and transferring
the knowledge to facilitate robot policy learning through limited robot
demonstrations. In this paper, we introduce a novel framework that leverages a
unified discrete diffusion to combine generative pre-training on human videos
and policy fine-tuning on a small number of action-labeled robot videos. We
start by compressing both human and robot videos into unified video tokens. In
the pre-training stage, we employ a discrete diffusion model with a
mask-and-replace diffusion strategy to predict future video tokens in the
latent space. In the fine-tuning stage, we harness the imagined future videos
to guide low-level action learning trained on a limited set of robot data.
Experiments demonstrate that our method generates high-fidelity future videos
for planning and enhances the fine-tuned policies compared to previous
state-of-the-art approaches with superior generalization ability. Our project
website is available at https://video-diff.github.io/.
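The abstract names a mask-and-replace diffusion strategy but does not spell out the transition rule. The following is a minimal sketch of one forward corruption step over discrete video tokens, assuming a formulation in which each unmasked token is masked with probability `gamma`, resampled uniformly from the codebook with probability `replace_prob`, and otherwise kept, with the mask token acting as an absorbing state. The names `MASK`, `gamma`, `replace_prob`, `mask_and_replace_step`, and `corrupt` are hypothetical and do not come from the paper, and the constant per-step probabilities stand in for whatever noise schedule the authors actually use.

```python
import random

MASK = -1  # hypothetical id for the [MASK] token, chosen outside the codebook range


def mask_and_replace_step(tokens, vocab_size, gamma, replace_prob, rng):
    """Apply one forward corruption step to a list of token ids.

    Assumed rule (not taken from the paper): an unmasked token becomes MASK
    with probability `gamma`, is resampled uniformly from the codebook with
    probability `replace_prob`, and is kept otherwise. MASK is absorbing.
    """
    if gamma < 0 or replace_prob < 0 or gamma + replace_prob > 1.0:
        raise ValueError("gamma and replace_prob must be >= 0 and sum to at most 1")
    out = []
    for tok in tokens:
        if tok == MASK:
            out.append(MASK)  # once masked, a token stays masked
            continue
        u = rng.random()
        if u < gamma:
            out.append(MASK)
        elif u < gamma + replace_prob:
            # Uniform resampling may return the original token again.
            out.append(rng.randrange(vocab_size))
        else:
            out.append(tok)
    return out


def corrupt(tokens, vocab_size, steps, gamma, replace_prob, seed=0):
    """Run `steps` corruption steps with a constant (assumed) schedule."""
    rng = random.Random(seed)
    for _ in range(steps):
        tokens = mask_and_replace_step(tokens, vocab_size, gamma, replace_prob, rng)
    return tokens
```

In a setup like this, a denoising network would be trained to recover the clean future-frame tokens from the corrupted sequence. The replacement branch exposes the network to wrong but valid tokens, so it cannot assume that every unmasked token is correct.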
Related papers
- Track2Act: Predicting Point Tracks from Internet Videos enables Diverse Zero-shot Robot Manipulation [65.46610405509338]
Track2Act predicts tracks of how points in an image should move in future time-steps based on a goal.
We use these 2D track predictions to infer a sequence of rigid transforms of the object to be manipulated, and obtain robot end-effector poses.
We show that this approach of combining scalably learned track prediction with a residual policy enables zero-shot robot manipulation.
arXiv Detail & Related papers (2024-05-02T17:56:55Z)
- Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers [36.497624484863785]
We introduce Vid2Robot, a novel end-to-end video-based learning framework for robots.
Given a video demonstration of a manipulation task and current visual observations, Vid2Robot directly produces robot actions.
This is achieved through a unified representation model trained on a large dataset of human video and robot trajectory.
arXiv Detail & Related papers (2024-03-19T17:47:37Z)
- Towards Generalizable Zero-Shot Manipulation via Translating Human Interaction Plans [58.27029676638521]
We show how passive human videos can serve as a rich source of data for learning such generalist robots.
We learn a human plan predictor that, given a current image of a scene and a goal image, predicts the future hand and object configurations.
We show that our learned system can perform over 16 manipulation skills that generalize to 40 objects.
arXiv Detail & Related papers (2023-12-01T18:54:12Z)
- Learning Video-Conditioned Policies for Unseen Manipulation Tasks [83.2240629060453]
Video-conditioned Policy learning maps human demonstrations of previously unseen tasks to robot manipulation skills.
We learn our policy to generate appropriate actions given current scene observations and a video of the target task.
We validate our approach on a set of challenging multi-task robot manipulation environments and outperform state of the art.
arXiv Detail & Related papers (2023-05-10T16:25:42Z)
- Affordances from Human Videos as a Versatile Representation for Robotics [31.248842798600606]
We train a visual affordance model that estimates where and how in the scene a human is likely to interact.
The structure of these behavioral affordances directly enables the robot to perform many complex tasks.
We show the efficacy of our approach, which we call VRB, across 4 real world environments, over 10 different tasks, and 2 robotic platforms operating in the wild.
arXiv Detail & Related papers (2023-04-17T17:59:34Z)
- Zero-Shot Robot Manipulation from Passive Human Videos [59.193076151832145]
We develop a framework for extracting agent-agnostic action representations from human videos.
Our framework is based on predicting plausible human hand trajectories.
We deploy the trained model zero-shot for physical robot manipulation tasks.
arXiv Detail & Related papers (2023-02-03T21:39:52Z)
- Learning Reward Functions for Robotic Manipulation by Observing Humans [92.30657414416527]
We use unlabeled videos of humans solving a wide range of manipulation tasks to learn a task-agnostic reward function for robotic manipulation policies.
The learned rewards are based on distances to a goal in an embedding space learned using a time-contrastive objective.
arXiv Detail & Related papers (2022-11-16T16:26:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.