Related papers: Dreamitate: Real-World Visuomotor Policy Learning via Video Generation

URL: http://arxiv.org/abs/2406.16862v1
Date: Mon, 24 Jun 2024 17:59:45 GMT
Title: Dreamitate: Real-World Visuomotor Policy Learning via Video Generation
Authors: Junbang Liang, Ruoshi Liu, Ege Ozguroglu, Sruthi Sudhakar, Achal Dave, Pavel Tokmakov, Shuran Song, Carl Vondrick
Abstract summary: We propose a visuomotor policy learning framework that fine-tunes a video diffusion model on human demonstrations of a given task. We generate an example of an execution of the task conditioned on images of a novel scene, and use this synthesized execution directly to control the robot.
Score: 49.03287909942888
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A key challenge in manipulation is learning a policy that can robustly generalize to diverse visual environments. A promising mechanism for learning robust policies is to leverage video generative models, which are pretrained on large-scale datasets of internet videos. In this paper, we propose a visuomotor policy learning framework that fine-tunes a video diffusion model on human demonstrations of a given task. At test time, we generate an example of an execution of the task conditioned on images of a novel scene, and use this synthesized execution directly to control the robot. Our key insight is that using common tools allows us to effortlessly bridge the embodiment gap between the human hand and the robot manipulator. We evaluate our approach on four tasks of increasing complexity and demonstrate that harnessing internet-scale generative models allows the learned policy to achieve a significantly higher degree of generalization than existing behavior cloning approaches.

Related papers

VideoVLA: Video Generators Can Be Generalizable Robot Manipulators [86.70243911696616]
Generalization in robot manipulation is essential for deploying robots in open-world environments. We present VideoVLA, a simple approach that explores the potential of transforming large video generation models into robotic VLA manipulators.
arXiv Detail & Related papers (2025-12-07T18:57:15Z)
UniCoD: Enhancing Robot Policy via Unified Continuous and Discrete Representation Learning [22.84748754972181]
Building generalist robot policies that can handle diverse tasks in open-ended environments is a central challenge in robotics. To leverage knowledge from large-scale pretraining, prior work has typically built generalist policies either on top of vision-language understanding models (VLMs) or generative models. Recent unified models of generation and understanding have demonstrated strong capabilities in both comprehension and generation through large-scale pretraining. We introduce UniCoD, which acquires the ability to dynamically model high-dimensional visual features through pretraining on over 1M internet-scale instructional manipulation videos.
arXiv Detail & Related papers (2025-10-12T14:54:19Z)
Grounding Robot Policies with Visuomotor Language Guidance [15.774237279917594]
We propose an agent-based framework for grounding robot policies to the current context. The proposed framework is composed of a set of conversational agents designed for specific roles. We demonstrate that our approach can effectively guide manipulation policies to achieve significantly higher success rates.
arXiv Detail & Related papers (2024-10-09T02:00:37Z)
Hand-Object Interaction Pretraining from Videos [77.92637809322231]
We learn general robot manipulation priors from 3D hand-object interaction trajectories. We do so by sharing both the human hand and the manipulated object in 3D space and human motions to robot actions. We empirically demonstrate that finetuning this policy, with both reinforcement learning (RL) and behavior cloning (BC), enables sample-efficient adaptation to downstream tasks and simultaneously improves robustness and generalizability compared to prior approaches.
arXiv Detail & Related papers (2024-09-12T17:59:07Z)
View-Invariant Policy Learning via Zero-Shot Novel View Synthesis [26.231630397802785]
We investigate how knowledge from large-scale visual data of the world may be used to address one axis of variation for generalizable manipulation: observational viewpoint. We study single-image novel view synthesis models, which learn 3D-aware scene-level priors by rendering images of the same scene from alternate camera viewpoints. For practical application to diverse robotic data, these models must operate zero-shot, performing view synthesis on unseen tasks and environments.
arXiv Detail & Related papers (2024-09-05T16:39:21Z)
Learning to Act from Actionless Videos through Dense Correspondences [87.1243107115642]
We present an approach to construct a video-based robot policy capable of reliably executing diverse tasks across different robots and environments. Our method leverages images as a task-agnostic representation, encoding both the state and action information, and text as a general representation for specifying robot goals. We demonstrate the efficacy of our approach in learning policies on table-top manipulation and navigation tasks.
arXiv Detail & Related papers (2023-10-12T17:59:23Z)
Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models. Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning. Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z)
Learning Video-Conditioned Policies for Unseen Manipulation Tasks [83.2240629060453]
Video-conditioned Policy learning maps human demonstrations of previously unseen tasks to robot manipulation skills. We train our policy to generate appropriate actions given current scene observations and a video of the target task. We validate our approach on a set of challenging multi-task robot manipulation environments and outperform the state of the art.
arXiv Detail & Related papers (2023-05-10T16:25:42Z)
Self-Supervised Learning of Multi-Object Keypoints for Robotic Manipulation [8.939008609565368]
In this paper, we demonstrate the efficacy of learning image keypoints via the Dense Correspondence pretext task for downstream policy learning. We evaluate our approach on diverse robot manipulation tasks, compare it to other visual representation learning approaches, and demonstrate its flexibility and effectiveness for sample-efficient policy learning.
arXiv Detail & Related papers (2022-05-17T13:15:07Z)
Visual Imitation Made Easy [102.36509665008732]
We present an alternate interface for imitation that simplifies the data collection process while allowing for easy transfer to robots. We use commercially available reacher-grabber assistive tools both as a data collection device and as the robot's end-effector. We experimentally evaluate on two challenging tasks: non-prehensile pushing and prehensile stacking, with 1000 diverse demonstrations for each task.
arXiv Detail & Related papers (2020-08-11T17:58:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.