Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations
- URL: http://arxiv.org/abs/2507.00990v2
- Date: Fri, 04 Jul 2025 14:35:12 GMT
- Title: Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations
- Authors: Shivansh Patel, Shraddhaa Mohan, Hanlin Mai, Unnat Jain, Svetlana Lazebnik, Yunzhu Li
- Abstract summary: RIGVid enables robots to perform complex manipulation tasks by imitating AI-generated videos. A video diffusion model generates potential demonstration videos, and a vision-language model automatically filters out results that do not follow the command. A 6D pose tracker then extracts object trajectories from the video, and the trajectories are retargeted to the robot in an embodiment-agnostic fashion.
- Score: 19.28925489415787
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work introduces Robots Imitating Generated Videos (RIGVid), a system that enables robots to perform complex manipulation tasks--such as pouring, wiping, and mixing--purely by imitating AI-generated videos, without requiring any physical demonstrations or robot-specific training. Given a language command and an initial scene image, a video diffusion model generates potential demonstration videos, and a vision-language model (VLM) automatically filters out results that do not follow the command. A 6D pose tracker then extracts object trajectories from the video, and the trajectories are retargeted to the robot in an embodiment-agnostic fashion. Through extensive real-world evaluations, we show that filtered generated videos are as effective as real demonstrations, and that performance improves with generation quality. We also show that relying on generated videos outperforms more compact alternatives such as keypoint prediction using VLMs, and that strong 6D pose tracking outperforms other ways to extract trajectories, such as dense feature point tracking. These findings suggest that videos produced by a state-of-the-art off-the-shelf model can offer an effective source of supervision for robotic manipulation.
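The abstract describes a four-stage pipeline: generate candidate videos from a language command and an initial scene image, filter them with a VLM, extract a 6D object trajectory from a kept video, and retarget that trajectory to the robot. Below is a minimal Python sketch of that control flow. Every function in it (generate_videos, vlm_follows_command, track_object_pose, retarget_to_robot) is a hypothetical placeholder standing in for the corresponding model, not the authors' released code.

```python
"""Illustrative sketch of the RIGVid pipeline described in the abstract.
All model interfaces below are hypothetical stand-ins, not the actual system."""

from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class PoseTrajectory:
    """Sequence of 6D object poses (4x4 homogeneous transforms), one per frame."""
    poses: List[np.ndarray]


def generate_videos(command: str, scene_image: np.ndarray, n: int = 5) -> List[np.ndarray]:
    """Placeholder for a video diffusion model conditioned on the command and scene image."""
    # Each "video" is a (T, H, W, 3) array; random noise stands in for real generations.
    return [np.random.rand(16, *scene_image.shape).astype(np.float32) for _ in range(n)]


def vlm_follows_command(video: np.ndarray, command: str) -> bool:
    """Placeholder for a vision-language model that checks whether the video follows the command."""
    return True  # a real VLM would inspect the frames before accepting the video


def track_object_pose(video: np.ndarray) -> PoseTrajectory:
    """Placeholder for a 6D pose tracker extracting the manipulated object's trajectory."""
    return PoseTrajectory(poses=[np.eye(4) for _ in range(video.shape[0])])


def retarget_to_robot(trajectory: PoseTrajectory) -> List[np.ndarray]:
    """Placeholder: map object poses to end-effector targets, independent of embodiment."""
    return trajectory.poses  # a real system might apply a fixed grasp offset here


def rigvid_pipeline(command: str, scene_image: np.ndarray) -> List[np.ndarray]:
    candidates = generate_videos(command, scene_image)
    # Keep only generations the VLM judges consistent with the language command.
    kept = [v for v in candidates if vlm_follows_command(v, command)]
    if not kept:
        raise RuntimeError("No generated video followed the command; regenerate.")
    trajectory = track_object_pose(kept[0])
    return retarget_to_robot(trajectory)


if __name__ == "__main__":
    scene = np.zeros((64, 64, 3), dtype=np.float32)
    waypoints = rigvid_pipeline("pour water from the cup into the bowl", scene)
    print(f"Retargeted {len(waypoints)} end-effector waypoints")
```

The point carried over from the abstract is that the robot never sees a physical demonstration: the only supervision is the object trajectory recovered from a filtered, generated video.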
Related papers
- ORV: 4D Occupancy-centric Robot Video Generation [33.360345403049685]
Acquiring real-world robotic simulation data through teleoperation is notoriously time-consuming and labor-intensive. We propose ORV, an Occupancy-centric Robot Video generation framework, which utilizes 4D semantic occupancy sequences as a fine-grained representation. By leveraging occupancy-based representations, ORV enables seamless translation of simulation data into photorealistic robot videos, while ensuring high temporal consistency and precise controllability.
arXiv Detail & Related papers (2025-06-03T17:00:32Z)
- VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation [53.63540587160549]
VidBot is a framework enabling zero-shot robotic manipulation using 3D affordances learned from in-the-wild monocular RGB-only human videos. VidBot paves the way for leveraging everyday human videos to make robot learning more scalable.
arXiv Detail & Related papers (2025-03-10T10:04:58Z)
- VILP: Imitation Learning with Latent Video Planning [19.25411361966752]
This paper introduces imitation learning with latent video planning (VILP). Our method is able to generate highly time-aligned videos from multiple views. Our paper provides a practical example of how to effectively integrate video generation models into robot policies.
arXiv Detail & Related papers (2025-02-03T19:55:57Z)
- VideoAgent: Self-Improving Video Generation [47.627088484395834]
Video generation has been used to generate visual plans for controlling robotic systems. A major bottleneck in leveraging video generation for control lies in the quality of the generated videos. We propose VideoAgent for self-improving generated video plans based on external feedback.
arXiv Detail & Related papers (2024-10-14T01:39:56Z)
- DiffGen: Robot Demonstration Generation via Differentiable Physics Simulation, Differentiable Rendering, and Vision-Language Model [72.66465487508556]
DiffGen is a novel framework that integrates differentiable physics simulation, differentiable rendering, and a vision-language model.
It can generate realistic robot demonstrations by minimizing the distance between the embedding of the language instruction and the embedding of the simulated observation.
Experiments demonstrate that with DiffGen, we could efficiently and effectively generate robot data with minimal human effort or training time.
arXiv Detail & Related papers (2024-05-12T15:38:17Z)
- Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation [65.46610405509338]
We seek to learn a generalizable goal-conditioned policy that enables zero-shot robot manipulation.
Our framework, Track2Act, predicts tracks of how points in an image should move in future time-steps based on a goal.
We show that this approach of combining scalably learned track prediction with a residual policy enables diverse generalizable robot manipulation.
arXiv Detail & Related papers (2024-05-02T17:56:55Z)
- Learning Video-Conditioned Policies for Unseen Manipulation Tasks [83.2240629060453]
Video-conditioned policy learning maps human demonstrations of previously unseen tasks to robot manipulation skills.
We learn our policy to generate appropriate actions given current scene observations and a video of the target task.
We validate our approach on a set of challenging multi-task robot manipulation environments and outperform the state of the art.
arXiv Detail & Related papers (2023-05-10T16:25:42Z)
- Zero-Shot Robot Manipulation from Passive Human Videos [59.193076151832145]
We develop a framework for extracting agent-agnostic action representations from human videos.
Our framework is based on predicting plausible human hand trajectories.
We deploy the trained model zero-shot for physical robot manipulation tasks.
arXiv Detail & Related papers (2023-02-03T21:39:52Z)