VideoAgent: Self-Improving Video Generation
- URL: http://arxiv.org/abs/2410.10076v3
- Date: Sun, 09 Feb 2025 05:57:42 GMT
- Title: VideoAgent: Self-Improving Video Generation
- Authors: Achint Soni, Sreyas Venkataraman, Abhranil Chandra, Sebastian Fischmeister, Percy Liang, Bo Dai, Sherry Yang
- Abstract summary: Video generation has been used to generate visual plans for controlling robotic systems. A major bottleneck in leveraging video generation for control lies in the quality of the generated videos. We propose VideoAgent for self-improving generated video plans based on external feedback.
- Score: 47.627088484395834
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video generation has been used to generate visual plans for controlling robotic systems. Given an image observation and a language instruction, previous work has generated video plans which are then converted to robot controls to be executed. However, a major bottleneck in leveraging video generation for control lies in the quality of the generated videos, which often suffer from hallucinatory content and unrealistic physics, resulting in low task success when control actions are extracted from the generated videos. While scaling up dataset and model size provides a partial solution, integrating external feedback is both natural and essential for grounding video generation in the real world. With this observation, we propose VideoAgent for self-improving generated video plans based on external feedback. Instead of directly executing the generated video plan, VideoAgent first refines the generated video plans using a novel procedure which we call self-conditioning consistency, allowing inference-time compute to be turned into better generated video plans. As the refined video plan is being executed, VideoAgent can collect additional data from the environment to further improve video plan generation. Experiments in simulated robotic manipulation from MetaWorld and iTHOR show that VideoAgent drastically reduces hallucination, thereby boosting the success rate of downstream manipulation tasks. We further illustrate that VideoAgent can effectively refine real-robot videos, providing an early indicator that robots can be an effective tool in grounding video generation in the physical world. Video demos and code can be found at https://video-as-agent.github.io.
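To make the two mechanisms in the abstract concrete, the following minimal Python sketch shows one way a refine-then-execute loop of this kind could be organized: inference-time refinement of the generated video plan by re-conditioning the generator on its own previous output plus external feedback (the self-conditioning consistency idea), followed by execution that collects real trajectories as additional training data. The model interfaces (generate, critique, infer_action) and the gym-style env are assumptions for illustration only, not the paper's actual code (which is linked above).

```python
# Minimal sketch of a VideoAgent-style self-improvement loop.
# All model objects and method names below are illustrative assumptions,
# not the authors' actual API.

def refine_video_plan(video_model, feedback_model, image, instruction, steps=3):
    """Self-conditioning refinement: re-generate the plan conditioned on the
    previous plan plus external feedback, trading inference-time compute for
    a better video plan."""
    plan = video_model.generate(image, instruction)
    for _ in range(steps):
        feedback = feedback_model.critique(plan, instruction)
        plan = video_model.generate(image, instruction,
                                    prev_plan=plan, feedback=feedback)
    return plan


def execute_and_collect(env, controller, video_model, feedback_model,
                        image, instruction, replay_buffer):
    """Execute the refined plan and keep the real trajectory as extra training
    data, closing the self-improvement loop."""
    plan = refine_video_plan(video_model, feedback_model, image, instruction)
    trajectory = []
    for frame, next_frame in zip(plan[:-1], plan[1:]):
        action = controller.infer_action(frame, next_frame)  # inverse dynamics
        obs, reward, done, info = env.step(action)            # gym-style step
        trajectory.append((obs, action))
        if done:
            break
    replay_buffer.append((instruction, trajectory))
    return trajectory
```

The point this sketch illustrates is that the generator is conditioned on its own previous output, so each additional refinement iteration converts inference-time compute into a higher-quality plan before any action is executed.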
Related papers
- Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations [19.28925489415787]
RIGVid enables robots to perform complex manipulation tasks by imitating AI-generated videos. A video diffusion model generates potential demonstration videos, and a vision-language model automatically filters out results that do not follow the command. A 6D pose tracker then extracts object trajectories from the video, and the trajectories are retargeted to the robot in an embodiment-agnostic fashion.
arXiv Detail & Related papers (2025-07-01T17:39:59Z) - DreamGen: Unlocking Generalization in Robot Learning through Video World Models [120.25799361925387]
DreamGen is a pipeline for training robot policies that generalize across behaviors and environments through neural trajectories. Our work establishes a promising new axis for scaling robot learning well beyond manual data collection.
arXiv Detail & Related papers (2025-05-19T04:55:39Z) - Chain-of-Modality: Learning Manipulation Programs from Multimodal Human Videos with Vision-Language-Models [49.4824734958566]
Chain-of-Modality (CoM) enables Vision Language Models to reason about multimodal human demonstration data.
CoM refines a task plan and generates detailed control parameters, enabling robots to perform manipulation tasks based on a single multimodal human video prompt.
arXiv Detail & Related papers (2025-04-17T21:31:23Z) - VILP: Imitation Learning with Latent Video Planning [19.25411361966752]
This paper introduces imitation learning with latent video planning (VILP).
Our method is able to generate highly time-aligned videos from multiple views.
Our paper provides a practical example of how to effectively integrate video generation models into robot policies.
arXiv Detail & Related papers (2025-02-03T19:55:57Z) - Generative Video Propagation [87.15843701018099]
Our framework, GenProp, encodes the original video with a selective content encoder and propagates the changes made to the first frame using an image-to-video generation model.
Experiment results demonstrate the leading performance of our model in various video tasks.
arXiv Detail & Related papers (2024-12-27T17:42:29Z) - Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation [74.70013315714336]
Gen2Act casts language-conditioned manipulation as zero-shot human video generation followed by execution with a single policy conditioned on the generated video.
Our results on diverse real-world scenarios show how Gen2Act enables manipulating unseen object types and performing novel motions for tasks not present in the robot data.
arXiv Detail & Related papers (2024-09-24T17:57:33Z) - Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation [4.147294190096431]
We introduce an automatic synthetic video generation pipeline based on Vision Large Language Model (VLM) agent collaborations.
Given a natural language description of a video, multiple VLM agents auto-direct various processes of the generation pipeline.
Our generated videos outperform those of commercial video generation models on five metrics covering video quality and instruction-following performance.
arXiv Detail & Related papers (2024-08-19T23:31:02Z) - Vision-based Manipulation from Single Human Video with Open-World Object Graphs [58.23098483464538]
We present an object-centric approach to empower robots to learn vision-based manipulation skills from human videos.
We introduce ORION, an algorithm that tackles the problem by extracting an object-centric manipulation plan from a single RGB-D video.
arXiv Detail & Related papers (2024-05-30T17:56:54Z) - Reframe Anything: LLM Agent for Open World Video Reframing [0.8424099022563256]
We introduce Reframe Any Video Agent (RAVA), an AI-based agent that restructures visual content for video reframing.
RAVA operates in three stages: perception, where it interprets user instructions and video content; planning, where it determines aspect ratios and reframing strategies; and execution, where it invokes the editing tools to produce the final video.
Our experiments validate the effectiveness of RAVA in video salient object detection and real-world reframing tasks, demonstrating its potential as a tool for AI-powered video editing.
arXiv Detail & Related papers (2024-03-10T03:29:56Z) - Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training [69.54948297520612]
Learning a generalist embodied agent poses challenges, primarily stemming from the scarcity of action-labeled robotic datasets.
We introduce a novel framework to tackle these challenges, which leverages a unified discrete diffusion to combine generative pre-training on human videos and policy fine-tuning on a small number of action-labeled robot videos.
Our method generates high-fidelity future videos for planning and enhances the fine-tuned policies compared to previous state-of-the-art approaches.
arXiv Detail & Related papers (2024-02-22T09:48:47Z) - Video Language Planning [137.06052217713054]
Video language planning is an algorithm that consists of a tree search procedure, where we train (i) vision-language models to serve as both policies and value functions, and (ii) text-to-video models as dynamics models (a minimal sketch of this search loop appears after this list).
Our algorithm produces detailed multimodal (video and language) specifications that describe how to complete the final task.
It substantially improves long-horizon task success rates compared to prior methods on both simulated and real robots.
arXiv Detail & Related papers (2023-10-16T17:48:45Z) - Style-transfer based Speech and Audio-visual Scene Understanding for Robot Action Sequence Acquisition from Videos [40.012813353904875]
This paper introduces a method for robot action sequence generation from instruction videos.
We built a system that accomplishes various cooking actions, where an arm robot executes a sequence acquired from a cooking video.
arXiv Detail & Related papers (2023-06-27T17:37:53Z) - Probabilistic Adaptation of Text-to-Video Models [181.84311524681536]
Video Adapter is capable of incorporating the broad knowledge and preserving the high fidelity of a large pretrained video model in a task-specific small video model.
Video Adapter is able to generate high-quality yet specialized videos on a variety of tasks such as animation, egocentric modeling, and modeling of simulated and real-world robotics data.
arXiv Detail & Related papers (2023-06-02T19:00:17Z)
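The Video Language Planning entry above combines vision-language models acting as policy and value functions with a text-to-video dynamics model inside a tree search. The sketch below shows one way such a search could be structured as a small beam search over imagined rollouts; the propose/value/rollout interfaces are hypothetical placeholders, not that paper's API.

```python
# Illustrative beam-style tree search over imagined video rollouts, loosely
# following the Video Language Planning description above. All model
# interfaces here are assumed for illustration.
import heapq


def vlp_tree_search(initial_obs, task, vlm, t2v_model, branching=3, depth=2):
    """Search over language subgoals scored by a VLM value function.

    Assumed interfaces:
      vlm.propose(obs, task, k)    -> k candidate language subgoals
      vlm.value(video, task)       -> scalar estimate of task progress
      t2v_model.rollout(obs, goal) -> generated video ending in a new observation
    """
    # Each beam entry: (negative value, plan-so-far, latest observation)
    beam = [(0.0, [], initial_obs)]
    for _ in range(depth):
        candidates = []
        for neg_val, plan, obs in beam:
            for subgoal in vlm.propose(obs, task, k=branching):
                video = t2v_model.rollout(obs, subgoal)  # imagined dynamics
                score = vlm.value(video, task)           # value estimate
                candidates.append((-score, plan + [(subgoal, video)], video[-1]))
        # Keep the highest-value partial plans (smallest negative value).
        beam = heapq.nsmallest(branching, candidates, key=lambda c: c[0])
    best = min(beam, key=lambda c: c[0])
    return best[1]  # list of (language subgoal, generated video) pairs
```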
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.