RoboEnvision: A Long-Horizon Video Generation Model for Multi-Task Robot Manipulation
- URL: http://arxiv.org/abs/2506.22007v1
- Date: Fri, 27 Jun 2025 08:21:55 GMT
- Title: RoboEnvision: A Long-Horizon Video Generation Model for Multi-Task Robot Manipulation
- Authors: Liudi Yang, Yang Bai, George Eskandar, Fengyi Shen, Mohammad Altillawi, Dong Chen, Soumajit Majumder, Ziyuan Liu, Gitta Kutyniok, Abhinav Valada
- Abstract summary: We address the problem of generating long-horizon videos for robotic manipulation tasks. We propose a novel pipeline that bypasses the need for autoregressive generation. Our approach achieves state-of-the-art results on two benchmarks in video quality and consistency.
- Score: 30.252593687028767
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We address the problem of generating long-horizon videos for robotic manipulation tasks. Text-to-video diffusion models have made significant progress in photorealism, language understanding, and motion generation but struggle with long-horizon robotic tasks. Recent works use video diffusion models for high-quality simulation data and predictive rollouts in robot planning. However, these works predict short sequences of the robot achieving one task and employ an autoregressive paradigm to extend to the long horizon, leading to error accumulations in the generated video and in the execution. To overcome these limitations, we propose a novel pipeline that bypasses the need for autoregressive generation. We achieve this through a threefold contribution: 1) we first decompose the high-level goals into smaller atomic tasks and generate keyframes aligned with these instructions. A second diffusion model then interpolates between each of the two generated frames, achieving the long-horizon video. 2) We propose a semantics preserving attention module to maintain consistency between the keyframes. 3) We design a lightweight policy model to regress the robot joint states from generated videos. Our approach achieves state-of-the-art results on two benchmarks in video quality and consistency while outperforming previous policy models on long-horizon tasks.
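The threefold pipeline in the abstract (task decomposition, keyframe generation, interpolation, and joint-state regression) can be sketched in code. The following is a hypothetical, minimal sketch: the three model calls are toy numpy stand-ins, not the authors' diffusion or policy networks, and all function names are illustrative assumptions.

```python
# Hypothetical sketch of the RoboEnvision-style pipeline: decompose a goal
# into atomic tasks, generate one keyframe per task, interpolate between
# consecutive keyframes, then regress joint states per frame.
# All models below are toy numpy stand-ins, not the paper's networks.
import numpy as np

def decompose_goal(goal: str) -> list[str]:
    # Stand-in for the high-level decomposition: split on commas.
    return [step.strip() for step in goal.split(",")]

def generate_keyframes(instructions: list[str], h: int = 8, w: int = 8):
    # Stand-in for the keyframe diffusion model: one random frame per task.
    rng = np.random.default_rng(0)
    return [rng.random((h, w, 3)) for _ in instructions]

def interpolate(frame_a, frame_b, n: int = 4):
    # Stand-in for the interpolation diffusion model: linear blend.
    return [(1 - t) * frame_a + t * frame_b
            for t in np.linspace(0, 1, n, endpoint=False)]

def policy(frame):
    # Stand-in for the lightweight policy: regress 7 joint states per frame.
    return frame.mean(axis=(0, 1)).repeat(3)[:7]

def long_horizon_video(goal: str):
    keyframes = generate_keyframes(decompose_goal(goal))
    video = []
    for a, b in zip(keyframes, keyframes[1:]):
        video.extend(interpolate(a, b))
    video.append(keyframes[-1])
    return video, [policy(f) for f in video]

video, joints = long_horizon_video(
    "pick up the cube, place it in the bowl, close the drawer")
print(len(video), joints[0].shape)  # 3 tasks -> 2 segments of 4 frames + final keyframe
```

Because every frame is anchored to a pre-generated keyframe rather than to the previous rollout, errors do not compound across segments, which is the point of avoiding the autoregressive paradigm.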
Related papers
- Plenoptic Video Generation [80.3116444692858]
We introduce PlenopticDreamer, a framework that synchronizes generative hallucinations to maintain spatio-temporal memory. The core idea is to train a multi-in-out video-conditioned model in an autoregressive manner. Our training incorporates context scaling to improve convergence, self-conditioning to mitigate hallucinations caused by error accumulation, and a long-video conditioning mechanism to support extended video generation.
arXiv Detail & Related papers (2026-01-08T18:58:32Z) - Large Video Planner Enables Generalizable Robot Control [117.49024534548319]
General-purpose robots require decision-making models that generalize across diverse tasks and environments. Recent works build robot foundation models by extending multimodal large language models (MLLMs) with action outputs, creating vision-language-action (VLA) systems. We explore an alternative paradigm of using large-scale video pretraining as a primary modality for building robot foundation models.
arXiv Detail & Related papers (2025-12-17T18:35:54Z) - DRAW2ACT: Turning Depth-Encoded Trajectories into Robotic Demonstration Videos [24.681248200255975]
Video models provide powerful real-world simulators for embodied AI but remain limited in controllability for robotic manipulation. We present DRAW2ACT, a trajectory-conditioned video generation framework that extracts multiple representations from the input trajectory. We show that DRAW2ACT achieves superior visual fidelity and consistency while yielding higher manipulation success rates compared to existing baselines.
arXiv Detail & Related papers (2025-12-16T09:11:36Z) - From Generated Human Videos to Physically Plausible Robot Trajectories [103.28274349461607]
Video generation models are rapidly improving in their ability to synthesize human actions in novel contexts. To realize this potential, how can a humanoid execute the human actions from generated videos in a zero-shot manner? This challenge arises because generated videos are often noisy and exhibit morphological distortions that make direct imitation difficult compared to real video. We propose GenMimic, a physics-aware reinforcement learning policy conditioned on 3D keypoints, and trained with symmetry regularization and keypoint-weighted tracking rewards.
arXiv Detail & Related papers (2025-12-04T18:56:03Z) - RELIC: Interactive Video World Model with Long-Horizon Memory [74.81433479334821]
A truly interactive world model requires real-time long-horizon streaming, consistent spatial memory, and precise user control. We present RELIC, a unified framework that tackles these three challenges altogether. Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time.
arXiv Detail & Related papers (2025-12-03T18:29:20Z) - Image Generation as a Visual Planner for Robotic Manipulation [0.0]
Generating realistic robotic manipulation videos is an important step toward unifying perception, planning, and action in embodied agents. We propose a two-part framework that includes: (1) text-conditioned generation, which uses a language instruction and the first frame, and (2) trajectory-conditioned generation, which uses a 2D trajectory overlay and the same initial frame. Our findings indicate that pretrained image generators encode transferable temporal priors and can function as video-like robotic planners under minimal supervision.
arXiv Detail & Related papers (2025-11-29T15:54:16Z) - Self-Forcing++: Towards Minute-Scale High-Quality Video Generation [50.945885467651216]
Diffusion models have revolutionized image and video generation, achieving unprecedented visual quality. Recent work has explored autoregressive formulations for long video generation, typically by distilling from short-horizon bidirectional teachers. We propose a simple yet effective approach to mitigate quality degradation in long-horizon video generation without requiring supervision from long-video teachers or retraining on long video datasets.
arXiv Detail & Related papers (2025-10-02T17:55:42Z) - VideoMAR: Autoregressive Video Generation with Continuous Tokens [33.906543515428424]
Mask-based autoregressive models have demonstrated promising image generation capability in continuous space. We propose VideoMAR, a decoder-only autoregressive image-to-video model with continuous tokens. VideoMAR surpasses the previous state-of-the-art (Cosmos I2V) while requiring significantly fewer parameters.
arXiv Detail & Related papers (2025-06-17T04:08:18Z) - RoboSwap: A GAN-driven Video Diffusion Framework For Unsupervised Robot Arm Swapping [26.010205882976624]
RoboSwap operates on unpaired data from diverse environments. We segment robotic arms from their backgrounds and train an unpaired GAN model to translate one robotic arm to another. Our experiments demonstrate that RoboSwap outperforms state-of-the-art video and image editing models on three benchmarks.
arXiv Detail & Related papers (2025-06-10T09:46:07Z) - Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control [72.00655365269]
We present RoboMaster, a novel framework that models inter-object dynamics through a collaborative trajectory formulation. Unlike prior methods that decompose objects, our core idea is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction. Our method outperforms existing approaches, establishing new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation.
arXiv Detail & Related papers (2025-06-02T17:57:06Z) - DreamGen: Unlocking Generalization in Robot Learning through Video World Models [120.25799361925387]
DreamGen is a pipeline for training robot policies that generalize across behaviors and environments through neural trajectories. Our work establishes a promising new axis for scaling robot learning well beyond manual data collection.
arXiv Detail & Related papers (2025-05-19T04:55:39Z) - VILP: Imitation Learning with Latent Video Planning [19.25411361966752]
This paper introduces imitation learning with latent video planning (VILP). Our method is able to generate highly time-aligned videos from multiple views. Our paper provides a practical example of how to effectively integrate video generation models into robot policies.
arXiv Detail & Related papers (2025-02-03T19:55:57Z) - Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis [69.83405335645305]
We argue that naively bringing advances of image models to the video generation domain reduces motion fidelity, visual quality and impairs scalability.
In this work, we build Snap Video, a video-first model that systematically addresses these challenges.
We show that a U-Net - a workhorse behind image generation - scales poorly when generating videos, requiring significant computational overhead.
This allows us to efficiently train a text-to-video model with billions of parameters for the first time, reach state-of-the-art results on a number of benchmarks, and generate videos with substantially higher quality, temporal consistency, and motion complexity.
arXiv Detail & Related papers (2024-02-22T18:55:08Z) - Video Language Planning [137.06052217713054]
Video language planning is an algorithm that consists of a tree search procedure, where we train (i) vision-language models to serve as both policies and value functions, and (ii) text-to-video models as dynamics models.
Our algorithm produces detailed multimodal (video and language) specifications that describe how to complete the final task.
It substantially improves long-horizon task success rates compared to prior methods on both simulated and real robots.
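The Video Language Planning blurb above describes a tree search in which a vision-language model proposes and scores subgoals while a text-to-video model serves as the dynamics model. A hypothetical greedy sketch of that loop follows; the three model calls are toy string-based stand-ins, not the paper's actual VLM or video models.

```python
# Hypothetical sketch of a VLP-style tree search: a policy proposes
# candidate language subgoals, a dynamics model rolls each one forward,
# a value function scores the results, and the best branch is kept.
# All three models are toy stand-ins operating on strings.

def vlm_policy(state: str, task: str, k: int = 3) -> list[str]:
    # Stand-in policy: propose k candidate subgoals for the current state.
    return [f"option {i}" for i in range(k)]

def video_dynamics(state: str, subgoal: str) -> str:
    # Stand-in text-to-video dynamics model: roll out a successor state.
    return f"{state} -> {subgoal}"

def vlm_value(state: str, task: str) -> float:
    # Stand-in value function: deterministic pseudo-score within a run.
    return (hash((state, task)) % 1000) / 1000.0

def plan(state: str, task: str, depth: int = 2, branch: int = 3) -> list[str]:
    # Greedy search: at each step, expand all candidates and keep the best.
    trace = []
    for _ in range(depth):
        candidates = vlm_policy(state, task, branch)
        scored = [(vlm_value(video_dynamics(state, c), task), c)
                  for c in candidates]
        _, best = max(scored)
        state = video_dynamics(state, best)
        trace.append(best)
    return trace

print(plan("initial scene", "stack the blocks"))
```

The returned trace is the multimodal plan specification: a sequence of language subgoals, each backed by a video rollout, which a downstream controller can then execute step by step.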
arXiv Detail & Related papers (2023-10-16T17:48:45Z) - Temporally Consistent Transformers for Video Generation [80.45230642225913]
To generate accurate videos, algorithms have to understand the spatial and temporal dependencies in the world.
No established benchmarks on complex data exist for rigorously evaluating video generation with long temporal dependencies.
We introduce the Temporally Consistent Transformer (TECO), a generative model that substantially improves long-term consistency while also reducing sampling time.
arXiv Detail & Related papers (2022-10-05T17:15:10Z) - STPOTR: Simultaneous Human Trajectory and Pose Prediction Using a Non-Autoregressive Transformer for Robot Following Ahead [12.177604596741773]
We develop a neural network model to predict future human motion from an observed human motion history. We propose a non-autoregressive transformer architecture to leverage its parallel nature for easier training and fast, accurate predictions at test time. Our model compares favorably with state-of-the-art methods in test accuracy and speed, making it well-suited for robotic applications.
arXiv Detail & Related papers (2022-09-15T20:27:54Z) - HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator [90.74663948713615]
We train an autoregressive latent video prediction model capable of predicting high-fidelity future frames.
We produce high-resolution (256x256) videos with minimal modification to existing models.
arXiv Detail & Related papers (2022-09-15T08:41:57Z) - Future Frame Prediction for Robot-assisted Surgery [57.18185972461453]
We propose a ternary prior guided variational autoencoder (TPG-VAE) model for future frame prediction in robotic surgical video sequences.
Besides content distribution, our model learns motion distribution, which is novel to handle the small movements of surgical tools.
arXiv Detail & Related papers (2021-03-18T15:12:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.