CoVAR: Co-generation of Video and Action for Robotic Manipulation via Multi-Modal Diffusion
- URL: http://arxiv.org/abs/2512.16023v1
- Date: Wed, 17 Dec 2025 23:16:02 GMT
- Title: CoVAR: Co-generation of Video and Action for Robotic Manipulation via Multi-Modal Diffusion
- Authors: Liudi Yang, Yang Bai, George Eskandar, Fengyi Shen, Mohammad Altillawi, Dong Chen, Ziyuan Liu, Abhinav Valada
- Abstract summary: We present a method to generate video-action pairs that follow text instructions, starting from an initial image observation and the robot's joint states. Our approach automatically provides action labels for video diffusion models, overcoming the common lack of action annotations and enabling their full use for robotic policy learning.
- Score: 27.567059323636112
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a method to generate video-action pairs that follow text instructions, starting from an initial image observation and the robot's joint states. Our approach automatically provides action labels for video diffusion models, overcoming the common lack of action annotations and enabling their full use for robotic policy learning. Existing methods either adopt two-stage pipelines, which limit tightly coupled cross-modal information sharing, or rely on adapting a single-modal diffusion model to a joint distribution, which cannot fully leverage pretrained video knowledge. To overcome these limitations, we (1) extend a pretrained video diffusion model with a parallel, dedicated action diffusion model that preserves pretrained knowledge, (2) introduce a Bridge Attention mechanism to enable effective cross-modal interaction, and (3) design an action refinement module to convert coarse actions into precise controls for low-resolution datasets. Extensive evaluations on multiple public benchmarks and real-world datasets demonstrate that our method generates higher-quality videos and more accurate actions, significantly outperforming existing baselines and offering a scalable framework for leveraging large-scale video data for robotic learning.
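To make the dual-stream architecture from the abstract concrete, below is a minimal PyTorch sketch of how a pretrained video diffusion pathway and a parallel action pathway could exchange information through a cross-attention "bridge". The class names `BridgeAttention` and `CoGenerationBlock`, the bidirectional residual design, and all tensor shapes are illustrative assumptions; the paper's actual Bridge Attention and action refinement module may differ.

```python
# Hedged sketch of co-generating video and action tokens with a cross-modal
# bridge. Shapes, layer widths, and the fusion rule are assumptions, not the
# paper's implementation.
import torch
import torch.nn as nn

class BridgeAttention(nn.Module):
    """Bidirectional cross-attention between video and action tokens (assumed design)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.video_to_action = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.action_to_video = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_tokens, action_tokens):
        # Each stream queries the other; residual connections keep the
        # pretrained video pathway largely intact when the bridge output is small.
        v_ctx, _ = self.action_to_video(video_tokens, action_tokens, action_tokens)
        a_ctx, _ = self.video_to_action(action_tokens, video_tokens, video_tokens)
        return video_tokens + v_ctx, action_tokens + a_ctx

class CoGenerationBlock(nn.Module):
    """One denoising block: parallel per-modality layers plus a bridge (assumed)."""
    def __init__(self, dim: int):
        super().__init__()
        self.video_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.action_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.bridge = BridgeAttention(dim)

    def forward(self, video_tokens, action_tokens):
        video_tokens = self.video_layer(video_tokens)      # pretrained video pathway
        action_tokens = self.action_layer(action_tokens)   # dedicated action pathway
        return self.bridge(video_tokens, action_tokens)

# Toy shapes: 1 sample, 256 video tokens, 16 action tokens, width 512.
video = torch.randn(1, 256, 512)
actions = torch.randn(1, 16, 512)
v, a = CoGenerationBlock(512)(video, actions)
print(v.shape, a.shape)  # torch.Size([1, 256, 512]) torch.Size([1, 16, 512])
```

Keeping the two denoisers as separate modules and confining their interaction to the bridge is one way to preserve pretrained video knowledge while learning actions, which is the design motivation the abstract describes.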
Related papers
- Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model [62.889356203346985]
We propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict.
DUST achieves up to 6% gains over a standard VLA baseline and implicit world-modeling methods.
On real-world tasks with the Franka Research 3, DUST outperforms baselines in success rate by 13%.
arXiv Detail & Related papers (2025-10-31T16:32:12Z)
- Vidar: Embodied Video Diffusion Model for Generalist Manipulation [28.216910600346512]
Vidar is a prior-driven, low-shot adaptation paradigm that replaces most embodiment-specific data with transferable video priors.
Our results suggest a scalable recipe for "one prior, many embodiments": strong, inexpensive video priors + minimal on-robot alignment.
arXiv Detail & Related papers (2025-07-17T08:31:55Z)
- TRIM: A Self-Supervised Video Summarization Framework Maximizing Temporal Relative Information and Representativeness [9.374702244811303]
We introduce a self-supervised video summarization model that captures both spatial and temporal dependencies without the overhead of attention, RNNs, or transformers.
Our framework integrates a novel set of Markov process-driven loss metrics and a two-stage self-supervised learning paradigm that ensures both performance and efficiency.
arXiv Detail & Related papers (2025-06-25T16:27:38Z)
- Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control [72.00655365269]
We present RoboMaster, a novel framework that models inter-object dynamics through a collaborative trajectory formulation.
Unlike prior methods that decompose objects, our core idea is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction.
Our method outperforms existing approaches, establishing new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation.
arXiv Detail & Related papers (2025-06-02T17:57:06Z)
- Towards a Generalizable Bimanual Foundation Policy via Flow-based Video Prediction [47.86532300894681]
Existing approaches rely on Vision-Language-Action (VLA) models to acquire bimanual policies.
We propose a novel bimanual foundation policy by fine-tuning the leading text-to-video models to predict robot trajectories.
Our method mitigates the ambiguity of language in single-stage text-to-video prediction and significantly reduces the robot-data requirement.
arXiv Detail & Related papers (2025-05-30T03:01:21Z)
- Vid2World: Crafting Video Diffusion Models to Interactive World Models [35.42362065437052]
We present Vid2World, a general approach for leveraging and transferring pre-trained video diffusion models into interactive world models.
Our method offers a scalable and effective pathway for repurposing highly capable video diffusion models into interactive world models.
arXiv Detail & Related papers (2025-05-20T13:41:45Z)
- Unified Video Action Model [47.88377984526902]
A unified video and action model holds significant promise for robotics, where videos provide rich scene information for action prediction.
We introduce the Unified Video Action model (UVA), which jointly optimizes video and action predictions to achieve both high accuracy and efficient action inference.
Via an extensive set of experiments, we demonstrate that UVA can serve as a general-purpose solution for a wide range of robotics tasks.
arXiv Detail & Related papers (2025-02-28T21:38:17Z)
- VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation [79.00294932026266]
VidMan is a novel framework that employs a two-stage training mechanism to enhance stability and improve data utilization efficiency.
Our framework outperforms the state-of-the-art baseline model GR-1 on the CALVIN benchmark, achieving an 11.7% relative improvement, and demonstrates over 9% precision gains on the OXE small-scale dataset.
arXiv Detail & Related papers (2024-11-14T03:13:26Z)
- AICL: Action In-Context Learning for Video Diffusion Model [124.39948693332552]
We propose AICL, which empowers the generative model with the ability to understand action information in reference videos.
Extensive experiments demonstrate that AICL effectively captures the action and achieves state-of-the-art generation performance.
arXiv Detail & Related papers (2024-03-18T07:41:19Z)
- REST: REtrieve & Self-Train for generative action recognition [54.90704746573636]
We propose to adapt a pre-trained generative Vision & Language (V&L) Foundation Model for video/action recognition.
We show that direct fine-tuning of a generative model to produce action classes suffers from severe overfitting.
We introduce REST, a training framework consisting of two key components.
arXiv Detail & Related papers (2022-09-29T17:57:01Z)
- Cross-modal Manifold Cutmix for Self-supervised Video Representation Learning [50.544635516455116]
This paper focuses on designing video augmentation for self-supervised learning.
We first analyze the best strategy to mix videos to create a new augmented video sample.
We propose Cross-Modal Manifold Cutmix (CMMC) that inserts a video tesseract into another video tesseract in the feature space across two different modalities.
arXiv Detail & Related papers (2021-12-07T18:58:33Z)
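The CMMC entry above describes inserting a spatio-temporal block ("tesseract") of one clip's features into another clip's features across modalities. Below is a short hedged sketch of that feature-space mixing idea; the block-sampling scheme, the `ratio` parameter, the shapes, and the RGB/flow pairing are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of feature-space cutmix across two feature maps. The region
# sampling and shapes are assumptions for illustration only.
import torch

def manifold_cutmix(feat_a: torch.Tensor, feat_b: torch.Tensor, ratio: float = 0.5):
    """feat_*: (C, T, H, W) feature maps from two different clips/modalities."""
    _, t, h, w = feat_a.shape
    # Sample a cut region covering `ratio` of each spatio-temporal axis.
    dt, dh, dw = max(1, int(t * ratio)), max(1, int(h * ratio)), max(1, int(w * ratio))
    t0 = torch.randint(0, t - dt + 1, (1,)).item()
    h0 = torch.randint(0, h - dh + 1, (1,)).item()
    w0 = torch.randint(0, w - dw + 1, (1,)).item()
    # Copy feat_a and overwrite the sampled block with feat_b's block.
    mixed = feat_a.clone()
    mixed[:, t0:t0 + dt, h0:h0 + dh, w0:w0 + dw] = feat_b[:, t0:t0 + dt, h0:h0 + dh, w0:w0 + dw]
    return mixed

rgb_feat = torch.randn(64, 8, 14, 14)   # e.g. RGB-stream features (hypothetical shapes)
flow_feat = torch.randn(64, 8, 14, 14)  # e.g. optical-flow-stream features
augmented = manifold_cutmix(rgb_feat, flow_feat)
```

Mixing in feature space rather than pixel space is what distinguishes this style of augmentation from standard CutMix, which is the point the CMMC summary makes.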