AVID: Adapting Video Diffusion Models to World Models
- URL: http://arxiv.org/abs/2410.12822v2
- Date: Sun, 24 Nov 2024 19:37:04 GMT
- Title: AVID: Adapting Video Diffusion Models to World Models
- Authors: Marc Rigter, Tarun Gupta, Agrin Hilmkil, Chao Ma,
- Abstract summary: We propose to adapt pretrained video diffusion models to action-conditioned world models, without access to the parameters of the pretrained model.
AVID uses a learned mask to modify the intermediate outputs of the pretrained model and generate accurate action-conditioned videos.
We evaluate AVID on video game and real-world robotics data, and show that it outperforms existing baselines for diffusion model adaptation.
- Score: 10.757223474031248
- Abstract: Large-scale generative models have achieved remarkable success in a number of domains. However, for sequential decision-making problems, such as robotics, action-labelled data is often scarce and therefore scaling up foundation models for decision-making remains a challenge. A potential solution lies in leveraging widely-available unlabelled videos to train world models that simulate the consequences of actions. If the world model is accurate, it can be used to optimize decision-making in downstream tasks. Image-to-video diffusion models are already capable of generating highly realistic synthetic videos. However, these models are not action-conditioned, and the most powerful models are closed-source which means they cannot be finetuned. In this work, we propose to adapt pretrained video diffusion models to action-conditioned world models, without access to the parameters of the pretrained model. Our approach, AVID, trains an adapter on a small domain-specific dataset of action-labelled videos. AVID uses a learned mask to modify the intermediate outputs of the pretrained model and generate accurate action-conditioned videos. We evaluate AVID on video game and real-world robotics data, and show that it outperforms existing baselines for diffusion model adaptation. Our results demonstrate that if utilized correctly, pretrained video models have the potential to be powerful tools for embodied AI.
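The learned-mask composition described in the abstract can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the paper's implementation: it assumes the mask gates a pixel-wise convex combination of the frozen pretrained model's noise prediction and the adapter's noise prediction at each denoising step; all names and tensor shapes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def avid_step(pretrained_eps, adapter_eps, mask):
    """Combine two denoising outputs via a learned mask (AVID-style sketch).

    mask values in (0, 1): 1 trusts the action-conditioned adapter,
    0 trusts the frozen pretrained video model.
    """
    return mask * adapter_eps + (1.0 - mask) * pretrained_eps

# Toy tensors standing in for noise predictions on a (frames, H, W) video.
pretrained_eps = rng.standard_normal((4, 8, 8))
adapter_eps = rng.standard_normal((4, 8, 8))

# A "learned" mask would come from the adapter network; here we just
# squash random values through a sigmoid to land in (0, 1).
mask = 1.0 / (1.0 + np.exp(-rng.standard_normal((4, 8, 8))))

eps = avid_step(pretrained_eps, adapter_eps, mask)
assert eps.shape == (4, 8, 8)
```

Note the two limiting cases: a mask of all zeros recovers the pretrained model unchanged, while a mask of all ones uses the adapter alone, which matches the abstract's framing of the adapter as a correction on top of a frozen model.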
Related papers
- VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation [79.00294932026266]
VidMan is a novel framework that employs a two-stage training mechanism to enhance stability and improve data utilization efficiency.
Our framework outperforms state-of-the-art baseline model GR-1 on the CALVIN benchmark, achieving a 11.7% relative improvement, and demonstrates over 9% precision gains on the OXE small-scale dataset.
arXiv Detail & Related papers (2024-11-14T03:13:26Z)
- Video Diffusion Alignment via Reward Gradients [46.59030518823666]
In this work, we utilize pre-trained reward models that are learned via preferences on top of powerful vision discriminative models to adapt video diffusion models.
We show results across a variety of reward models and video diffusion models, demonstrating that our approach can learn much more efficiently in terms of reward queries and computation.
arXiv Detail & Related papers (2024-07-11T17:59:45Z)
- ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation [81.90265212988844]
We propose a training-free method for generative video models in a plug-and-play manner.
We transform a video model into a self-cascaded video diffusion model with the designed hidden state correction modules.
Our training-free method is even comparable to trained models supported by huge compute resources and large-scale datasets.
arXiv Detail & Related papers (2024-06-03T00:31:13Z)
- Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets [36.95521842177614]
We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation.
We identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning.
arXiv Detail & Related papers (2023-11-25T22:28:38Z)
- Probabilistic Adaptation of Text-to-Video Models [181.84311524681536]
Video Adapter is capable of incorporating the broad knowledge and preserving the high fidelity of a large pretrained video model in a task-specific small video model.
Video Adapter is able to generate high-quality yet specialized videos on a variety of tasks such as animation, egocentric modeling, and modeling of simulated and real-world robotics data.
arXiv Detail & Related papers (2023-06-02T19:00:17Z)
- AIM: Adapting Image Models for Efficient Video Action Recognition [22.805026175928997]
We propose a method to Adapt pre-trained Image Models (AIM) for efficient video understanding.
By freezing the pre-trained image model and adding a few lightweight Adapters, we introduce spatial adaptation, temporal adaptation and joint adaptation.
We show that our proposed AIM can achieve competitive or even better performance than prior arts with substantially fewer tunable parameters.
arXiv Detail & Related papers (2023-02-06T18:59:17Z)
- Composing Ensembles of Pre-trained Models via Iterative Consensus [95.10641301155232]
We propose a unified framework for composing ensembles of different pre-trained models.
We use pre-trained models as "generators" or "scorers" and compose them via closed-loop iterative consensus optimization.
We demonstrate that consensus achieved by an ensemble of scorers outperforms the feedback of a single scorer.
arXiv Detail & Related papers (2022-10-20T18:46:31Z)
- Diffusion Models for Video Prediction and Infilling [27.246449347832108]
We present Random-Mask Video Diffusion (RaMViD), which extends image diffusion models to videos using 3D convolutions.
By varying the mask we condition on, the model is able to perform video prediction, infilling and upsampling.
We evaluate the model on two benchmark datasets for video prediction and one for video generation, on which we achieve competitive results.
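The mask-based conditioning that lets one RaMViD model switch between prediction, infilling, and upsampling can be sketched as follows. This is a hedged toy illustrating the idea only: the mask marks which frames are observed (kept as conditioning) and which are denoised; `make_task_mask` and the specific frame choices are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task_mask(num_frames, task):
    """Binary frame mask selecting which frames are observed (conditioned on).

    The same diffusion model performs different tasks depending on the mask;
    the frame choices below are illustrative.
    """
    mask = np.zeros(num_frames, dtype=bool)
    if task == "prediction":      # condition on the first two frames
        mask[:2] = True
    elif task == "infilling":     # condition on the first and last frames
        mask[0] = mask[-1] = True
    elif task == "upsampling":    # condition on every other frame
        mask[::2] = True
    return mask

# Observed frames stay fixed; unobserved frames start from noise
# and would be denoised by the model.
frames = rng.standard_normal((8, 4, 4))
mask = make_task_mask(8, "infilling")
noisy_input = np.where(mask[:, None, None], frames, rng.standard_normal((8, 4, 4)))
assert noisy_input.shape == frames.shape
```

The appeal of this formulation is that no task-specific architecture is needed: changing the mask at sampling time changes the task.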
arXiv Detail & Related papers (2022-06-15T17:44:47Z)
- ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.