AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction
- URL: http://arxiv.org/abs/2406.06465v1
- Date: Mon, 10 Jun 2024 17:02:08 GMT
- Title: AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction
- Authors: Zhen Xing, Qi Dai, Zejia Weng, Zuxuan Wu, Yu-Gang Jiang,
- Abstract summary: Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods make significant breakthroughs by adapting Stable Diffusion for this task.
We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
- Score: 88.70116693750452
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction, which has wide applications in virtual reality, robotics, and content creation. Previous TVP methods make significant breakthroughs by adapting Stable Diffusion for this task. However, they struggle with frame consistency and temporal stability primarily due to the limited scale of video datasets. We observe that pretrained Image2Video diffusion models possess good priors for video dynamics but they lack textual control. Hence, transferring Image2Video models to leverage their video dynamic priors while injecting instruction control to generate controllable videos is both a meaningful and challenging task. To achieve this, we introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions. More specifically, we design a dual query transformer (DQFormer) architecture, which integrates the instructions and frames into the conditional embeddings for future frame prediction. Additionally, we develop Long-Short Term Temporal Adapters and Spatial Adapters that can quickly transfer general video diffusion models to specific scenarios with minimal training costs. Experimental results show that our method significantly outperforms state-of-the-art techniques on four datasets: Something Something V2, Epic Kitchen-100, Bridge Data, and UCF-101. Notably, AID achieves 91.2% and 55.5% FVD improvements on Bridge and SSv2 respectively, demonstrating its effectiveness in various domains. More examples can be found at our website https://chenhsing.github.io/AID.
Related papers
- FrameBridge: Improving Image-to-Video Generation with Bridge Models [23.19370431940568]
Image-to-video (I2V) generation is gaining increasing attention with its wide application in video synthesis.
We present FrameBridge, taking the given static image as the prior of video target and establishing a tractable bridge model between them.
We propose two techniques, SNR- Fine-tuning (SAF) and neural prior, which improve the fine-tuning efficiency of diffusion-based T2V models to FrameBridge and the synthesis quality of bridge-based I2V models respectively.
arXiv Detail & Related papers (2024-10-20T12:10:24Z) - Let Video Teaches You More: Video-to-Image Knowledge Distillation using DEtection TRansformer for Medical Video Lesion Detection [91.97935118185]
We propose Video-to-Image knowledge distillation for the task of medical video lesion detection.
By distilling multi-frame contexts into a single frame, the proposed V2I-DETR combines the advantages of utilizing temporal contexts from video-based models and the inference speed of image-based models.
V2I-DETR outperforms previous state-of-the-art methods by a large margin while achieving the real-time inference speed (30 FPS) as the image-based model.
arXiv Detail & Related papers (2024-08-26T07:17:05Z) - Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of its dynamics video.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - A Simple Recipe for Contrastively Pre-training Video-First Encoders
Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale, image--text models to video via shallow temporal fusion.
We expose two limitations to the approach: (1) decreased spatial capabilities, likely due to poor video--language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z) - BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models [40.73982918337828]
We propose a training-free general-purpose video synthesis framework, coined as bf BIVDiff, via bridging specific image diffusion models and general text-to-video foundation diffusion models.
Specifically, we first use a specific image diffusion model (e.g., ControlNet and Instruct Pix2Pix) for frame-wise video generation, then perform Mixed Inversion on the generated video, and finally input the inverted latents into the video diffusion models.
arXiv Detail & Related papers (2023-12-05T14:56:55Z) - Seer: Language Instructed Video Prediction with Latent Diffusion Models [43.708550061909754]
Text-conditioned video prediction (TVP) is an essential task to facilitate general robot policy learning.
We propose a sample and computation-efficient model, named textbfSeer, by inflating the pretrained text-to-image (T2I) stable diffusion models along the temporal axis.
With the adaptable-designed architecture, Seer makes it possible to generate high-fidelity, coherent, and instruction-aligned video frames.
arXiv Detail & Related papers (2023-03-27T03:12:24Z) - Long-Form Video-Language Pre-Training with Multimodal Temporal
Contrastive Learning [39.80936685227549]
Large-scale video-language pre-training has shown significant improvement in video-language understanding tasks.
We introduce a Long-Form VIdeo-LAnguage pre-training model (VILA) and train it on a large-scale long-form video and paragraph dataset.
We fine-tune the model on seven downstream long-form video-language understanding tasks, achieve new state-of-the-art performances.
arXiv Detail & Related papers (2022-10-12T09:08:27Z) - Understanding Road Layout from Videos as a Whole [82.30800791500869]
We formulate it as a top-view road attributes prediction problem and our goal is to predict these attributes for each frame both accurately and consistently.
We exploit the following three novel aspects: leveraging camera motions in videos, including context cuesand incorporating long-term video information.
arXiv Detail & Related papers (2020-07-02T00:59:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.