ART$\boldsymbol{\cdot}$V: Auto-Regressive Text-to-Video Generation with
Diffusion Models
- URL: http://arxiv.org/abs/2311.18834v1
- Date: Thu, 30 Nov 2023 18:59:47 GMT
- Title: ART$\boldsymbol{\cdot}$V: Auto-Regressive Text-to-Video Generation with
Diffusion Models
- Authors: Wenming Weng, Ruoyu Feng, Yanhui Wang, Qi Dai, Chunyu Wang, Dacheng
Yin, Zhiyuan Zhao, Kai Qiu, Jianmin Bao, Yuhui Yuan, Chong Luo, Yueyi Zhang,
Zhiwei Xiong
- Abstract summary: ART$\boldsymbol{\cdot}$V is an efficient framework for auto-regressive video generation with diffusion models.
It only learns simple continual motions between adjacent frames.
It can generate arbitrarily long videos conditioned on a variety of prompts.
- Score: 99.84195819571411
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present ART$\boldsymbol{\cdot}$V, an efficient framework for
auto-regressive video generation with diffusion models. Unlike existing methods
that generate entire videos in one-shot, ART$\boldsymbol{\cdot}$V generates a
single frame at a time, conditioned on the previous ones. The framework offers
three distinct advantages. First, it only learns simple continual motions
between adjacent frames, thereby avoiding the modeling of complex long-range
motions that require huge amounts of training data. Second, it preserves the high-fidelity
generation ability of the pre-trained image diffusion models by making only
minimal network modifications. Third, it can generate arbitrarily long videos
conditioned on a variety of prompts such as text, image or their combinations,
making it highly versatile and flexible. To combat the common drifting issue in
AR models, we propose a masked diffusion model which implicitly learns which
information can be drawn from reference images rather than network predictions,
in order to reduce the risk of generating inconsistent appearances that cause
drifting. Moreover, we further enhance generation coherence by conditioning it
on the initial frame, which typically contains minimal noise. This is
particularly useful for long video generation. When trained for only two weeks
on four GPUs, ART$\boldsymbol{\cdot}$V can already generate videos with natural
motions, rich details, and a high level of aesthetic quality. In addition, it
enables various appealing applications, e.g., composing a long video from
multiple text prompts.
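The frame-by-frame procedure described in the abstract can be summarized in a short sketch. The following Python snippet is a minimal, hypothetical illustration of auto-regressive sampling conditioned on both the previous frame and the initial anchor frame; the diffusion sampler is replaced by a trivial placeholder, and all names are illustrative rather than the authors' actual code.

```python
# Minimal sketch (not the authors' code) of ART-V-style auto-regressive
# frame-by-frame generation with anchor-frame conditioning. The diffusion
# sampler is a trivial placeholder so the loop itself is runnable.
from typing import List, Optional

import torch


def sample_frame_placeholder(
    text_emb: torch.Tensor,
    prev_frame: Optional[torch.Tensor],
    anchor_frame: Optional[torch.Tensor],
    shape=(3, 256, 256),
) -> torch.Tensor:
    """Stand-in for one reverse-diffusion pass of the (lightly modified)
    pre-trained image diffusion model. It simply mixes noise with the
    reference frames to mimic drawing appearance information from them,
    which is the role the paper assigns to its masked diffusion model."""
    noise = torch.randn(shape)
    refs = [f for f in (prev_frame, anchor_frame) if f is not None]
    if not refs:
        return noise  # first frame: sampled from the text prompt alone
    reference = torch.stack(refs).mean(dim=0)
    return 0.8 * reference + 0.2 * noise


def generate_video(
    text_emb: torch.Tensor,
    num_frames: int,
    first_frame: Optional[torch.Tensor] = None,
) -> List[torch.Tensor]:
    """Generate a video one frame at a time. Each frame is conditioned on the
    previous frame (simple short-range motion) and on the initial anchor frame
    (which typically contains the least accumulated error), so the loop can be
    unrolled to arbitrary length."""
    # The anchor is either a user-supplied image prompt or a frame sampled
    # from the text prompt alone.
    anchor = (
        first_frame
        if first_frame is not None
        else sample_frame_placeholder(text_emb, None, None)
    )
    frames: List[torch.Tensor] = [anchor]

    for _ in range(num_frames - 1):
        frames.append(
            sample_frame_placeholder(
                text_emb, prev_frame=frames[-1], anchor_frame=anchor
            )
        )
    return frames


# Example usage: 16 frames from a placeholder text embedding.
video = generate_video(text_emb=torch.zeros(77, 768), num_frames=16)
```

Because only adjacent-frame motion is modeled and the anchor frame is reused at every step, the same loop can in principle be extended to arbitrarily long videos or re-seeded with new text prompts to compose a longer sequence.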
Related papers
- HARIVO: Harnessing Text-to-Image Models for Video Generation [45.63338167699105]
We present a method to create diffusion-based video models from pretrained Text-to-Image (T2I) models.
Key innovations include novel loss functions for temporal smoothness and a mitigating gradient sampling technique.
Our method, built on the frozen StableDiffusion model, simplifies training processes and allows for seamless integration with off-the-shelf models like ControlNet and DreamBooth.
arXiv Detail & Related papers (2024-10-10T09:47:39Z)
- AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods have made significant breakthroughs by adapting Stable Diffusion for this task.
We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z)
- StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation [117.13475564834458]
We propose a new way of self-attention calculation, termed Consistent Self-Attention.
To extend our method to long-range video generation, we introduce a novel semantic space temporal motion prediction module.
By merging these two novel components, our framework, referred to as StoryDiffusion, can describe a text-based story with consistent images or videos.
arXiv Detail & Related papers (2024-05-02T16:25:16Z)
- StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text [58.49820807662246]
We introduce StreamingT2V, an autoregressive approach for long video generation of 80, 240, 600, 1200 or more frames with smooth transitions.
Our code will be available at: https://github.com/Picsart-AI-Research/StreamingT2V.
arXiv Detail & Related papers (2024-03-21T18:27:29Z)
- BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models [40.73982918337828]
We propose a training-free, general-purpose video synthesis framework, coined BIVDiff, which bridges specific image diffusion models and general text-to-video foundation diffusion models.
Specifically, we first use a specific image diffusion model (e.g., ControlNet and Instruct Pix2Pix) for frame-wise video generation, then perform Mixed Inversion on the generated video, and finally input the inverted latents into the video diffusion models.
arXiv Detail & Related papers (2023-12-05T14:56:55Z)
- DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance [69.0740091741732]
We propose DreamVideo, a high-fidelity image-to-video generation method that devises a frame retention branch on top of a pre-trained video diffusion model.
Our model has a powerful image retention ability and, to the best of our knowledge, delivers the best results on UCF101 compared with other image-to-video models.
arXiv Detail & Related papers (2023-12-05T03:16:31Z)
- LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation [44.220329202024494]
We present a few-shot-based tuning framework, LAMP, which enables a text-to-image diffusion model to Learn A specific Motion Pattern with 8~16 videos on a single GPU.
Specifically, we design a first-frame-conditioned pipeline that uses an off-the-shelf text-to-image model for content generation.
To capture the features of temporal dimension, we expand the pretrained 2D convolution layers of the T2I model to our novel temporal-spatial motion learning layers.
arXiv Detail & Related papers (2023-10-16T19:03:19Z)
- Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation [31.882356164068753]
To reproduce the success of text-to-image (T2I) generation, recent works in text-to-video (T2V) generation employ massive video datasets for training.
We instead propose Tune-A-Video, which is capable of producing temporally coherent videos across various applications.
arXiv Detail & Related papers (2022-12-22T09:43:36Z)
- Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V).
We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation.
arXiv Detail & Related papers (2022-09-29T13:59:46Z)
- Video Diffusion Models [47.99413440461512]
Generating temporally coherent high fidelity video is an important milestone in generative modeling research.
We propose a diffusion model for video generation that shows very promising initial results.
We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on an established unconditional video generation benchmark.
arXiv Detail & Related papers (2022-04-07T14:08:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.