MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation
- URL: http://arxiv.org/abs/2311.18829v2
- Date: Fri, 29 Dec 2023 09:12:52 GMT
- Title: MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation
- Authors: Yanhui Wang, Jianmin Bao, Wenming Weng, Ruoyu Feng, Dacheng Yin, Tao
Yang, Jingxu Zhang, Qi Dai, Zhiyuan Zhao, Chunyu Wang, Kai Qiu, Yuhui Yuan,
Chuanxin Tang, Xiaoyan Sun, Chong Luo, Baining Guo
- Abstract summary: MicroCinema is a framework for high-quality and coherent text-to-video generation.
We introduce a Divide-and-Conquer strategy that divides text-to-video generation into a two-stage process: text-to-image generation and image&text-to-video generation.
We show that MicroCinema achieves SOTA zero-shot FVD of 342.86 on UCF-101 and 377.40 on MSR-VTT.
- Score: 64.77240137998862
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present MicroCinema, a straightforward yet effective framework for
high-quality and coherent text-to-video generation. Unlike existing approaches
that align text prompts with video directly, MicroCinema introduces a
Divide-and-Conquer strategy that divides text-to-video generation into a two-stage
process: text-to-image generation and image&text-to-video generation. This
strategy offers two significant advantages. a) It allows us to take full
advantage of the recent advances in text-to-image models, such as Stable
Diffusion, Midjourney, and DALLE, to generate photorealistic and highly
detailed images. b) Leveraging the generated image, the model can allocate less
focus to fine-grained appearance details, prioritizing the efficient learning
of motion dynamics. To implement this strategy effectively, we introduce two
core designs. First, we propose the Appearance Injection Network, enhancing the
preservation of the appearance of the given image. Second, we introduce the
Appearance Noise Prior, a novel mechanism aimed at maintaining the capabilities
of pre-trained 2D diffusion models. These design elements empower MicroCinema
to generate high-quality videos with precise motion, guided by the provided
text prompts. Extensive experiments demonstrate the superiority of the proposed
framework. Concretely, MicroCinema achieves SOTA zero-shot FVD of 342.86 on
UCF-101 and 377.40 on MSR-VTT. See
https://wangyanhui666.github.io/MicroCinema.github.io/ for video samples.
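To make the two-stage recipe above concrete, the minimal Python sketch below mocks up the divide-and-conquer pipeline: a text-to-image model produces a key frame, and an image&text-to-video stage animates it, seeding each frame's sampling noise with the key-frame latent in the spirit of the Appearance Noise Prior. All names (text_to_image, appearance_noise_prior, image_text_to_video), the additive lambda-weighted noise formulation, and the stub models are assumptions made for illustration, not the authors' released code.

```python
# Minimal sketch of the divide-and-conquer idea, assuming stub models.
# Placeholder names and the additive noise-prior form are illustrative only.
import torch


def text_to_image(prompt: str) -> torch.Tensor:
    """Stage 1: any off-the-shelf T2I model (e.g. Stable Diffusion) would go
    here; a random RGB key frame stands in for its output."""
    return torch.rand(3, 256, 256)


def appearance_noise_prior(key_latent: torch.Tensor, noise: torch.Tensor,
                           lam: float = 0.2) -> torch.Tensor:
    """Assumed form of the Appearance Noise Prior: bias every frame's sampling
    noise toward the key-frame latent so appearance is easier to preserve."""
    return noise + lam * key_latent.unsqueeze(0)  # broadcast across frames


def image_text_to_video(key_frame: torch.Tensor, prompt: str,
                        num_frames: int = 16) -> torch.Tensor:
    """Stage 2: an image&text-to-video diffusion model. Only the conditioning
    plumbing is shown; the denoiser itself is a stub."""
    key_latent = key_frame.mean(dim=0, keepdim=True)    # toy stand-in for a VAE latent
    noise = torch.randn(num_frames, *key_latent.shape)  # independent per-frame noise
    start = appearance_noise_prior(key_latent, noise)   # appearance-aware starting noise
    # A real model would iteratively denoise `start` with a U-Net that also sees
    # the key frame (Appearance Injection Network) and the text prompt. The stub
    # simply tiles the key frame so the output has video shape (T, C, H, W).
    return key_frame.unsqueeze(0).repeat(num_frames, 1, 1, 1)


if __name__ == "__main__":
    prompt = "a corgi surfing at sunset"
    key_frame = text_to_image(prompt)               # text -> image
    video = image_text_to_video(key_frame, prompt)  # image&text -> video
    print(video.shape)                              # torch.Size([16, 3, 256, 256])
```

Conditioning stage 2 on a finished key frame is what lets the video model spend its capacity on motion rather than appearance, which is the argument made in point (b) above.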
Related papers
- FrameBridge: Improving Image-to-Video Generation with Bridge Models [23.19370431940568]
Image-to-video (I2V) generation is gaining increasing attention with its wide application in video synthesis.
We present FrameBridge, which takes the given static image as the prior of the video target and establishes a tractable bridge model between them.
We propose two techniques, SNR-Aligned Fine-tuning (SAF) and neural prior, which improve the efficiency of fine-tuning diffusion-based T2V models to FrameBridge and the synthesis quality of bridge-based I2V models, respectively.
arXiv Detail & Related papers (2024-10-20T12:10:24Z) - AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods make significant breakthroughs by adapting Stable Diffusion for this task.
We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z) - LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation [44.220329202024494]
We present a few-shot-based tuning framework, LAMP, which enables a text-to-image diffusion model to Learn A specific Motion Pattern with 8-16 videos on a single GPU.
Specifically, we design a first-frame-conditioned pipeline that uses an off-the-shelf text-to-image model for content generation.
To capture features along the temporal dimension, we expand the pretrained 2D convolution layers of the T2I model into our novel temporal-spatial motion learning layers.
arXiv Detail & Related papers (2023-10-16T19:03:19Z) - Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling [79.49128866877922]
Video-Teller is a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment.
Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules.
It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions.
arXiv Detail & Related papers (2023-10-08T03:35:27Z) - Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation [93.18163456287164]
This paper proposes a novel text-guided video-to-video translation framework to adapt image models to videos.
Our framework achieves global style and local texture temporal consistency at a low cost.
arXiv Detail & Related papers (2023-06-13T17:52:23Z) - Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning [50.60891619269651]
Control-A-Video is a controllable T2V diffusion model that can generate videos conditioned on text prompts and reference control maps like edge and depth maps.
We propose novel strategies to incorporate content prior and motion prior into the diffusion-based generation process.
Our framework generates higher-quality, more consistent videos compared to existing state-of-the-art methods in controllable text-to-video generation.
arXiv Detail & Related papers (2023-05-23T09:03:19Z) - Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V).
We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation.
arXiv Detail & Related papers (2022-09-29T13:59:46Z) - TiVGAN: Text to Image to Video Generation with Step-by-Step Evolutionary Generator [34.7504057664375]
We propose a novel training framework, Text-to-Image-to-Video Generative Adversarial Network (TiVGAN), which evolves frame-by-frame and finally produces a full-length video.
The step-by-step learning process helps stabilize training and enables the creation of high-resolution videos based on conditional text descriptions.
arXiv Detail & Related papers (2020-09-04T06:33:08Z)
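Purely to illustrate the step-by-step idea in the TiVGAN summary above, here is a toy curriculum that grows the clip length stage by stage; the doubling schedule and the train_stage stub are assumptions made for illustration, not the paper's actual training procedure.

```python
# Toy step-by-step curriculum: start from single-frame (image) generation and
# grow the clip length each stage. Schedule and names are assumptions.
import torch


def train_stage(num_frames: int) -> float:
    """Stand-in for training the generator at a fixed clip length;
    returns a proxy value so the loop runs end to end."""
    fake_batch = torch.randn(4, num_frames, 3, 64, 64)  # (batch, T, C, H, W)
    return float(fake_batch.std())


def progressive_training(target_frames: int = 16) -> None:
    num_frames = 1                           # begin with text-to-image generation
    while num_frames <= target_frames:
        proxy = train_stage(num_frames)      # train until this clip length is stable
        print(f"stage: {num_frames:2d} frames, proxy metric {proxy:.3f}")
        num_frames *= 2                      # then double the number of frames


if __name__ == "__main__":
    progressive_training()
```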