Make-Your-Video: Customized Video Generation Using Textual and
Structural Guidance
- URL: http://arxiv.org/abs/2306.00943v1
- Date: Thu, 1 Jun 2023 17:43:27 GMT
- Title: Make-Your-Video: Customized Video Generation Using Textual and
Structural Guidance
- Authors: Jinbo Xing, Menghan Xia, Yuxin Liu, Yuechen Zhang, Yong Zhang,
Yingqing He, Hanyuan Liu, Haoxin Chen, Xiaodong Cun, Xintao Wang, Ying Shan,
Tien-Tsin Wong
- Abstract summary: Recent advancements in text-to-video synthesis have unveiled the potential to create vivid videos from prompts alone.
In this paper, we explore customized video generation by utilizing text as context description and motion structure.
Our method, dubbed Make-Your-Video, involves joint-conditional video generation using a Latent Diffusion Model.
- Score: 36.26032505627126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Creating a vivid video from the event or scenario in our imagination is a
truly fascinating experience. Recent advancements in text-to-video synthesis
have unveiled the potential to achieve this with prompts only. While text is
convenient in conveying the overall scene context, it may be insufficient to
control precisely. In this paper, we explore customized video generation by
utilizing text as context description and motion structure (e.g. frame-wise
depth) as concrete guidance. Our method, dubbed Make-Your-Video, involves
joint-conditional video generation using a Latent Diffusion Model that is
pre-trained for still image synthesis and then promoted for video generation
with the introduction of temporal modules. This two-stage learning scheme not
only reduces the computing resources required, but also improves the
performance by transferring the rich concepts available solely in image datasets
into video generation. Moreover, we use a simple yet effective causal
attention mask strategy to enable longer video synthesis, which mitigates the
potential quality degradation effectively. Experimental results show the
superiority of our method over existing baselines, particularly in terms of
temporal coherence and fidelity to users' guidance. In addition, our model
enables several intriguing applications that demonstrate potential for
practical usage.
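The causal attention mask strategy mentioned in the abstract is easiest to picture in code. Below is a minimal, illustrative PyTorch sketch of a temporal self-attention layer in which each frame attends only to itself and earlier frames; the class name, tensor layout, head count, and the toy usage numbers are assumptions for illustration, not the authors' released implementation.

import torch
import torch.nn as nn

class CausalTemporalAttention(nn.Module):
    """Temporal self-attention with a frame-wise causal mask (illustrative sketch only)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * spatial_positions, num_frames, dim); attention runs over the frame axis.
        b, t, d = x.shape
        h = self.num_heads
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (y.view(b, t, h, d // h).transpose(1, 2) for y in (q, k, v))
        # Causal mask over frames: frame i may only attend to frames j <= i.
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        attn = (q @ k.transpose(-2, -1)) / (d // h) ** 0.5
        attn = attn.masked_fill(mask, float("-inf")).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, d)
        return self.proj(out)

# Toy usage: 16 frames of 320-dim tokens at one spatial position, batch of 2.
layer = CausalTemporalAttention(dim=320, num_heads=8)
frames = torch.randn(2, 16, 320)
print(layer(frames).shape)  # torch.Size([2, 16, 320])

Because future frames never influence earlier ones under this mask, previously generated frames remain unchanged when the sequence is extended, which is the intuition behind using such a mask to support longer video synthesis.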
Related papers
- WorldGPT: A Sora-Inspired Video AI Agent as Rich World Models from Text
and Image Inputs [53.21307319844615]
We present an innovative video generation AI agent that harnesses the power of Sora-inspired multimodal learning to build a skilled world-model framework.
The framework includes two parts: a prompt enhancer and full video translation.
arXiv Detail & Related papers (2024-03-10T16:09:02Z)
- Make Pixels Dance: High-Dynamic Video Generation [13.944607760918997]
State-of-the-art video generation methods tend to produce video clips with minimal motion despite maintaining high fidelity.
We introduce PixelDance, a novel approach that incorporates image instructions for both the first and last frames in conjunction with text instructions for video generation.
arXiv Detail & Related papers (2023-11-18T06:25:58Z)
- Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation [69.20173154096]
We develop a framework comprising two functional modules: Motion Structure Retrieval and Structure-Guided Text-to-Video Synthesis.
For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure.
For the second module, we propose a controllable video generation model that offers flexible controls over structure and characters.
arXiv Detail & Related papers (2023-07-13T17:57:13Z)
- Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning [50.60891619269651]
Control-A-Video is a controllable T2V diffusion model that can generate videos conditioned on text prompts and reference control maps like edge and depth maps.
We propose novel strategies to incorporate content prior and motion prior into the diffusion-based generation process.
Our framework generates higher-quality, more consistent videos compared to existing state-of-the-art methods in controllable text-to-video generation.
arXiv Detail & Related papers (2023-05-23T09:03:19Z)
- Style-A-Video: Agile Diffusion for Arbitrary Text-based Video Style Transfer [13.098901971644656]
This paper proposes a zero-shot video stylization method named Style-A-Video.
It uses a generative pre-trained transformer with an image latent diffusion model to achieve concise, text-controlled video stylization.
Tests show that we can attain superior content preservation and stylistic performance while consuming fewer resources than previous solutions.
arXiv Detail & Related papers (2023-05-09T14:03:27Z)
- Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods.
Our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23T17:01:59Z)
- Video Generation from Text Employing Latent Path Construction for Temporal Modeling [70.06508219998778]
Video generation is one of the most challenging tasks in the fields of Machine Learning and Computer Vision.
In this paper, we tackle the text-to-video generation problem, which is a conditional form of video generation.
We believe that video generation from natural language sentences will have an important impact on Artificial Intelligence.
arXiv Detail & Related papers (2021-07-29T06:28:20Z)
- Transcript to Video: Efficient Clip Sequencing from Texts [65.87890762420922]
We present Transcript-to-Video -- a weakly-supervised framework that uses texts as input to automatically create video sequences from an extensive collection of shots.
Specifically, we propose a Content Retrieval Module and a Temporal Coherent Module to learn visual-language representations and model shot sequencing styles.
For fast inference, we introduce an efficient search strategy for real-time video clip sequencing.
arXiv Detail & Related papers (2021-07-25T17:24:50Z)
- TiVGAN: Text to Image to Video Generation with Step-by-Step Evolutionary Generator [34.7504057664375]
We propose a novel training framework, Text-to-Image-to-Video Generative Adversarial Network (TiVGAN), which evolves frame-by-frame and finally produces a full-length video.
The step-by-step learning process helps stabilize training and enables the creation of high-resolution videos based on conditional text descriptions.
arXiv Detail & Related papers (2020-09-04T06:33:08Z)