VideoDreamer: Customized Multi-Subject Text-to-Video Generation with
Disen-Mix Finetuning
- URL: http://arxiv.org/abs/2311.00990v1
- Date: Thu, 2 Nov 2023 04:38:50 GMT
- Title: VideoDreamer: Customized Multi-Subject Text-to-Video Generation with
Disen-Mix Finetuning
- Authors: Hong Chen, Xin Wang, Guanning Zeng, Yipeng Zhang, Yuwei Zhou, Feilin
Han, Wenwu Zhu
- Abstract summary: VideoDreamer can generate temporally consistent text-guided videos that faithfully preserve the visual features of the given multiple subjects.
The video generator is further customized for the given multiple subjects by the proposed Disen-Mix Finetuning and Human-in-the-Loop Re-finetuning strategy.
- Score: 47.61090084143284
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Customized text-to-video generation aims to generate text-guided videos with
customized user-given subjects, which has gained increasing attention recently.
However, existing works are primarily limited to generating videos for a single
subject, leaving the more challenging problem of customized multi-subject
text-to-video generation largely unexplored. In this paper, we fill this gap
and propose a novel VideoDreamer framework. VideoDreamer can generate
temporally consistent text-guided videos that faithfully preserve the visual
features of the given multiple subjects. Specifically, VideoDreamer leverages
the pretrained Stable Diffusion with latent-code motion dynamics and temporal
cross-frame attention as the base video generator. The video generator is
further customized for the given multiple subjects by the proposed Disen-Mix
Finetuning and Human-in-the-Loop Re-finetuning strategy, which can tackle the
attribute binding problem of multi-subject generation. We also introduce
MultiStudioBench, a benchmark for evaluating customized multi-subject
text-to-video generation models. Extensive experiments demonstrate the
remarkable ability of VideoDreamer to generate videos with new content such as
new events and backgrounds, tailored to the customized multiple subjects. Our
project page is available at https://videodreamer23.github.io/.
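As a rough illustration of the base video generator described in the abstract, below is a minimal sketch of temporal cross-frame attention, in which each frame's attention layer reads keys and values from a shared anchor (first) frame so appearance stays consistent across frames. This is a hedged PyTorch sketch under stated assumptions, not the VideoDreamer implementation; the class name, tensor shapes, and the choice of the first frame as the anchor are all illustrative.

import torch
from torch import nn

class CrossFrameAttention(nn.Module):
    # Hypothetical module: inflates a spatial self-attention layer so that
    # every frame attends to the first frame of its video clip.
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        # x: (batch * num_frames, tokens, dim) latent features from the UNet
        bf, n, d = x.shape
        b = bf // num_frames
        q = self.to_q(x)
        # Keys/values come from the first frame of each clip, repeated for
        # every frame, so all frames share one appearance reference.
        anchor = x.view(b, num_frames, n, d)[:, :1].expand(-1, num_frames, -1, -1)
        anchor = anchor.reshape(bf, n, d)
        k, v = self.to_k(anchor), self.to_v(anchor)

        def split(t):
            # (bf, n, dim) -> (bf, heads, n, dim_per_head)
            return t.view(bf, n, self.num_heads, -1).transpose(1, 2)

        q, k, v = map(split, (q, k, v))
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = attn.softmax(dim=-1) @ v
        out = out.transpose(1, 2).reshape(bf, n, d)
        return self.to_out(out)

In a setup like the one the abstract describes, such a module would stand in for the spatial self-attention blocks of the pretrained Stable Diffusion UNet, while the latent-code motion dynamics would be applied separately to the initial noise latents; both points are assumptions here rather than confirmed details of the paper.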
Related papers
- DisenStudio: Customized Multi-subject Text-to-Video Generation with Disentangled Spatial Control [48.41743234012456]
DisenStudio is a novel framework that can generate text-guided videos for customized multiple subjects.
DisenStudio enhances a pretrained diffusion-based text-to-video model with our proposed spatial-disentangled cross-attention mechanism.
We conduct extensive experiments to demonstrate that our proposed DisenStudio significantly outperforms existing methods in various metrics.
arXiv Detail & Related papers (2024-05-21T13:44:55Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling its spatiotemporal dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects [61.323597069037056]
Current approaches for personalizing text-to-video generation struggle to handle multiple subjects.
We propose CustomVideo, a novel framework that can generate identity-preserving videos with the guidance of multiple subjects.
arXiv Detail & Related papers (2024-01-18T13:23:51Z) - MEVG: Multi-event Video Generation with Text-to-Video Models [18.06640097064693]
We introduce a novel diffusion-based video generation method that generates a video showing multiple events, given multiple individual sentences from the user.
Our method does not require a large-scale video dataset, since it uses a pre-trained text-to-video generative model without any fine-tuning.
Our proposed method is superior to other video-generative models in terms of temporal coherency of content and semantics.
arXiv Detail & Related papers (2023-12-07T06:53:25Z) - SEINE: Short-to-Long Video Diffusion Model for Generative Transition and
Prediction [93.26613503521664]
This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction.
We propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions.
Our model generates transition videos that ensure coherence and visual quality.
arXiv Detail & Related papers (2023-10-31T17:58:17Z) - VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning [62.51232333352754]
VideoDirectorGPT is a novel framework for consistent multi-scene video generation.
Our proposed framework substantially improves layout and movement control in both single- and multi-scene video generation.
arXiv Detail & Related papers (2023-09-26T17:36:26Z) - Gen-L-Video: Multi-Text to Long Video Generation via Temporal
Co-Denoising [43.35391175319815]
This study explores the potential of extending the text-driven ability to the generation and editing of multi-text conditioned long videos.
We introduce a novel paradigm dubbed Gen-L-Video, capable of extending off-the-shelf short video diffusion models.
Our experimental outcomes reveal that our approach significantly broadens the generative and editing capabilities of video diffusion models.
arXiv Detail & Related papers (2023-05-29T17:38:18Z)