Multi-subject Open-set Personalization in Video Generation
- URL: http://arxiv.org/abs/2501.06187v1
- Date: Fri, 10 Jan 2025 18:59:54 GMT
- Title: Multi-subject Open-set Personalization in Video Generation
- Authors: Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Yuwei Fang, Kwot Sin Lee, Ivan Skorokhodov, Kfir Aberman, Jun-Yan Zhu, Ming-Hsuan Yang, Sergey Tulyakov
- Abstract summary: We present Video Alchemist, a video model with built-in multi-subject, open-set personalization capabilities.
Our model is built on a new Diffusion Transformer module that fuses each conditional reference image and its corresponding subject-level text prompt.
Our method significantly outperforms existing personalization methods in both quantitative and qualitative evaluations.
- Score: 110.02124633005516
- Abstract: Video personalization methods allow us to synthesize videos with specific concepts such as people, pets, and places. However, existing methods often focus on limited domains, require time-consuming optimization per subject, or support only a single subject. We present Video Alchemist, a video model with built-in multi-subject, open-set personalization capabilities for both foreground objects and background, eliminating the need for time-consuming test-time optimization. Our model is built on a new Diffusion Transformer module that fuses each conditional reference image and its corresponding subject-level text prompt with cross-attention layers. Developing such a large model presents two main challenges: dataset and evaluation. First, as paired datasets of reference images and videos are extremely hard to collect, we sample frames of the target video as reference images and train the model to synthesize a clip of the same video. However, while models can easily denoise training videos given reference frames, they fail to generalize to new contexts. To mitigate this issue, we design a new automatic data construction pipeline with extensive image augmentations. Second, evaluating open-set video personalization is a challenge in itself. To address this, we introduce a personalization benchmark that focuses on accurate subject fidelity and supports diverse personalization scenarios. Finally, our extensive experiments show that our method significantly outperforms existing personalization methods in both quantitative and qualitative evaluations.
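The core module described in the abstract, a Diffusion Transformer layer whose cross-attention fuses each reference image with its subject-level prompt words, can be pictured with a short PyTorch sketch. This is only an illustrative reading of the abstract: the class name, token-concatenation fusion, and dimensions are assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

class SubjectCrossAttention(nn.Module):
    """Hypothetical cross-attention over per-subject conditioning tokens."""

    def __init__(self, dim: int = 1024, num_heads: int = 16):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    @staticmethod
    def fuse_subject(image_emb: torch.Tensor, word_emb: torch.Tensor) -> torch.Tensor:
        # image_emb: (B, N_img, dim) patch embeddings of one reference image
        # word_emb:  (B, N_txt, dim) embeddings of that subject's prompt words
        # One simple fusion choice: concatenate along the token axis.
        return torch.cat([image_emb, word_emb], dim=1)

    def forward(self, video_tokens: torch.Tensor, subject_tokens: list) -> torch.Tensor:
        # video_tokens:   (B, N_video, dim) latent tokens of the DiT backbone
        # subject_tokens: list of (B, N_s, dim) sequences, one per subject
        context = torch.cat(subject_tokens, dim=1)   # attend over all subjects at once
        out, _ = self.attn(self.norm(video_tokens), context, context)
        return video_tokens + out                    # residual update of video tokens

# Toy usage with random tensors and three personalized subjects:
layer = SubjectCrossAttention()
video = torch.randn(2, 256, 1024)
subjects = [layer.fuse_subject(torch.randn(2, 64, 1024), torch.randn(2, 8, 1024))
            for _ in range(3)]
print(layer(video, subjects).shape)                  # torch.Size([2, 256, 1024])
```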
Related papers
- CustomTTT: Motion and Appearance Customized Video Generation via Test-Time Training [35.43906754134253]
We propose CustomTTT, which lets us easily customize both the appearance and the motion of a given video jointly.
Since each LoRA is trained individually, we propose a novel test-time training technique to update their parameters after they are combined.
Our method outperforms several state-of-the-art works in both qualitative and quantitative evaluations.
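As a rough picture of what combining individually trained LoRAs and then updating them at test time could look like, here is a hedged PyTorch sketch; the class name, ranks, and scaling factors are assumptions for illustration, not CustomTTT's actual implementation.

```python
import torch
import torch.nn as nn

class DualLoRALinear(nn.Module):
    """Hypothetical frozen linear layer carrying two combined LoRA branches."""

    def __init__(self, base: nn.Linear, rank: int = 8,
                 alpha_app: float = 1.0, alpha_mot: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)              # pretrained weights stay frozen
        d_in, d_out = base.in_features, base.out_features
        # Appearance LoRA factors (low-rank update B_app @ A_app)
        self.A_app = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B_app = nn.Parameter(torch.zeros(d_out, rank))
        # Motion LoRA factors (low-rank update B_mot @ A_mot)
        self.A_mot = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B_mot = nn.Parameter(torch.zeros(d_out, rank))
        self.alpha_app, self.alpha_mot = alpha_app, alpha_mot

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.base(x)
        out = out + self.alpha_app * (x @ self.A_app.T) @ self.B_app.T
        out = out + self.alpha_mot * (x @ self.A_mot.T) @ self.B_mot.T
        return out

# A test-time training step would then backpropagate a denoising loss
# through the combined model, updating only the A_*/B_* factors.
```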
arXiv Detail & Related papers (2024-12-20T08:05:13Z)
- TweedieMix: Improving Multi-Concept Fusion for Diffusion-based Image/Video Generation [67.97044071594257]
TweedieMix is a novel method for composing customized diffusion models.
Our framework can be effortlessly extended to image-to-video diffusion models.
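One common way to compose several customized diffusion models is to blend their Tweedie denoised estimates in image space; the minimal sketch below illustrates that general idea under assumed region masks and model handles, and is not TweedieMix's exact procedure.

```python
import torch

def tweedie_x0(x_t, eps, alpha_bar_t):
    # Tweedie / DDPM posterior-mean estimate of the clean image:
    # x0_hat = (x_t - sqrt(1 - a_bar_t) * eps) / sqrt(a_bar_t)
    return (x_t - (1 - alpha_bar_t) ** 0.5 * eps) / alpha_bar_t ** 0.5

def mixed_noise_estimate(x_t, t, alpha_bar_t, models, masks):
    # models: concept-customized noise predictors, each callable as eps_i(x_t, t)
    # masks:  spatial masks with the same shape as x_t that sum to 1 everywhere
    x0_mix = sum(m * tweedie_x0(x_t, net(x_t, t), alpha_bar_t)
                 for net, m in zip(models, masks))
    # Convert the blended clean-image estimate back to an equivalent noise
    # estimate, which a standard sampler can then consume.
    return (x_t - alpha_bar_t ** 0.5 * x0_mix) / (1 - alpha_bar_t) ** 0.5
```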
arXiv Detail & Related papers (2024-10-08T01:06:01Z)
- JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation [49.997839600988875]
Existing personalization methods rely on finetuning a text-to-image foundation model on a user's custom dataset.
We propose Joint-Image Diffusion (JeDi), an effective technique for learning a finetuning-free personalization model.
Our model achieves state-of-the-art generation quality, both quantitatively and qualitatively, significantly outperforming both the prior finetuning-based and finetuning-free personalization baselines.
arXiv Detail & Related papers (2024-07-08T17:59:02Z)
- DisenStudio: Customized Multi-subject Text-to-Video Generation with Disentangled Spatial Control [48.41743234012456]
DisenStudio is a novel framework that can generate text-guided videos for multiple customized subjects.
DisenStudio enhances a pretrained diffusion-based text-to-video model with our proposed spatial-disentangled cross-attention mechanism.
We conduct extensive experiments to demonstrate that DisenStudio significantly outperforms existing methods on various metrics.
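A spatially disentangled cross-attention can be sketched as letting each subject's context tokens influence only that subject's layout region; the shapes and hard region masks below are illustrative assumptions, not DisenStudio's exact mechanism.

```python
import torch

def disentangled_cross_attn(q, subject_kv, region_masks):
    # q:            (B, N_pix, d) query tokens from the video latents
    # subject_kv:   list of (B, N_i, d) per-subject context tokens
    #               (used here as both keys and values for brevity)
    # region_masks: list of (B, N_pix) {0,1} masks, one per subject
    d = q.shape[-1]
    out = torch.zeros_like(q)
    for kv, region in zip(subject_kv, region_masks):
        attn = torch.softmax(q @ kv.transpose(1, 2) / d ** 0.5, dim=-1)
        # Each subject's attended features are written only inside that
        # subject's spatial region, keeping subjects disentangled.
        out = out + region.unsqueeze(-1) * (attn @ kv)
    return out
```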
arXiv Detail & Related papers (2024-05-21T13:44:55Z) - CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects [61.323597069037056]
Current approaches for personalizing text-to-video generation struggle to handle multiple subjects.
We propose CustomVideo, a novel framework that can generate identity-preserving videos with the guidance of multiple subjects.
arXiv Detail & Related papers (2024-01-18T13:23:51Z) - Multi-object Video Generation from Single Frame Layouts [84.55806837855846]
We propose a video generative framework capable of synthesizing global scenes with local objects.
Our framework is a non-trivial adaptation of image generation methods and is new to this field.
Our model has been evaluated on two widely-used video recognition benchmarks.
arXiv Detail & Related papers (2023-05-06T09:07:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.