DisenStudio: Customized Multi-subject Text-to-Video Generation with Disentangled Spatial Control
- URL: http://arxiv.org/abs/2405.12796v1
- Date: Tue, 21 May 2024 13:44:55 GMT
- Title: DisenStudio: Customized Multi-subject Text-to-Video Generation with Disentangled Spatial Control
- Authors: Hong Chen, Xin Wang, Yipeng Zhang, Yuwei Zhou, Zeyang Zhang, Siao Tang, Wenwu Zhu
- Abstract summary: DisenStudio is a novel framework that generates text-guided videos for multiple customized subjects.
It enhances a pretrained diffusion-based text-to-video model with a proposed spatial-disentangled cross-attention mechanism.
Extensive experiments demonstrate that DisenStudio significantly outperforms existing methods on various metrics.
- Score: 48.41743234012456
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Generating customized content in videos has received increasing attention recently. However, existing works primarily focus on customized text-to-video generation for a single subject and suffer from subject-missing and attribute-binding problems when the video is expected to contain multiple subjects. Furthermore, existing models struggle to assign the desired actions to the corresponding subjects (the action-binding problem), failing to achieve satisfactory multi-subject generation performance. To tackle these problems, we propose DisenStudio, a novel framework that can generate text-guided videos for multiple customized subjects, given a few images of each subject. Specifically, DisenStudio enhances a pretrained diffusion-based text-to-video model with our proposed spatial-disentangled cross-attention mechanism to associate each subject with the desired action. The model is then customized for the multiple subjects with the proposed motion-preserved disentangled finetuning, which involves three tuning strategies: multi-subject co-occurrence tuning, masked single-subject tuning, and multi-subject motion-preserved tuning. The first two strategies guarantee subject occurrence and preserve their visual attributes, and the third helps the model maintain its temporal motion-generation ability when finetuning on static images. We conduct extensive experiments demonstrating that DisenStudio significantly outperforms existing methods on various metrics. Additionally, we show that DisenStudio can serve as a powerful tool for various controllable generation applications.
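The abstract does not include implementation details, but the core idea of spatially disentangled cross-attention can be illustrated with a minimal sketch: each customized subject gets its own text embedding, and a binary region mask confines that subject's conditioning to its own spatial area. The function name, tensor shapes, and masking scheme below are assumptions for illustration, not the authors' code.

```python
# Minimal illustrative sketch (NOT the authors' implementation) of
# spatially disentangled cross-attention: each subject's text tokens
# condition only the image locations assigned to that subject.
import torch

def spatial_disentangled_cross_attention(queries, subject_keys, subject_values, region_masks):
    """
    queries:        (B, N, d)  image-patch queries from a diffusion U-Net layer
    subject_keys:   list of S tensors, each (B, L, d), text keys per subject prompt
    subject_values: list of S tensors, each (B, L, d), text values per subject prompt
    region_masks:   list of S tensors, each (N,), binary masks assumed to
                    partition the N spatial locations among the S subjects
    """
    d = queries.shape[-1]
    out = torch.zeros_like(queries)
    for keys, values, mask in zip(subject_keys, subject_values, region_masks):
        # Standard scaled dot-product cross-attention for this subject's tokens.
        attn = torch.softmax(queries @ keys.transpose(-2, -1) / d ** 0.5, dim=-1)  # (B, N, L)
        # Confine this subject's text conditioning to its own spatial region,
        # so its action words cannot bind to another subject's location.
        out = out + mask.view(1, -1, 1) * (attn @ values)
    return out
```

For example, with two subjects and disjoint left/right masks, the left half of the frame attends only to the first subject's prompt tokens, which is one plausible way to address the action-binding problem the abstract describes.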
Related papers
- CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities [56.5742116979914]
CustomCrafter preserves the model's motion-generation and concept-composition abilities without requiring additional videos or fine-tuning for recovery.
For motion generation, we observed that VDMs tend to restore a video's motion in the early stages of denoising, while focusing on recovering subject details in the later stages.
arXiv Detail & Related papers (2024-08-23T17:26:06Z)
- AutoStudio: Crafting Consistent Subjects in Multi-turn Interactive Image Generation [41.990464968024845]
We introduce a training-free multi-agent framework called AutoStudio for generating interactive images.
AutoStudio employs three agents based on large language models (LLMs) to handle interactions, along with a stable diffusion (SD) based agent for generating high-quality images.
Experiments on the public CMIGBench benchmark and human evaluations show that AutoStudio maintains multi-subject consistency across multiple turns well.
arXiv Detail & Related papers (2024-06-03T14:51:24Z)
- CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects [61.323597069037056]
Current approaches for personalizing text-to-video generation struggle to handle multiple subjects.
We propose CustomVideo, a novel framework that can generate identity-preserving videos with the guidance of multiple subjects.
arXiv Detail & Related papers (2024-01-18T13:23:51Z)
- Customizing Motion in Text-to-Video Diffusion Models [79.4121510826141]
We introduce an approach for augmenting text-to-video generation models with customized motions.
By leveraging a few video samples demonstrating specific movements as input, our method learns and generalizes the input motion patterns for diverse, text-specified scenarios.
arXiv Detail & Related papers (2023-12-07T18:59:03Z)
- VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning [47.61090084143284]
VideoDreamer can generate temporally consistent text-guided videos that faithfully preserve the visual features of the given multiple subjects.
The video generator is further customized for the given multiple subjects by the proposed Disen-Mix Finetuning and Human-in-the-Loop Re-finetuning strategy.
arXiv Detail & Related papers (2023-11-02T04:38:50Z)
- Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z)