Related papers: Magic-Me: Identity-Specific Video Customized Diffusion

Magic-Me: Identity-Specific Video Customized Diffusion

URL: http://arxiv.org/abs/2402.09368v2
Date: Wed, 20 Mar 2024 17:36:35 GMT
Title: Magic-Me: Identity-Specific Video Customized Diffusion
Authors: Ze Ma, Daquan Zhou, Chun-Hsiao Yeh, Xue-She Wang, Xiuyu Li, Huanrui Yang, Zhen Dong, Kurt Keutzer, Jiashi Feng,
Abstract summary: We propose a controllable subject identity controllable video generation framework, termed Video Custom Diffusion (VCD) With a specified identity defined by a few images, VCD reinforces the identity characteristics and injects frame-wise correlation for stable video outputs. We conducted extensive experiments to verify that VCD is able to generate stable videos with better ID over the baselines.
Score: 72.05925155000165
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Creating content with specified identities (ID) has attracted significant interest in the field of generative models. In the field of text-to-image generation (T2I), subject-driven creation has achieved great progress with the identity controlled via reference images. However, its extension to video generation is not well explored. In this work, we propose a simple yet effective subject identity controllable video generation framework, termed Video Custom Diffusion (VCD). With a specified identity defined by a few images, VCD reinforces the identity characteristics and injects frame-wise correlation at the initialization stage for stable video outputs. To achieve this, we propose three novel components that are essential for high-quality identity preservation and stable video generation: 1) a noise initialization method with 3D Gaussian Noise Prior for better inter-frame stability; 2) an ID module based on extended Textual Inversion trained with the cropped identity to disentangle the ID information from the background 3) Face VCD and Tiled VCD modules to reinforce faces and upscale the video to higher resolution while preserving the identity's features. We conducted extensive experiments to verify that VCD is able to generate stable videos with better ID over the baselines. Besides, with the transferability of the encoded identity in the ID module, VCD is also working well with personalized text-to-image models available publicly. The codes are available at https://github.com/Zhen-Dong/Magic-Me.

Related papers

Concat-ID: Towards Universal Identity-Preserving Video Synthesis [23.40342294656802]
We present Concat-ID, a unified framework for identity-preserving video synthesis. Concat-ID employs Autoencoders to extract image features, which are latent with video sequence latents. A novel cross-video pairing strategy and a multi-stage training regimen are introduced to balance consistency and facial editability.
arXiv Detail & Related papers (2025-03-18T11:17:32Z)
Phantom: Subject-consistent video generation via cross-modal alignment [16.777805813950486]
We propose a unified video generation framework for both single- and multi-subject references. The proposed method achieves high-fidelity subject-consistent video generation while addressing issues of image content leakage and multi-subject confusion.
arXiv Detail & Related papers (2025-02-16T11:02:50Z)
BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations [82.94002870060045]
Existing video generation models struggle to follow complex text prompts and synthesize multiple objects. We develop a blob-grounded video diffusion model named BlobGEN-Vid that allows users to control object motions and fine-grained object appearance. We show that our framework is model-agnostic and build BlobGEN-Vid based on both U-Net and DiT-based video diffusion models.
arXiv Detail & Related papers (2025-01-13T19:17:06Z)
Ingredients: Blending Custom Photos with Video Diffusion Transformers [31.736838809714726]
Ingredients is a framework to customize video creations incorporating multiple specific identity (ID) photos. It consists of three primary modules: (i) a facial extractor that captures versatile and precise facial features for each human ID from both global and local perspectives; (ii) a multi-scale projector that maps face embeddings into the contextual space of image query in video diffusion transformers; (iii) an ID router that dynamically combines and allocates multiple ID embedding to the corresponding space-time regions.
arXiv Detail & Related papers (2025-01-03T12:45:22Z)
PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation [36.21554597804604]
Identity-specific human video generation with customized ID images is still under-explored. We propose a novel framework, dubbed textbfPersonalVideo, that applies direct supervision on videos synthesized by the T2V model. Our method's superiority in delivering high identity faithfulness while preserving the inherent video generation qualities of the original T2V model, outshining prior approaches.
arXiv Detail & Related papers (2024-11-26T02:25:38Z)
ID-Sculpt: ID-aware 3D Head Generation from Single In-the-wild Portrait Image [57.46195661521239]
Previous text-based methods for generating 3D heads were limited by text descriptions and image-based methods struggled to produce high-quality head geometry. We propose a novel framework, ID-Sculpt, to generate high-quality 3D heads while preserving their identities. Extensive experiments demonstrate that we can generate high-quality 3D heads with accurate geometry and texture from a single in-the-wild portrait image.
arXiv Detail & Related papers (2024-06-24T15:11:35Z)
ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning [57.91881829308395]
Identity-preserving text-to-image generation (ID-T2I) has received significant attention due to its wide range of application scenarios like AI portrait and advertising. We present textbfID-Aligner, a general feedback learning framework to enhance ID-T2I performance.
arXiv Detail & Related papers (2024-04-23T18:41:56Z)
ID-Animator: Zero-Shot Identity-Preserving Human Video Generation [16.438935466843304]
ID-Animator is a zero-shot human-video generation approach that can perform personalized video generation given a single reference facial image without further training. Our method is highly compatible with popular pre-trained T2V models like animatediff and various community backbone models.
arXiv Detail & Related papers (2024-04-23T17:59:43Z)
StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text [58.49820807662246]
We introduce StreamingT2V, an autoregressive approach for long video generation of 80, 240, 600, 1200 or more frames with smooth transitions. Our code will be available at: https://github.com/Picsart-AI-Research/StreamingT2V.
arXiv Detail & Related papers (2024-03-21T18:27:29Z)
Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm [31.06269858216316]
We propose Infinite-ID, an ID-semantics decoupling paradigm for identity-preserved personalization. We introduce an identity-enhanced training, incorporating an additional image cross-attention module to capture sufficient ID information. We also introduce a feature interaction mechanism that combines a mixed attention module with an AdaIN-mean operation to seamlessly merge the two streams.
arXiv Detail & Related papers (2024-03-18T13:39:53Z)
Beyond Inserting: Learning Identity Embedding for Semantic-Fidelity Personalized Diffusion Generation [21.739328335601716]
This paper focuses on inserting accurate and interactive ID embedding into the Stable Diffusion Model for personalized generation. We propose a face-wise attention loss to fit the face region instead of entangling ID-unrelated information, such as face layout and background. Our results exhibit superior ID accuracy, text-based manipulation ability, and generalization compared to previous methods.
arXiv Detail & Related papers (2024-01-31T11:52:33Z)
StableIdentity: Inserting Anybody into Anywhere at First Sight [57.99693188913382]
We propose StableIdentity, which allows identity-consistent recontextualization with just one face image. We are the first to directly inject the identity learned from a single image into video/3D generation without finetuning.
arXiv Detail & Related papers (2024-01-29T09:06:15Z)
PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding [102.07914175196817]
PhotoMaker is an efficient personalized text-to-image generation method. It encodes an arbitrary number of input ID images into a stack ID embedding for preserving ID information.
arXiv Detail & Related papers (2023-12-07T17:32:29Z)
VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation [73.54366331493007]
VideoGen is a text-to-video generation approach, which can generate a high-definition video with high frame fidelity and strong temporal consistency. We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image with high content quality from the text prompt.
arXiv Detail & Related papers (2023-09-01T11:14:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.