Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers
- URL: http://arxiv.org/abs/2501.03931v1
- Date: Tue, 07 Jan 2025 16:48:31 GMT
- Title: Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers
- Authors: Yuechen Zhang, Yaoyang Liu, Bin Xia, Bohao Peng, Zexin Yan, Eric Lo, Jiaya Jia
- Abstract summary: We present Magic Mirror, a framework for generating identity-preserved videos with cinematic-level quality and dynamic motion. Our method introduces three key components: (1) a dual-branch facial feature extractor that captures both identity and structural features, (2) a lightweight cross-modal adapter with Conditioned Adaptive Normalization for efficient identity integration, and (3) a two-stage training strategy combining synthetic identity pairs with video data.
- Score: 42.910185323392554
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present Magic Mirror, a framework for generating identity-preserved videos with cinematic-level quality and dynamic motion. While recent advances in video diffusion models have shown impressive capabilities in text-to-video generation, maintaining consistent identity while producing natural motion remains challenging. Previous methods either require person-specific fine-tuning or struggle to balance identity preservation with motion diversity. Built upon Video Diffusion Transformers, our method introduces three key components: (1) a dual-branch facial feature extractor that captures both identity and structural features, (2) a lightweight cross-modal adapter with Conditioned Adaptive Normalization for efficient identity integration, and (3) a two-stage training strategy combining synthetic identity pairs with video data. Extensive experiments demonstrate that Magic Mirror effectively balances identity consistency with natural motion, outperforming existing methods across multiple metrics while adding minimal parameters. The code and model will be made publicly available at: https://github.com/dvlab-research/MagicMirror/
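The description of component (2) points at a DiT-style adaptive layer norm whose scale and shift are predicted from an identity embedding. Below is a minimal sketch of what such a Conditioned Adaptive Normalization layer could look like; the class name, dimensions, SiLU adapter, and zero initialization are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of an identity-conditioned adaptive layer norm,
# in the spirit of DiT-style adaLN. Not the Magic Mirror code.
import torch
import torch.nn as nn

class ConditionedAdaptiveNorm(nn.Module):
    """LayerNorm whose scale/shift are predicted from an identity embedding."""
    def __init__(self, hidden_dim: int, id_dim: int):
        super().__init__()
        # Normalize without a learned affine; the condition supplies it instead.
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # Lightweight adapter: identity embedding -> per-channel scale and shift.
        self.to_scale_shift = nn.Sequential(
            nn.SiLU(),
            nn.Linear(id_dim, 2 * hidden_dim),
        )
        # Zero-init so the layer starts as a plain LayerNorm (scale=1, shift=0).
        nn.init.zeros_(self.to_scale_shift[1].weight)
        nn.init.zeros_(self.to_scale_shift[1].bias)

    def forward(self, x: torch.Tensor, id_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, hidden_dim); id_emb: (batch, id_dim)
        scale, shift = self.to_scale_shift(id_emb).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

# Usage: modulate video tokens with a face-identity embedding.
x = torch.randn(2, 1024, 768)      # video tokens
id_emb = torch.randn(2, 512)       # identity feature from a face encoder
out = ConditionedAdaptiveNorm(768, 512)(x, id_emb)
```

Zero-initializing the adapter makes the layer behave as a plain LayerNorm before training, which is one way identity conditioning could be attached to a pretrained video DiT without perturbing it at initialization.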
Related papers
- Multi-identity Human Image Animation with Structural Video Diffusion [64.20452431561436]
We present Structural Video Diffusion, a novel framework for generating realistic multi-human videos.
Our approach introduces identity-specific embeddings to maintain consistent appearances across individuals.
We expand an existing human video dataset with 25K new videos featuring diverse multi-human and object interaction scenarios.
arXiv Detail & Related papers (2025-04-05T10:03:49Z)
- Concat-ID: Towards Universal Identity-Preserving Video Synthesis [23.40342294656802]
We present Concat-ID, a unified framework for identity-preserving video synthesis.
Concat-ID employs Variational Autoencoders to extract image features, which are concatenated with video latents along the sequence dimension.
A novel cross-video pairing strategy and a multi-stage training regimen are introduced to balance consistency and facial editability.
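To make the reported mechanism concrete, here is a toy sketch of sequence-dimension concatenation; all shapes and names are illustrative assumptions rather than Concat-ID's actual configuration.

```python
# Toy sketch: extend the video token sequence with reference-image tokens,
# so self-attention can attend between identity and video without an adapter.
import torch

batch, frames, channels, h, w = 1, 16, 4, 32, 32
video_latents = torch.randn(batch, frames * h * w, channels)  # video tokens
id_latents = torch.randn(batch, h * w, channels)              # reference-image tokens

tokens = torch.cat([video_latents, id_latents], dim=1)
print(tokens.shape)  # torch.Size([1, 17408, 4])
```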
arXiv Detail & Related papers (2025-03-18T11:17:32Z)
- SkyReels-A1: Expressive Portrait Animation in Video Diffusion Transformers [30.06494915665044]
We present SkyReels-A1, a framework built upon video diffusion Transformer to facilitate portrait image animation.
SkyReels-A1 capitalizes on the strong generative capabilities of video DiT, enhancing facial motion transfer precision, identity retention, and temporal coherence.
It is highly applicable to domains such as virtual avatars, remote communication, and digital media generation.
arXiv Detail & Related papers (2025-02-15T16:08:40Z)
- EchoVideo: Identity-Preserving Human Video Generation by Multimodal Feature Fusion [3.592206475366951]
Existing methods struggle with "copy-paste" artifacts and low similarity issues.
We propose EchoVideo, which integrates high-level semantic features from text to capture clean facial identity representations.
It achieves excellent results in generating videos with high quality, controllability, and fidelity.
arXiv Detail & Related papers (2025-01-23T08:06:11Z)
- DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation [54.30327187663316]
DiTCtrl is the first training-free multi-prompt video generation method built on MM-DiT architectures.
We analyze MM-DiT's attention mechanism, finding that its 3D full attention behaves similarly to the cross- and self-attention blocks in UNet-like diffusion models.
Building on this careful design, videos generated by DiTCtrl achieve smooth transitions and consistent object motion given multiple sequential prompts.
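A toy example illustrates the observation: one attention map over the concatenated [text; video] token sequence contains blocks that play the roles of the separate cross- and self-attention in UNet-based models. Token counts and dimensions below are assumptions.

```python
# Toy sketch: slice a full-attention map over concatenated text+video tokens
# into the regions that correspond to UNet-style cross/self-attention.
import torch
import torch.nn.functional as F

n_text, n_video, dim = 77, 256, 64
q = torch.randn(n_text + n_video, dim)
k = torch.randn(n_text + n_video, dim)

attn = F.softmax(q @ k.T / dim**0.5, dim=-1)   # full (text+video) attention

text_to_text   = attn[:n_text, :n_text]    # like text self-attention
video_to_text  = attn[n_text:, :n_text]    # like UNet cross-attention
video_to_video = attn[n_text:, n_text:]    # like UNet spatial/temporal self-attention
```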
arXiv Detail & Related papers (2024-12-24T18:51:19Z)
- VividFace: A Diffusion-Based Hybrid Framework for High-Fidelity Video Face Swapping [43.30061680192465]
We present the first diffusion-based framework specifically designed for video face swapping.
Our approach incorporates a specially designed diffusion model coupled with a VidFaceVAE.
Our framework achieves superior performance in identity preservation, temporal consistency, and visual quality compared to existing methods.
arXiv Detail & Related papers (2024-12-15T18:58:32Z)
- Motion by Queries: Identity-Motion Trade-offs in Text-to-Video Generation [47.61288672890036]
We investigate how self-attention query features govern motion, structure, and identity in text-to-video models.
Our analysis reveals that Q affects not only layout but also, during denoising, has a strong effect on subject identity.
We demonstrate two applications: (1) a zero-shot motion transfer method that is 20 times more efficient than existing approaches, and (2) a training-free technique for consistent multi-shot video generation.
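A hypothetical sketch of the query-injection idea behind application (1) follows; it is an illustrative toy with assumed shapes, not the paper's implementation.

```python
# Toy sketch: during denoising of a target video, reuse self-attention
# queries recorded from a source generation while keeping the target's K/V,
# so motion/layout follow the injected Q and appearance follows K/V.
import torch
import torch.nn.functional as F

tokens, dim = 256, 64
q_source = torch.randn(tokens, dim)   # queries saved from the source pass
k_target = torch.randn(tokens, dim)   # keys from the current (target) pass
v_target = torch.randn(tokens, dim)   # values from the current (target) pass

attn = F.softmax(q_source @ k_target.T / dim**0.5, dim=-1)
out = attn @ v_target
```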
arXiv Detail & Related papers (2024-12-10T18:49:39Z)
- MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation [55.95148886437854]
Memory-guided EMOtion-aware diffusion (MEMO) is an end-to-end audio-driven portrait animation approach for generating talking videos.
MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, audio-lip synchronization, identity consistency, and expression-emotion alignment.
arXiv Detail & Related papers (2024-12-05T18:57:26Z)
- Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks [25.39030226963548]
We introduce the first application of a pretrained transformer-based video generative model for portrait animation.
Our method is validated through experiments on benchmark and newly proposed wild datasets.
arXiv Detail & Related papers (2024-12-01T08:54:30Z)
- Motion Control for Enhanced Complex Action Video Generation [17.98485830881648]
Existing text-to-video (T2V) models often struggle to generate videos with sufficiently pronounced or complex actions.
We propose a novel framework, MVideo, designed to produce long-duration videos with precise, fluid actions.
MVideo overcomes the limitations of text prompts by incorporating mask sequences as an additional motion condition input.
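One plausible reading of this conditioning, sketched below under assumed shapes, is to downsample the per-frame masks to latent resolution and concatenate them with the noisy latents along the channel axis; MVideo's actual injection mechanism may differ.

```python
# Toy sketch: binary motion masks as an extra input channel to the denoiser.
import torch
import torch.nn.functional as F

batch, frames, c_lat, h, w = 1, 16, 4, 32, 32
noisy_latents = torch.randn(batch, frames, c_lat, h, w)
masks = torch.rand(batch, frames, 1, 256, 256).round()   # binary motion masks

# Downsample masks to the latent resolution.
masks_small = F.interpolate(masks.flatten(0, 1), size=(h, w), mode="nearest")
masks_small = masks_small.view(batch, frames, 1, h, w)

# The denoiser then sees 5 channels per frame instead of 4.
model_input = torch.cat([noisy_latents, masks_small], dim=2)
print(model_input.shape)  # torch.Size([1, 16, 5, 32, 32])
```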
arXiv Detail & Related papers (2024-11-13T04:20:45Z)
- Magic-Me: Identity-Specific Video Customized Diffusion [72.05925155000165]
We propose a subject-identity-controllable video generation framework, termed Video Custom Diffusion (VCD).
With a specified identity defined by a few images, VCD reinforces the identity characteristics and injects frame-wise correlation for stable video outputs.
We conducted extensive experiments to verify that VCD is able to generate stable videos with better identity preservation than the baselines.
arXiv Detail & Related papers (2024-02-14T18:13:51Z)
- MAGVIT: Masked Generative Video Transformer [129.50814875955444]
We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model.
A single MAGVIT model supports ten diverse generation tasks and generalizes across videos from different visual domains.
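For readers unfamiliar with the masked-modeling pattern MAGVIT builds on, a toy sketch follows; the vocabulary size, masking ratio, and tiny transformer are illustrative assumptions, not MAGVIT's tokenizer or architecture.

```python
# Toy sketch: replace a random subset of discrete video token ids with a
# [MASK] id and train a transformer to predict the originals.
import torch
import torch.nn as nn

vocab, mask_id, seq_len = 1024, 1024, 512   # last embedding row is [MASK]
tokens = torch.randint(0, vocab, (1, seq_len))            # tokenized video
keep = torch.rand(1, seq_len) > 0.6                       # mask ~60% of tokens
inputs = torch.where(keep, tokens, torch.full_like(tokens, mask_id))

embed = nn.Embedding(vocab + 1, 256)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(256, 8, batch_first=True), 2
)
head = nn.Linear(256, vocab)

logits = head(encoder(embed(inputs)))
loss = nn.functional.cross_entropy(
    logits[~keep], tokens[~keep]                          # loss only on masked slots
)
```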
arXiv Detail & Related papers (2022-12-10T04:26:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.