EchoVideo: Identity-Preserving Human Video Generation by Multimodal Feature Fusion
- URL: http://arxiv.org/abs/2501.13452v1
- Date: Thu, 23 Jan 2025 08:06:11 GMT
- Title: EchoVideo: Identity-Preserving Human Video Generation by Multimodal Feature Fusion
- Authors: Jiangchuan Wei, Shiyue Yan, Wenfeng Lin, Boyuan Liu, Renjie Chen, Mingyu Guo
- Abstract summary: Existing methods struggle with "copy-paste" artifacts and low similarity issues.
We propose EchoVideo, which integrates high-level semantic features from text to capture clean facial identity representations.
It achieves excellent results in generating videos with high quality, controllability, and fidelity.
- Score: 3.592206475366951
- Abstract: Recent advancements in video generation have significantly impacted various downstream applications, particularly in identity-preserving video generation (IPT2V). However, existing methods struggle with "copy-paste" artifacts and low similarity issues, primarily due to their reliance on low-level facial image information. This dependence can result in rigid facial appearances and artifacts reflecting irrelevant details. To address these challenges, we propose EchoVideo, which employs two key strategies: (1) an Identity Image-Text Fusion Module (IITF) that integrates high-level semantic features from text, capturing clean facial identity representations while discarding occlusions, poses, and lighting variations to avoid the introduction of artifacts; (2) a two-stage training strategy, incorporating a stochastic method in the second phase to randomly utilize shallow facial information. The objective is to balance the enhancements in fidelity provided by shallow features while mitigating excessive reliance on them. This strategy encourages the model to utilize high-level features during training, ultimately fostering a more robust representation of facial identities. EchoVideo effectively preserves facial identities and maintains full-body integrity. Extensive experiments demonstrate that it achieves excellent results in generating videos with high quality, controllability, and fidelity.
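To make the two strategies concrete, below is a minimal PyTorch-style sketch of how an identity-text fusion step and the stage-two stochastic use of shallow facial features could be wired up. The module structure, tensor dimensions, the simple projection-and-concatenate fusion, and the drop probability are illustrative assumptions; the abstract does not specify the actual implementation.

```python
# Hypothetical sketch -- NOT the authors' implementation.
# Illustrates (1) fusing a high-level face-identity embedding into the text
# conditioning, and (2) stochastically hiding shallow (pixel-level) facial
# features during a second training stage so the model cannot over-rely on them.
import torch
import torch.nn as nn


class IdentityTextFusion(nn.Module):
    """Toy stand-in for the IITF idea: project a face-identity vector and
    prepend it to the text-token sequence as an extra conditioning token."""

    def __init__(self, id_dim=512, text_dim=768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(id_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, id_embed, text_tokens):
        # id_embed: (B, id_dim), text_tokens: (B, T, text_dim)
        id_token = self.proj(id_embed).unsqueeze(1)        # (B, 1, text_dim)
        return torch.cat([id_token, text_tokens], dim=1)   # prepend identity token


def condition_inputs(id_embed, text_tokens, shallow_face_feat,
                     fusion, stage, p_drop_shallow=0.5, training=True):
    """In stage 2, randomly zero the shallow facial features so the generator
    is pushed to rely on the fused high-level identity representation."""
    cond = fusion(id_embed, text_tokens)
    if stage == 2 and training and torch.rand(()) < p_drop_shallow:
        shallow_face_feat = torch.zeros_like(shallow_face_feat)
    return cond, shallow_face_feat


if __name__ == "__main__":
    fusion = IdentityTextFusion()
    id_embed = torch.randn(2, 512)                 # high-level face identity
    text_tokens = torch.randn(2, 77, 768)          # text encoder output
    shallow = torch.randn(2, 4, 64, 64)            # e.g. latents of the face crop
    cond, shallow = condition_inputs(id_embed, text_tokens, shallow,
                                     fusion, stage=2)
    print(cond.shape, shallow.shape)
```

The point the sketch captures is that the shallow facial features are only intermittently visible to the generator in the second stage, which is one plausible way to realize the fidelity-versus-reliance trade-off the abstract describes.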
Related papers
- VividFace: A Diffusion-Based Hybrid Framework for High-Fidelity Video Face Swapping [43.30061680192465]
We present the first diffusion-based framework specifically designed for video face swapping.
Our approach incorporates a specially designed diffusion model coupled with a VidFaceVAE.
Our framework achieves superior performance in identity preservation, temporal consistency, and visual quality compared to existing methods.
arXiv Detail & Related papers (2024-12-15T18:58:32Z) - MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation [55.95148886437854]
Memory-guided EMOtion-aware diffusion (MEMO) is an end-to-end audio-driven portrait animation approach to generate talking videos.
MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, audio-lip synchronization, identity consistency, and expression-emotion alignment.
arXiv Detail & Related papers (2024-12-05T18:57:26Z) - HiFiVFS: High Fidelity Video Face Swapping [35.49571526968986]
Face swapping aims to generate results that combine the identity from the source with attributes from the target.
We propose a high fidelity video face swapping framework, which leverages the strong generative capability and temporal prior of Stable Video Diffusion.
Our method achieves state-of-the-art (SOTA) in video face swapping, both qualitatively and quantitatively.
arXiv Detail & Related papers (2024-11-27T12:30:24Z) - OSDFace: One-Step Diffusion Model for Face Restoration [72.5045389847792]
Diffusion models have demonstrated impressive performance in face restoration.
We propose OSDFace, a novel one-step diffusion model for face restoration.
Results demonstrate that OSDFace surpasses current state-of-the-art (SOTA) methods in both visual quality and quantitative metrics.
arXiv Detail & Related papers (2024-11-26T07:07:48Z) - ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning [57.91881829308395]
Identity-preserving text-to-image generation (ID-T2I) has received significant attention due to its wide range of application scenarios like AI portrait and advertising.
We present ID-Aligner, a general feedback learning framework to enhance ID-T2I performance.
arXiv Detail & Related papers (2024-04-23T18:41:56Z) - Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation [49.298187741014345]
Current methods intertwine spatial content and temporal dynamics, increasing the complexity of text-to-video generation (T2V).
We propose HiGen, a diffusion model-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives.
arXiv Detail & Related papers (2023-12-07T17:59:07Z) - Identity-Preserving Talking Face Generation with Landmark and Appearance Priors [106.79923577700345]
Existing person-generic methods have difficulty in generating realistic and lip-synced videos.
We propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures.
Our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
arXiv Detail & Related papers (2023-05-15T01:31:32Z) - Facial Expression Video Generation Based-On Spatio-temporal Convolutional GAN: FEV-GAN [1.279257604152629]
We present a novel approach for generating videos of the six basic facial expressions.
Our approach is based on spatio-temporal convolutional GANs, which are known to model both content and motion in the same network.
The code and the pre-trained model will soon be made publicly available.
arXiv Detail & Related papers (2022-10-20T11:54:32Z) - StyleFaceV: Face Video Generation via Decomposing and Recomposing Pretrained StyleGAN3 [43.43545400625567]
We propose a principled framework named StyleFaceV, which produces high-fidelity identity-preserving face videos with vivid movements.
Our core insight is to decompose appearance and pose information and recompose them in the latent space of StyleGAN3 to produce stable and dynamic results.
arXiv Detail & Related papers (2022-08-16T17:47:03Z) - Leveraging Real Talking Faces via Self-Supervision for Robust Forgery Detection [112.96004727646115]
We develop a method to detect face-manipulated videos using real talking faces.
We show that our method achieves state-of-the-art performance on cross-manipulation generalisation and robustness experiments.
Our results suggest that leveraging natural and unlabelled videos is a promising direction for the development of more robust face forgery detectors.
arXiv Detail & Related papers (2022-01-18T17:14:54Z) - Speech Fusion to Face: Bridging the Gap Between Human's Vocal Characteristics and Facial Imaging [19.285149134711382]
Facial image generation based on vocal characteristics from speech is one such important yet challenging task.
Existing solutions to the speech2face problem render limited image quality and fail to preserve facial similarity.
We propose Speech Fusion to Face, or SF2F, to address the issues of facial image quality and the weak connection between the vocal feature domain and modern image generation models.
arXiv Detail & Related papers (2020-06-10T15:19:31Z)