Removing Averaging: Personalized Lip-Sync Driven Characters Based on Identity Adapter
- URL: http://arxiv.org/abs/2503.06397v1
- Date: Sun, 09 Mar 2025 02:36:31 GMT
- Title: Removing Averaging: Personalized Lip-Sync Driven Characters Based on Identity Adapter
- Authors: Yanyu Zhu, Licheng Bai, Jintao Xu, Jiwei Tang, Hai-tao Zheng,
- Abstract summary: The "lip averaging" phenomenon occurs when a model fails to preserve subtle facial details when dubbing unseen in-the-wild videos. We propose UnAvgLip, which extracts identity embeddings from reference videos to generate highly faithful facial sequences.
- Score: 10.608872317957026
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in diffusion-based lip-syncing generative models have demonstrated their ability to produce highly synchronized talking face videos for visual dubbing. Although these models excel at lip synchronization, they often struggle to maintain fine-grained control over facial details in generated images. In this work, we identify a "lip averaging" phenomenon in which the model fails to preserve subtle facial details when dubbing unseen in-the-wild videos. This issue arises because the commonly used UNet backbone primarily integrates audio features into visual representations in the latent space via cross-attention mechanisms and multi-scale fusion, but it struggles to retain fine-grained lip details in the generated faces. To address this issue, we propose UnAvgLip, which extracts identity embeddings from reference videos to generate highly faithful facial sequences while maintaining accurate lip synchronization. Specifically, our method comprises two primary components: (1) an Identity Perceiver module that encodes facial embeddings to align with conditioned audio features; and (2) an ID-CrossAttn module that injects facial embeddings into the generation process, enhancing the model's capability for identity retention. Extensive experiments demonstrate that, at a modest training and inference cost, UnAvgLip effectively mitigates the "averaging" phenomenon in lip inpainting, significantly preserving unique facial characteristics while maintaining precise lip synchronization. Compared with the original approach, our method demonstrates significant improvements of 5% on the identity consistency metric and 2% on the SSIM metric across two benchmark datasets (HDTF and LRW).
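The abstract describes an ID-CrossAttn module that injects facial embeddings alongside the audio conditioning, but it does not give the exact formulation. The following is only a minimal sketch of one plausible reading, assuming a decoupled design in which the latent queries attend to audio tokens and to identity tokens separately and the two results are summed. The function names, the `id_scale` weight, and the omission of learned query/key/value projections are all assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)


def id_cross_attention(latent, audio_tokens, id_tokens, id_scale=1.0):
    """Hypothetical ID-conditioned block (an assumption, not the paper's code).

    The latent attends to audio tokens and to identity tokens in two separate
    branches; the identity branch is scaled by id_scale and added on top.
    Learned projections are omitted, so tokens act as both keys and values.
    """
    audio_out = attention(latent, audio_tokens, audio_tokens)
    id_out = attention(latent, id_tokens, id_tokens)
    return [
        [a + id_scale * i for a, i in zip(row_a, row_i)]
        for row_a, row_i in zip(audio_out, id_out)
    ]


def random_tokens(n, d):
    """Toy stand-in for real feature tensors."""
    return [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


random.seed(0)
latent = random_tokens(4, 8)   # 4 latent positions, feature width 8
audio = random_tokens(6, 8)    # 6 audio tokens
ident = random_tokens(3, 8)    # 3 identity tokens from a reference face
out = id_cross_attention(latent, audio, ident, id_scale=0.5)
print(len(out), len(out[0]))   # 4 8
```

In this sketch, setting `id_scale=0.0` reduces the block to audio-only cross-attention, which makes it easy to compare outputs with and without the identity branch.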
Related papers
- EchoVideo: Identity-Preserving Human Video Generation by Multimodal Feature Fusion [3.592206475366951]
Existing methods struggle with "copy-paste" artifacts and low similarity issues. We propose EchoVideo, which integrates high-level semantic features from text to capture clean facial identity representations. It achieves excellent results in generating high-quality videos with strong controllability and fidelity.
arXiv Detail & Related papers (2025-01-23T08:06:11Z) - EmojiDiff: Advanced Facial Expression Control with High Identity Preservation in Portrait Generation [8.314556078632412]
We introduce EmojiDiff, the first end-to-end solution that enables simultaneous control of extremely detailed expression (RGB-level) and high-fidelity identity in portrait generation.
For decoupled training, we innovate ID-irrelevant Data Iteration (IDI) to synthesize cross-identity expression pairs.
We also present ID-enhanced Contrast Alignment (ICA) for further fine-tuning.
arXiv Detail & Related papers (2024-12-02T08:24:11Z) - HiFiVFS: High Fidelity Video Face Swapping [35.49571526968986]
Face swapping aims to generate results that combine the identity from the source with attributes from the target. We propose a high fidelity video face swapping framework, which leverages the strong generative capability and temporal prior of Stable Video Diffusion. Our method achieves state-of-the-art (SOTA) in video face swapping, both qualitatively and quantitatively.
arXiv Detail & Related papers (2024-11-27T12:30:24Z) - PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation [36.21554597804604]
Identity-specific human video generation with customized ID images is still under-explored.
The key challenge lies in maintaining high ID fidelity consistently while preserving the original motion dynamics and semantic following.
We propose a novel framework, dubbed $\textbf{PersonalVideo}$, that applies a mixture of reward supervision on synthesized videos.
arXiv Detail & Related papers (2024-11-26T02:25:38Z) - RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network [48.95833484103569]
RealTalk is an audio-to-expression transformer and a high-fidelity expression-to-face framework.
In the first component, we consider both identity and intra-personal variation features related to speaking lip movements.
In the second component, we design a lightweight facial identity alignment (FIA) module.
This novel design allows us to generate fine details in real-time, without depending on sophisticated and inefficient feature alignment modules.
arXiv Detail & Related papers (2024-06-26T12:09:59Z) - SwapTalk: Audio-Driven Talking Face Generation with One-Shot Customization in Latent Space [13.59798532129008]
We propose an innovative unified framework, SwapTalk, which accomplishes both face swapping and lip synchronization tasks in the same latent space.
We introduce a novel identity consistency metric to more comprehensively assess the identity consistency over time series in generated facial videos.
Experimental results on the HDTF dataset demonstrate that our method significantly surpasses existing techniques in video quality, lip synchronization accuracy, face swapping fidelity, and identity consistency.
arXiv Detail & Related papers (2024-05-09T09:22:09Z) - Audio-Visual Speech Representation Expert for Enhanced Talking Face Video Generation and Evaluation [51.92522679353731]
We propose utilizing an audio-visual speech representation expert (AV-HuBERT) for calculating lip synchronization loss during training.
We introduce three novel lip synchronization evaluation metrics, aiming to provide a comprehensive assessment of lip synchronization performance.
arXiv Detail & Related papers (2024-05-07T13:55:50Z) - ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning [57.91881829308395]
Identity-preserving text-to-image generation (ID-T2I) has received significant attention due to its wide range of application scenarios like AI portrait and advertising.
We present ID-Aligner, a general feedback learning framework to enhance ID-T2I performance.
arXiv Detail & Related papers (2024-04-23T18:41:56Z) - When StyleGAN Meets Stable Diffusion: a $\mathscr{W}_+$ Adapter for Personalized Image Generation [60.305112612629465]
Text-to-image diffusion models have excelled in producing diverse, high-quality, and photo-realistic images.
We present a novel use of the extended StyleGAN embedding space $\mathcal{W}_+$ to achieve enhanced identity preservation and disentanglement for diffusion models.
Our method adeptly generates personalized text-to-image outputs that are not only compatible with prompt descriptions but also amenable to common StyleGAN editing directions.
arXiv Detail & Related papers (2023-11-29T09:05:14Z) - Audio-driven Talking Face Generation with Stabilized Synchronization Loss [60.01529422759644]
Talking face generation aims to create realistic videos with accurate lip synchronization and high visual quality.
We first tackle the lip leaking problem by introducing a silent-lip generator, which changes the lips of the identity reference to alleviate leakage.
Experiments show that our model outperforms state-of-the-art methods in both visual quality and lip synchronization.
arXiv Detail & Related papers (2023-07-18T15:50:04Z) - DVG-Face: Dual Variational Generation for Heterogeneous Face Recognition [85.94331736287765]
We formulate HFR as a dual generation problem, and tackle it via a novel Dual Variational Generation (DVG-Face) framework.
We integrate abundant identity information of large-scale visible data into the joint distribution.
Large numbers of new, diverse paired heterogeneous images with the same identity can be generated from noise.
arXiv Detail & Related papers (2020-09-20T09:48:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.