Assessing Identity Leakage in Talking Face Generation: Metrics and Evaluation Framework
- URL: http://arxiv.org/abs/2511.08613v1
- Date: Thu, 13 Nov 2025 01:00:41 GMT
- Title: Assessing Identity Leakage in Talking Face Generation: Metrics and Evaluation Framework
- Authors: Dogucan Yaman, Fevziye Irem Eyiokur, Hazım Kemal Ekenel, Alexander Waibel
- Abstract summary: Inpainting-based talking face generation aims to preserve video details such as pose, lighting, and gestures while modifying only lip motion. This mechanism can introduce lip leakage, where generated lips are influenced by the reference image rather than solely by the driving audio. We propose a systematic evaluation methodology to analyze and quantify lip leakage.
- Score: 56.30142869506262
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inpainting-based talking face generation aims to preserve video details such as pose, lighting, and gestures while modifying only lip motion, often using an identity reference image to maintain speaker consistency. However, this mechanism can introduce lip leakage, where generated lips are influenced by the reference image rather than solely by the driving audio. Such leakage is difficult to detect with standard metrics and conventional test setups. To address this, we propose a systematic evaluation methodology to analyze and quantify lip leakage. Our framework employs three complementary test setups: silent-input generation, mismatched audio-video pairing, and matched audio-video synthesis. We also introduce derived metrics, including a lip-sync discrepancy and silent-audio-based lip-sync scores. In addition, we study how different identity reference selections affect leakage, providing insights into reference design. The proposed methodology is model-agnostic and establishes a more reliable benchmark for future research in talking face generation.
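The evaluation harness described in the abstract is straightforward to prototype. The Python sketch below is one plausible reading of that description, not the authors' released code: `Clip`, `generate`, and `sync_confidence` are hypothetical stand-ins for a dataset sample, an inpainting-based talking face generator, and a SyncNet-style lip-sync scorer, and the two derived metrics are assembled from the three test setups as named in the abstract.

```python
# Minimal sketch of the three test setups and derived leakage metrics.
# NOT the authors' code: `generate` and `sync_confidence` are hypothetical
# stubs standing in for a talking face generator and a SyncNet-style scorer.
from dataclasses import dataclass


@dataclass
class Clip:
    video: str  # path to the identity/reference video
    audio: str  # path to its original (matched) audio track


def generate(reference: Clip, driving_audio: str) -> str:
    """Stub: run an inpainting-based talking face model, return output path."""
    return f"generated_{driving_audio}.mp4"  # placeholder


def sync_confidence(video: str, audio: str) -> float:
    """Stub: SyncNet-style lip-sync confidence (higher = better sync)."""
    return 0.0  # plug in a real scorer here


def evaluate_leakage(ref: Clip, silent_audio: str, other_audio: str) -> dict:
    # Setup 1: silent-input generation. With no speech to drive the lips,
    # any residual sync against the reference's own audio points to leakage.
    silent_out = generate(ref, silent_audio)
    silent_score = sync_confidence(silent_out, ref.audio)

    # Setup 2: mismatched audio-video pairing. Lips should follow the new
    # audio; a leaking model keeps drifting toward the reference lips.
    mismatch_out = generate(ref, other_audio)
    mismatch_score = sync_confidence(mismatch_out, other_audio)

    # Setup 3: matched audio-video synthesis (the conventional benchmark).
    matched_out = generate(ref, ref.audio)
    matched_score = sync_confidence(matched_out, ref.audio)

    return {
        # Derived metric: a large matched-vs-mismatched gap suggests the
        # lips are driven by the reference image rather than the audio.
        "lip_sync_discrepancy": matched_score - mismatch_score,
        # Derived metric: sync under silent audio should be near chance.
        "silent_audio_sync": silent_score,
    }


if __name__ == "__main__":
    ref = Clip(video="ref.mp4", audio="ref.wav")
    print(evaluate_leakage(ref, silent_audio="silence.wav", other_audio="other.wav"))
```

Because the proposed methodology is model-agnostic, only the two stubs would change when benchmarking a different generator or lip-sync scorer.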
Related papers
- LPIPS-AttnWav2Lip: Generic Audio-Driven lip synchronization for Talking Head Generation in the Wild [9.682333912273906]
This paper proposes a generic method, LPIPS-AttnWav2Lip, for reconstructing face images of any speaker based on audio.
The proposed method achieves outstanding performance in terms of lip synchronization accuracy and visual quality.
arXiv Detail & Related papers (2026-01-30T08:02:49Z)
- Identity-Preserving Video Dubbing Using Motion Warping [26.10803670509977]
Video dubbing aims to synthesize realistic, lip-synced videos from a reference video and a driving audio signal.
We propose IPTalker, a framework for video dubbing that achieves seamless alignment between driving audio and reference identity.
IPTalker consistently outperforms existing approaches in terms of realism, lip synchronization, and identity retention.
arXiv Detail & Related papers (2025-01-08T16:06:21Z)
- Audio-Visual Speech Representation Expert for Enhanced Talking Face Video Generation and Evaluation [51.92522679353731]
We propose utilizing an audio-visual speech representation expert (AV-HuBERT) to compute a lip synchronization loss during training.
We introduce three novel lip synchronization evaluation metrics, aiming to provide a comprehensive assessment of lip synchronization performance.
arXiv Detail & Related papers (2024-05-07T13:55:50Z)
- Audio-driven Talking Face Generation with Stabilized Synchronization Loss [60.01529422759644]
Talking face generation aims to create realistic videos with accurate lip synchronization and high visual quality.
We first tackle the lip leaking problem by introducing a silent-lip generator, which changes the lips of the identity reference to alleviate leakage.
Experiments show that our model outperforms state-of-the-art methods in both visual quality and lip synchronization.
arXiv Detail & Related papers (2023-07-18T15:50:04Z)
- Identity-Preserving Talking Face Generation with Landmark and Appearance Priors [106.79923577700345]
Existing person-generic methods have difficulty in generating realistic and lip-synced videos.
We propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures.
Our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
arXiv Detail & Related papers (2023-05-15T01:31:32Z)
- Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert [89.07178484337865]
Talking face generation, also known as speech-to-lip generation, reconstructs the lip motions of a face from coherent speech input.
Previous studies revealed the importance of lip-speech synchronization and visual quality.
We propose using a lip-reading expert to improve the intelligibility of the generated lip regions.
arXiv Detail & Related papers (2023-03-29T07:51:07Z)
- Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers [91.00397473678088]
Previous studies have explored generating accurately lip-synced talking faces for arbitrary targets given audio conditions.
We propose the Audio-Visual Context-Aware Transformer (AV-CAT) framework, which produces accurate lip-sync with photo-realistic quality.
Our model can generate high-fidelity lip-synced results for arbitrary subjects.
arXiv Detail & Related papers (2022-12-09T16:32:46Z)
- SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory [27.255990661166614]
The challenge of talking face generation from speech lies in aligning two different modalities, audio and video, such that the mouth region corresponds to the input audio.
Previous methods either exploit audio-visual representation learning or leverage intermediate structural information such as landmarks and 3D models.
We propose Audio-Lip Memory that brings in visual information of the mouth region corresponding to input audio and enforces fine-grained audio-visual coherence.
arXiv Detail & Related papers (2022-11-02T07:17:49Z)
- Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation [96.66010515343106]
We propose a clean yet effective framework to generate pose-controllable talking faces.
We operate on raw face images, using only a single photo as an identity reference.
Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
arXiv Detail & Related papers (2021-04-22T15:10:26Z)