Audio-Visual Speech Representation Expert for Enhanced Talking Face Video Generation and Evaluation
- URL: http://arxiv.org/abs/2405.04327v1
- Date: Tue, 7 May 2024 13:55:50 GMT
- Title: Audio-Visual Speech Representation Expert for Enhanced Talking Face Video Generation and Evaluation
- Authors: Dogucan Yaman, Fevziye Irem Eyiokur, Leonard Bärmann, Seymanur Aktı, Hazım Kemal Ekenel, Alexander Waibel,
- Abstract summary: We propose utilizing an audio-visual speech representation expert (AV-HuBERT) for calculating lip synchronization loss during training.
We introduce three novel lip synchronization evaluation metrics, aiming to provide a comprehensive assessment of lip synchronization performance.
- Score: 51.92522679353731
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the task of talking face generation, the objective is to generate a face video with lips synchronized to the corresponding audio while preserving visual details and identity information. Current methods face the challenge of learning accurate lip synchronization while avoiding detrimental effects on visual quality, as well as robustly evaluating such synchronization. To tackle these problems, we propose utilizing an audio-visual speech representation expert (AV-HuBERT) for calculating lip synchronization loss during training. Moreover, leveraging AV-HuBERT's features, we introduce three novel lip synchronization evaluation metrics, aiming to provide a comprehensive assessment of lip synchronization performance. Experimental results, along with a detailed ablation study, demonstrate the effectiveness of our approach and the utility of the proposed evaluation metrics.
Related papers
- MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting [12.852715177163608]
MuseTalk generates lip-sync targets in a latent space encoded by a Variational Autoencoder.
It supports the online generation of face at 256x256 at more than 30 FPS with negligible starting latency.
arXiv Detail & Related papers (2024-10-14T03:22:26Z) - RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network [48.95833484103569]
RealTalk is an audio-to-expression transformer and a high-fidelity expression-to-face framework.
In the first component, we consider both identity and intra-personal variation features related to speaking lip movements.
In the second component, we design a lightweight facial identity alignment (FIA) module.
This novel design allows us to generate fine details in real-time, without depending on sophisticated and inefficient feature alignment modules.
arXiv Detail & Related papers (2024-06-26T12:09:59Z) - SwapTalk: Audio-Driven Talking Face Generation with One-Shot Customization in Latent Space [13.59798532129008]
We propose an innovative unified framework, SwapTalk, which accomplishes both face swapping and lip synchronization tasks in the same latent space.
We introduce a novel identity consistency metric to more comprehensively assess the identity consistency over time series in generated facial videos.
Experimental results on the HDTF demonstrate that our method significantly surpasses existing techniques in video quality, lip synchronization accuracy, face swapping fidelity, and identity consistency.
arXiv Detail & Related papers (2024-05-09T09:22:09Z) - SAiD: Speech-driven Blendshape Facial Animation with Diffusion [6.4271091365094515]
Speech-driven 3D facial animation is challenging due to the scarcity of large-scale visual-audio datasets.
We propose a speech-driven 3D facial animation with a diffusion model (SAiD), a lightweight Transformer-based U-Net with a cross-modality alignment bias between audio and visual to enhance lip synchronization.
arXiv Detail & Related papers (2023-12-25T04:40:32Z) - Audio-driven Talking Face Generation with Stabilized Synchronization Loss [60.01529422759644]
Talking face generation aims to create realistic videos with accurate lip synchronization and high visual quality.
We first tackle the lip leaking problem by introducing a silent-lip generator, which changes the lips of the identity reference to alleviate leakage.
Experiments show that our model outperforms state-of-the-art methods in both visual quality and lip synchronization.
arXiv Detail & Related papers (2023-07-18T15:50:04Z) - Seeing What You Said: Talking Face Generation Guided by a Lip Reading
Expert [89.07178484337865]
Talking face generation, also known as speech-to-lip generation, reconstructs facial motions concerning lips given coherent speech input.
Previous studies revealed the importance of lip-speech synchronization and visual quality.
We propose using a lip-reading expert to improve the intelligibility of the generated lip regions.
arXiv Detail & Related papers (2023-03-29T07:51:07Z) - SimulLR: Simultaneous Lip Reading Transducer with Attention-Guided
Adaptive Memory [61.44510300515693]
We study the task of simultaneous lip and devise SimulLR, a simultaneous lip Reading transducer with attention-guided adaptive memory.
The experiments show that the SimulLR achieves the translation speedup 9.10 times times compared with the state-of-the-art non-simultaneous methods.
arXiv Detail & Related papers (2021-08-31T05:54:16Z) - Mutual Information Maximization for Effective Lip Reading [99.11600901751673]
We propose to introduce the mutual information constraints on both the local feature's level and the global sequence's level.
By combining these two advantages together, the proposed method is expected to be both discriminative and robust for effective lip reading.
arXiv Detail & Related papers (2020-03-13T18:47:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.