Audio-Driven Talking Face Video Generation with Joint Uncertainty Learning
- URL: http://arxiv.org/abs/2504.18810v1
- Date: Sat, 26 Apr 2025 05:45:38 GMT
- Title: Audio-Driven Talking Face Video Generation with Joint Uncertainty Learning
- Authors: Yifan Xie, Fei Ma, Yi Bin, Ying He, Fei Yu
- Abstract summary: We propose a Joint Uncertainty Learning Network (JULNet) for high-quality talking face video generation. We first design an uncertainty module to individually predict the error map and uncertainty map after obtaining the generated image. By jointly optimizing error and uncertainty, the performance and robustness of our model can be enhanced.
- Score: 11.551314848756107
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Talking face video generation with arbitrary speech audio is a significant challenge within the realm of digital human technology. Previous studies have emphasized audio-lip synchronization and visual quality, but little attention has been paid to learning visual uncertainty, which leads to several issues in existing systems, including inconsistent visual quality and unreliable performance across different input conditions. To address this problem, we propose a Joint Uncertainty Learning Network (JULNet) for high-quality talking face video generation, which incorporates a representation of uncertainty that is directly tied to visual error. Specifically, we first design an uncertainty module that predicts an error map and an uncertainty map once the generated image is obtained. The error map captures the difference between the generated image and the ground truth image, while the uncertainty map predicts the probability of an incorrect estimate. Furthermore, to match the uncertainty distribution to the error distribution via a KL divergence term, we introduce a histogram technique to approximate the two distributions. By jointly optimizing error and uncertainty, the performance and robustness of our model are enhanced. Extensive experiments demonstrate that our method achieves higher fidelity and better audio-lip synchronization in talking face video generation than previous methods.
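The abstract outlines a pipeline that is easy to mock up: compute a per-pixel error map against the ground truth, have a head predict an uncertainty map, regress one onto the other, and align their value distributions with a KL term over histogram approximations. Below is a minimal sketch of such a loss, not the paper's code; the soft-histogram formulation, function names, bin count, and kernel bandwidth are our own choices.

```python
import torch
import torch.nn.functional as F

def soft_histogram(x, bins=64, lo=0.0, hi=1.0, bandwidth=0.02):
    """Differentiable histogram: Gaussian-kernel soft assignment of the values
    in x to `bins` centers on [lo, hi], normalized to sum to 1."""
    centers = torch.linspace(lo, hi, bins, device=x.device)
    d = x.reshape(-1, 1) - centers.reshape(1, -1)   # (N, bins) distances
    weights = torch.exp(-0.5 * (d / bandwidth) ** 2)
    hist = weights.sum(dim=0)
    return hist / (hist.sum() + 1e-8)

def joint_uncertainty_loss(generated, target, uncertainty, bins=64):
    """Supervise a predicted uncertainty map with the actual error map and
    align their distributions with a KL divergence term."""
    # Per-pixel error map: absolute difference, averaged over color channels.
    error = (generated - target).abs().mean(dim=1, keepdim=True)
    # Map-level term: uncertainty should regress the (detached) error.
    map_loss = F.l1_loss(uncertainty, error.detach())
    # Distribution-level term: KL(error histogram || uncertainty histogram).
    p = soft_histogram(error.detach(), bins)
    q = soft_histogram(uncertainty, bins)
    kl = torch.sum(p * (torch.log(p + 1e-8) - torch.log(q + 1e-8)))
    return map_loss + kl
```

In a full system the uncertainty map would presumably come from a small prediction head on the generator's features, and this term would be added to the usual reconstruction and synchronization objectives.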
Related papers
- ViLU: Learning Vision-Language Uncertainties for Failure Prediction [28.439422629957424]
We introduce ViLU, a new Vision-Language Uncertainty quantification framework. ViLU constructs an uncertainty-aware multi-modal representation by integrating the visual embedding, the predicted textual embedding, and an image-conditioned textual representation via cross-attention. Our proposed approach is well-suited for post-hoc settings, where only vision and text embeddings are available without direct access to the model itself.
arXiv Detail & Related papers (2025-07-10T10:41:13Z)
- AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z)
- Audio-Visual Speech Representation Expert for Enhanced Talking Face Video Generation and Evaluation [51.92522679353731]
We propose utilizing an audio-visual speech representation expert (AV-HuBERT) for calculating lip synchronization loss during training.
We introduce three novel lip synchronization evaluation metrics, aiming to provide a comprehensive assessment of lip synchronization performance.
arXiv Detail & Related papers (2024-05-07T13:55:50Z)
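The entry above describes scoring lip synchronization with a frozen audio-visual expert; the core of such a loss is small enough to sketch. A minimal version, assuming time-aligned (B, T, D) feature tensors already extracted by the expert's audio and visual streams (the AV-HuBERT encoders themselves are not shown and would be loaded from their released checkpoint):

```python
import torch.nn.functional as F

def sync_loss(audio_feats, video_feats):
    """Cosine-distance lip-sync loss between time-aligned audio and visual
    features from a frozen audio-visual expert; lower = better synchronization.

    audio_feats, video_feats: (B, T, D) embeddings in a shared space.
    """
    a = F.normalize(audio_feats, dim=-1)
    v = F.normalize(video_feats, dim=-1)
    cos = (a * v).sum(dim=-1)      # (B, T) per-frame cosine similarity
    return (1.0 - cos).mean()      # minimized when the streams agree frame by frame
```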
- A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition [53.800937914403654]
Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames.
While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input.
We propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality.
arXiv Detail & Related papers (2024-03-07T06:06:55Z)
- Answering from Sure to Uncertain: Uncertainty-Aware Curriculum Learning for Video Question Answering [63.12469700986452]
We introduce the concept of uncertainty-aware curriculum learning (CL).
Here, uncertainty serves as the guiding principle for dynamically adjusting the difficulty.
In practice, we seamlessly integrate the VideoQA model into our framework and conduct comprehensive experiments.
arXiv Detail & Related papers (2024-01-03T02:29:34Z)
- Realistic Speech-to-Face Generation with Speech-Conditioned Latent Diffusion Model with Face Prior [13.198105709331617]
We propose a novel speech-to-face generation framework, which leverages a Speech-Conditioned Latent Diffusion Model, called SCLDM.
This is the first work to harness the exceptional modeling capabilities of diffusion models for speech-to-face generation.
We show that our method can produce more realistic face images while preserving the identity of the speaker better than state-of-the-art methods.
arXiv Detail & Related papers (2023-10-05T07:44:49Z)
- Audio-driven Talking Face Generation with Stabilized Synchronization Loss [60.01529422759644]
Talking face generation aims to create realistic videos with accurate lip synchronization and high visual quality.
We first tackle the lip leaking problem by introducing a silent-lip generator, which changes the lips of the identity reference to alleviate leakage.
Experiments show that our model outperforms state-of-the-art methods in both visual quality and lip synchronization.
arXiv Detail & Related papers (2023-07-18T15:50:04Z)
- Combating Uncertainty and Class Imbalance in Facial Expression Recognition [4.306007841758853]
We propose a framework based on ResNet and attention to address these problems.
Our method surpasses most baseline methods in terms of accuracy on facial expression datasets.
arXiv Detail & Related papers (2022-12-15T12:09:02Z)
- Multi-Contextual Predictions with Vision Transformer for Video Anomaly Detection [22.098399083491937]
Understanding the temporal context of a video plays a vital role in anomaly detection.
We design a transformer model with three different contextual prediction streams: masked, whole and partial.
By learning to predict the missing frames of consecutive normal frames, our model can effectively learn various normality patterns in the video.
arXiv Detail & Related papers (2022-06-17T05:54:31Z)
- Scene Uncertainty and the Wellington Posterior of Deterministic Image Classifiers [68.9065881270224]
We introduce the Wellington Posterior, which is the distribution of outcomes that would have been obtained in response to data that could have been generated by the same scene.
We explore the use of data augmentation, dropout, ensembling, single-view reconstruction, and model linearization to compute a Wellington Posterior.
Additional methods include the use of conditional generative models such as generative adversarial networks, neural radiance fields, and conditional prior networks.
arXiv Detail & Related papers (2021-06-25T20:10:00Z)
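The entry above lists data augmentation as one route to a Wellington Posterior; that variant is simple enough to sketch. A minimal version of our own, assuming a standard classifier such as a torchvision model and our own choice of scene-preserving perturbations, not the paper's implementation:

```python
import torch
import torchvision.transforms as T

def wellington_posterior(model, image, n_samples=64):
    """Empirical distribution over predicted classes under scene-preserving
    perturbations of a single image (the data-augmentation approximation).

    image: (C, H, W) tensor in [0, 1]; returns a (num_classes,) probability vector.
    """
    augment = T.Compose([
        T.RandomResizedCrop(image.shape[-2:], scale=(0.8, 1.0)),
        T.ColorJitter(brightness=0.2, contrast=0.2),
    ])
    model.eval()
    with torch.no_grad():
        batch = torch.stack([augment(image) for _ in range(n_samples)])
        logits = model(batch)                # (n_samples, num_classes)
        preds = logits.argmax(dim=-1)        # one predicted class per perturbation
    counts = torch.bincount(preds, minlength=logits.shape[-1])
    return counts.float() / n_samples        # spread-out mass = high scene uncertainty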
- Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation [73.1652905564163]
We address the problem of separating individual speech signals from videos using audio-visual neural processing.
Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video.
We propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams.
arXiv Detail & Related papers (2021-03-25T15:39:12Z)
- Speech Prediction in Silent Videos using Variational Autoencoders [29.423462898526605]
We present a model for generating speech in a silent video.
The proposed model combines recurrent neural networks and variational deep generative models to learn the conditional distribution of the auditory signal.
We demonstrate the performance of our model on the GRID dataset based on standard benchmarks.
arXiv Detail & Related papers (2020-11-14T17:09:03Z)