A Lip Sync Expert Is All You Need for Speech to Lip Generation In The
Wild
- URL: http://arxiv.org/abs/2008.10010v1
- Date: Sun, 23 Aug 2020 11:01:25 GMT
- Title: A Lip Sync Expert Is All You Need for Speech to Lip Generation In The
Wild
- Authors: K R Prajwal, Rudrabha Mukhopadhyay, Vinay Namboodiri, C V Jawahar
- Abstract summary: We investigate lip-syncing a talking face video of an arbitrary identity to match a target speech segment.
We identify the key reasons why current approaches fail on unconstrained videos and resolve them by learning from a powerful lip-sync discriminator.
We propose new, rigorous evaluation benchmarks and metrics to accurately measure lip synchronization in unconstrained videos.
- Score: 37.37319356008348
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we investigate the problem of lip-syncing a talking face video
of an arbitrary identity to match a target speech segment. Current works excel
at producing accurate lip movements on a static image or videos of specific
people seen during the training phase. However, they fail to accurately morph
the lip movements of arbitrary identities in dynamic, unconstrained talking
face videos, resulting in significant parts of the video being out-of-sync with
the new audio. We identify the key reasons for this failure and resolve them by
learning from a powerful lip-sync discriminator. Next, we propose new,
rigorous evaluation benchmarks and metrics to accurately measure lip
synchronization in unconstrained videos. Extensive quantitative evaluations on
our challenging benchmarks show that the lip-sync accuracy of the videos
generated by our Wav2Lip model is almost as good as real synced videos. We
provide a demo video clearly showing the substantial impact of our Wav2Lip
model and evaluation benchmarks on our website:
\url{cvit.iiit.ac.in/research/projects/cvit-projects/a-lip-sync-expert-is-all-you-need-for-speech-to-lip-generation-in-the-wild}.
The code and models are released at this GitHub repository:
\url{github.com/Rudrabha/Wav2Lip}. You can also try out the interactive demo at
this link: \url{bhaasha.iiit.ac.in/lipsync}.
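As a concrete illustration of the paper's central idea of learning from a strong lip-sync discriminator, below is a minimal PyTorch sketch of an "expert" sync loss: a frozen, pretrained sync model scores how well generated mouth crops match the target audio, and the generator is penalised when that score is low. The encoder architectures, tensor shapes, and names here are illustrative assumptions, not the released Wav2Lip code (see \url{github.com/Rudrabha/Wav2Lip} for the official implementation).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch only: the encoders and shapes are placeholders standing in
# for a real pretrained lip-sync "expert" (e.g. a SyncNet-style model).

class SyncExpert(nn.Module):
    """Maps a mel-spectrogram window and a stack of lip crops into a shared space."""
    def __init__(self, dim=512):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))
        self.video_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))

    def forward(self, mel, lips):
        a = F.normalize(self.audio_enc(mel), dim=-1)
        v = F.normalize(self.video_enc(lips), dim=-1)
        # Cosine similarity rescaled to [0, 1], read as "probability of being in sync".
        return (F.cosine_similarity(a, v) + 1.0) / 2.0

def expert_sync_loss(expert, mel, generated_lips):
    """Generator penalty: the frozen expert should judge generated lips as in sync."""
    prob = expert(mel, generated_lips).clamp(1e-7, 1 - 1e-7)
    return F.binary_cross_entropy(prob, torch.ones_like(prob))

# Usage sketch: the total generator objective mixes pixel reconstruction with the
# sync penalty, e.g. total = l1_loss + lambda_sync * expert_sync_loss(expert, mel, fake_lips).
```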
Related papers
- Style-Preserving Lip Sync via Audio-Aware Style Reference [88.02195932723744]
Individuals exhibit distinct lip shapes when speaking the same utterance, owing to their unique speaking styles.
We develop an advanced Transformer-based model adept at predicting lip motion corresponding to the input audio, augmented by style information aggregated through cross-attention layers from a style reference video.
Experiments validate the efficacy of the proposed approach in achieving precise lip sync, preserving speaking styles, and generating high-fidelity, realistic talking face videos.
arXiv Detail & Related papers (2024-08-10T02:46:11Z)
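To make the style-aggregation step concrete, here is a minimal PyTorch sketch of cross-attention in which audio features query a style-reference sequence; the module name, dimensions, and residual fusion are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Minimal sketch, under assumed shapes: audio features act as queries, the
# style-reference features as keys and values, and the result is fused back
# into the audio stream.

class StyleCrossAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_feats, style_feats):
        # audio_feats: (B, T_audio, dim); style_feats: (B, T_style, dim)
        style_ctx, _ = self.attn(audio_feats, style_feats, style_feats)
        # Residual fusion: audio features enriched with speaker-specific style.
        return self.norm(audio_feats + style_ctx)

# Example: fuse 40 audio frames with a 100-frame style reference.
fused = StyleCrossAttention()(torch.randn(2, 40, 256), torch.randn(2, 100, 256))
```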
- Exposing Lip-syncing Deepfakes from Mouth Inconsistencies [29.81606633121959]
A lip-syncing deepfake is a digitally manipulated video in which a person's lip movements are created convincingly using AI models to match altered or entirely new audio.
In this paper, we describe LIPINC (LIP-syncing detection based on mouth INConsistency), a novel approach to lip-syncing deepfake detection.
arXiv Detail & Related papers (2024-01-18T16:35:37Z)
- GestSync: Determining who is speaking without a talking head [67.75387744442727]
We introduce Gesture-Sync: determining whether a person's gestures are correlated with their speech.
In comparison to Lip-Sync, Gesture-Sync is far more challenging, as the relationship between the voice and body movement is much looser.
We show that the model can be trained using self-supervised learning alone, and evaluate its performance on the LRS3 dataset.
arXiv Detail & Related papers (2023-10-08T22:48:30Z)
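As a rough illustration of how such self-supervised sync training can work, the sketch below uses a batch-wise contrastive objective over paired audio and motion embeddings; the encoders and exact loss form are assumptions rather than the paper's recipe.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of a self-supervised synchronisation objective: embeddings of
# temporally aligned audio and motion windows are pulled together while the
# other (misaligned) pairs in the batch are pushed apart.

def sync_contrastive_loss(audio_emb, motion_emb, temperature=0.07):
    # audio_emb, motion_emb: (B, D); row i of each tensor comes from the same clip.
    audio_emb = F.normalize(audio_emb, dim=-1)
    motion_emb = F.normalize(motion_emb, dim=-1)
    logits = audio_emb @ motion_emb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(audio_emb.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)                # diagonal = aligned pairs
```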
- Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained on a video just a few minutes in length and achieves state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z)
- Diff2Lip: Audio Conditioned Diffusion Models for Lip-Synchronization [38.64540967776744]
Diff2Lip is an audio-conditioned, diffusion-based model that performs in-the-wild lip synchronization while preserving visual quality.
We show results in both reconstruction (same audio-video inputs) and cross (different audio-video inputs) settings on the VoxCeleb2 and LRW datasets.
arXiv Detail & Related papers (2023-08-18T17:59:40Z)
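For intuition, the following is a simplified PyTorch sketch of one audio-conditioned diffusion training step on mouth crops; the denoiser interface, noise-schedule handling, and shapes are assumptions, not Diff2Lip's implementation.

```python
import torch
import torch.nn.functional as F

# Simplified sketch: Gaussian noise is added to the ground-truth mouth crop and
# the network learns to predict that noise given the noisy crop, the diffusion
# timestep, and audio features. All tensors are assumed to be on one device.

def diffusion_training_step(denoiser, mouth_crop, audio_feats, alphas_cumprod):
    # mouth_crop: (B, C, H, W); alphas_cumprod: (T,) precomputed noise schedule.
    b = mouth_crop.size(0)
    t = torch.randint(0, alphas_cumprod.size(0), (b,), device=mouth_crop.device)
    noise = torch.randn_like(mouth_crop)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    noisy = a_bar.sqrt() * mouth_crop + (1.0 - a_bar).sqrt() * noise
    pred = denoiser(noisy, t, audio_feats)   # hypothetical conditioning interface
    return F.mse_loss(pred, noise)
```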
- Audio-driven Talking Face Generation with Stabilized Synchronization Loss [60.01529422759644]
Talking face generation aims to create realistic videos with accurate lip synchronization and high visual quality.
We first tackle the lip leaking problem by introducing a silent-lip generator, which changes the lips of the identity reference to alleviate leakage.
Experiments show that our model outperforms state-of-the-art methods in both visual quality and lip synchronization.
arXiv Detail & Related papers (2023-07-18T15:50:04Z)
- Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert [89.07178484337865]
Talking face generation, also known as speech-to-lip generation, reconstructs facial motions around the lips from coherent speech input.
Previous studies revealed the importance of lip-speech synchronization and visual quality.
We propose using a lip-reading expert to improve the intelligibility of the generated lip regions.
arXiv Detail & Related papers (2023-03-29T07:51:07Z)
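A minimal sketch of how a lip-reading expert can be folded into the generator objective appears below; the frozen lip_reader interface and the CTC formulation are assumptions for illustration, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch only: a frozen lip-reading model scores generated lip
# crops against the ground-truth transcript, and the resulting CTC loss nudges
# the generator toward intelligible mouth shapes.

def lip_reading_loss(lip_reader, generated_lips, transcript, transcript_lens):
    # generated_lips: (B, T, C, H, W); lip_reader is assumed to return per-frame
    # character log-probabilities of shape (T, B, vocab_size).
    log_probs = lip_reader(generated_lips)
    input_lens = torch.full((generated_lips.size(0),), log_probs.size(0), dtype=torch.long)
    return F.ctc_loss(log_probs, transcript, input_lens, transcript_lens)
```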
- VocaLiST: An Audio-Visual Synchronisation Model for Lips and Voices [4.167459103689587]
We address the problem of lip-voice synchronisation in videos containing a human face and voice.
Our approach is based on determining whether the lip motion and the voice in a video are synchronised.
We propose an audio-visual cross-modal transformer-based model that outperforms several baseline models.
arXiv Detail & Related papers (2022-04-05T10:02:39Z)
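The sketch below illustrates one way a cross-modal transformer can be set up as a binary sync classifier over concatenated audio and lip tokens; the token shapes, modality embeddings, and pooling are assumptions, not the VocaLiST architecture.

```python
import torch
import torch.nn as nn

# Rough sketch, with assumed token shapes: tokens from the two modalities are
# tagged with learned modality embeddings, concatenated, encoded jointly, and
# pooled into a single in-sync / out-of-sync probability.

class CrossModalSyncClassifier(nn.Module):
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        self.modality = nn.Parameter(torch.randn(2, dim))   # audio vs. video tag
        self.head = nn.Linear(dim, 1)

    def forward(self, audio_tokens, lip_tokens):
        # audio_tokens: (B, Ta, dim); lip_tokens: (B, Tv, dim)
        tokens = torch.cat([audio_tokens + self.modality[0],
                            lip_tokens + self.modality[1]], dim=1)
        fused = self.encoder(tokens)
        return torch.sigmoid(self.head(fused.mean(dim=1))).squeeze(-1)  # P(in sync)
```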
- Visual Speech Enhancement Without A Real Visual Stream [37.88869937166955]
Current state-of-the-art methods use only the audio stream, and their performance is limited under a wide range of real-world noise conditions.
Recent works using lip movements as additional cues improve the quality of generated speech over "audio-only" methods.
We propose a new paradigm for speech enhancement by exploiting recent breakthroughs in speech-driven lip synthesis.
arXiv Detail & Related papers (2020-12-20T06:02:12Z)
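To show the shape of this pseudo-visual-stream paradigm, here is a very small pipeline sketch in which lips are first synthesised from the noisy speech itself and then consumed by an audio-visual enhancer; both modules and their interfaces are placeholders, not the paper's code.

```python
import torch

# High-level sketch under stated assumptions: `lip_synthesizer` is a
# speech-driven lip generator and `av_enhancer` is an audio-visual denoiser;
# both are hypothetical callables standing in for trained models.

def enhance_with_pseudo_lips(noisy_mel, identity_face, lip_synthesizer, av_enhancer):
    with torch.no_grad():
        pseudo_lips = lip_synthesizer(noisy_mel, identity_face)  # synthetic visual stream
    clean_mel = av_enhancer(noisy_mel, pseudo_lips)               # audio-visual denoising
    return clean_mel
```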
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.