Diff2Lip: Audio Conditioned Diffusion Models for Lip-Synchronization
- URL: http://arxiv.org/abs/2308.09716v1
- Date: Fri, 18 Aug 2023 17:59:40 GMT
- Title: Diff2Lip: Audio Conditioned Diffusion Models for Lip-Synchronization
- Authors: Soumik Mukhopadhyay, Saksham Suri, Ravi Teja Gadde, Abhinav Shrivastava
- Abstract summary: Diff2Lip is an audio-conditioned diffusion-based model that performs lip synchronization in-the-wild while preserving identity, pose, emotions, and image quality.
We show results on both reconstruction (same audio-video inputs) as well as cross (different audio-video inputs) settings on Voxceleb2 and LRW datasets.
- Score: 38.64540967776744
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The task of lip synchronization (lip-sync) seeks to match the lips of human
faces with different audio. It has various applications in the film industry as
well as for creating virtual avatars and for video conferencing. This is a
challenging problem as one needs to simultaneously introduce detailed,
realistic lip movements while preserving the identity, pose, emotions, and
image quality. Many of the previous methods trying to solve this problem suffer
from image quality degradation due to a lack of complete contextual
information. In this paper, we present Diff2Lip, an audio-conditioned
diffusion-based model which is able to do lip synchronization in-the-wild while
preserving these qualities. We train our model on Voxceleb2, a video dataset
containing in-the-wild talking face videos. Extensive studies show that our
method outperforms popular methods like Wav2Lip and PC-AVS in Fréchet
inception distance (FID) metric and Mean Opinion Scores (MOS) of the users. We
show results on both reconstruction (same audio-video inputs) as well as cross
(different audio-video inputs) settings on Voxceleb2 and LRW datasets. Video
results and code can be accessed from our project page
(https://soumik-kanad.github.io/diff2lip).
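The abstract describes masked, audio-conditioned diffusion only in general terms, so the Python sketch below is an illustration of that idea rather than the authors' released code: a toy denoiser resynthesizes a masked mouth region through a DDPM-style reverse loop, audio features condition every step, and the unmasked pixels are copied back in so identity, pose, and background are preserved. All names, tensor shapes, and the noise schedule are assumptions.

```python
# Minimal sketch (not the authors' code): audio-conditioned diffusion
# inpainting of the mouth region of one face frame. The denoiser is a toy
# stand-in for the real U-Net; audio features would normally come from a
# mel-spectrogram encoder.
import torch
import torch.nn as nn

T = 50                                   # number of diffusion steps (illustrative)
betas = torch.linspace(1e-4, 0.02, T)    # linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

class ToyDenoiser(nn.Module):
    """Predicts the noise in the masked frame, conditioned on the noisy frame,
    the masked reference frame, and a pooled audio embedding."""
    def __init__(self, audio_dim=128):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, 3)   # broadcast audio as extra channels
        self.net = nn.Sequential(
            nn.Conv2d(3 + 3 + 3, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, x_t, masked_ref, audio_feat):
        b, _, h, w = x_t.shape
        a = self.audio_proj(audio_feat).view(b, 3, 1, 1).expand(b, 3, h, w)
        return self.net(torch.cat([x_t, masked_ref, a], dim=1))

@torch.no_grad()
def sample_lip_sync(model, frame, mouth_mask, audio_feat):
    """DDPM-style reverse loop that resynthesizes only the masked mouth region."""
    x = torch.randn_like(frame)                      # start from pure noise
    masked_ref = frame * (1 - mouth_mask)            # visible (unmasked) context
    for t in reversed(range(T)):
        eps = model(x, masked_ref, audio_feat)       # predict noise at step t
        ab, a_t = alpha_bars[t], alphas[t]
        mean = (x - (1 - a_t) / torch.sqrt(1 - ab) * eps) / torch.sqrt(a_t)
        x = mean + torch.sqrt(betas[t]) * torch.randn_like(x) if t > 0 else mean
        # Crude blend: copy the clean context back in. A fuller inpainting
        # scheme would noise the known region to the current step first.
        x = frame * (1 - mouth_mask) + x * mouth_mask
    return x

# Usage with random tensors in place of real video frames and audio features.
model = ToyDenoiser()
frame = torch.rand(1, 3, 64, 64)                               # one face frame in [0, 1]
mask = torch.zeros(1, 1, 64, 64); mask[..., 32:, 16:48] = 1.0  # lower-face mask
audio = torch.randn(1, 128)                                    # placeholder audio embedding
out = sample_lip_sync(model, frame, mask, audio)
print(out.shape)                                               # torch.Size([1, 3, 64, 64])
```

Copying the visible context back at each step is the standard inpainting trick that keeps everything outside the mouth region untouched, which matches the quality-preservation property the abstract emphasizes.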
Related papers
- MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting [12.852715177163608]
MuseTalk generates lip-sync targets in a latent space encoded by a Variational Autoencoder.
It supports online face generation at 256x256 resolution at more than 30 FPS with negligible starting latency.
arXiv Detail & Related papers (2024-10-14T03:22:26Z)
- Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained by a video of just a few minutes in length and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z)
- ReliTalk: Relightable Talking Portrait Generation from a Single Video [62.47116237654984]
ReliTalk is a novel framework for relightable audio-driven talking portrait generation from monocular videos.
Our key insight is to decompose the portrait's reflectance from implicitly learned audio-driven facial normals and images.
arXiv Detail & Related papers (2023-09-05T17:59:42Z)
- Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert [89.07178484337865]
Talking face generation, also known as speech-to-lip generation, reconstructs facial motions concerning lips given coherent speech input.
Previous studies revealed the importance of lip-speech synchronization and visual quality.
We propose using a lip-reading expert to improve the intelligibility of the generated lip regions.
arXiv Detail & Related papers (2023-03-29T07:51:07Z)
- Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers [91.00397473678088]
Previous studies have explored generating accurately lip-synced talking faces for arbitrary targets given audio conditions.
We propose the Audio-Visual Context-Aware Transformer (AV-CAT) framework, which produces accurate lip-sync with photo-realistic quality.
Our model can generate high-fidelity lip-synced results for arbitrary subjects.
arXiv Detail & Related papers (2022-12-09T16:32:46Z)
- VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild [37.93856291026653]
VideoReTalking is a new system to edit the faces of a real-world talking head video according to input audio.
It produces a high-quality, lip-synced output video even with a different emotion.
arXiv Detail & Related papers (2022-11-27T08:14:23Z)
- SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory [27.255990661166614]
The challenge of talking face generation from speech lies in aligning two different modal information, audio and video, such that the mouth region corresponds to input audio.
Previous methods either exploit audio-visual representation learning or leverage intermediate structural information such as landmarks and 3D models.
We propose Audio-Lip Memory that brings in visual information of the mouth region corresponding to input audio and enforces fine-grained audio-visual coherence.
arXiv Detail & Related papers (2022-11-02T07:17:49Z)
- Learning to Predict Salient Faces: A Novel Visual-Audio Saliency Model [96.24038430433885]
We propose a novel multi-modal video saliency model consisting of three branches: visual, audio and face.
Experimental results show that the proposed method outperforms 11 state-of-the-art saliency prediction works.
arXiv Detail & Related papers (2021-03-29T09:09:39Z)
- One Shot Audio to Animated Video Generation [15.148595295859659]
We propose a novel method to generate an animated video of arbitrary length using an audio clip and a single unseen image of a person as an input.
OneShotAu2AV can generate animated videos that have: (a) lip movements that are in sync with the audio, (b) natural facial expressions such as blinks and eyebrow movements, (c) head movements.
arXiv Detail & Related papers (2021-02-19T04:29:17Z)
- A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild [37.37319356008348]
The paper investigates lip-syncing a talking face video of an arbitrary identity to match a target speech segment.
We identify key reasons why prior approaches produce inaccurate lip-sync for arbitrary identities in unconstrained videos and resolve them by learning from a powerful lip-sync discriminator (a sketch of such a sync loss follows this list).
We propose new, rigorous evaluation benchmarks and metrics to accurately measure lip synchronization in unconstrained videos.
arXiv Detail & Related papers (2020-08-23T11:01:25Z)
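The last entry (Wav2Lip) relies on a strong lip-sync discriminator. Below is a minimal, SyncNet-style sketch of such a sync loss, assuming toy audio and mouth-crop encoders that map into a shared embedding space; the encoder architectures, tensor shapes, and window sizes are placeholders, not the paper's networks.

```python
# Minimal sketch (not the paper's code): a SyncNet-style sync loss that scores
# whether a mouth-crop clip and an audio window are in sync via the cosine
# similarity of their embeddings, trained with binary cross-entropy.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyAudioEncoder(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(80 * 16, embed_dim))

    def forward(self, mel):                 # mel: (B, 1, 80, 16) spectrogram window
        return F.normalize(self.net(mel), dim=-1)

class ToyVideoEncoder(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(5 * 3 * 48 * 96, embed_dim))

    def forward(self, mouths):              # mouths: (B, 5, 3, 48, 96) mouth crops
        return F.normalize(self.net(mouths), dim=-1)

def sync_loss(audio_emb, video_emb, in_sync):
    """BCE on cosine similarity: push in-sync pairs toward 1, off-sync toward 0."""
    sim = (F.cosine_similarity(audio_emb, video_emb, dim=-1) + 1.0) / 2.0
    return F.binary_cross_entropy(sim, in_sync)

# Usage with random tensors standing in for real mel windows and mouth crops.
audio_enc, video_enc = ToyAudioEncoder(), ToyVideoEncoder()
mel = torch.randn(4, 1, 80, 16)
mouths = torch.rand(4, 5, 3, 48, 96)
labels = torch.tensor([1.0, 1.0, 0.0, 0.0])   # first two pairs in sync, last two shifted
loss = sync_loss(audio_enc(mel), video_enc(mouths), labels)
print(loss.item())
```

In a lip-sync generator's training loop, the same loss can be applied with the discriminator frozen so that generated mouth crops are pushed to score as in-sync with the driving audio.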