Exploring Phonetic Context-Aware Lip-Sync For Talking Face Generation
- URL: http://arxiv.org/abs/2305.19556v3
- Date: Mon, 1 Apr 2024 04:45:30 GMT
- Title: Exploring Phonetic Context-Aware Lip-Sync For Talking Face Generation
- Authors: Se Jin Park, Minsu Kim, Jeongsoo Choi, Yong Man Ro
- Abstract summary: Context-Aware Lip-Sync framework (CALS)
CALS comprises an Audio-to-Lip module and a Lip-to-Face module.
- Score: 58.72068260933836
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Talking face generation is the challenging task of synthesizing a natural and realistic face whose lip movements are accurately synchronized with given audio. Due to co-articulation, where an isolated phone is influenced by the preceding or following phones, the articulation of a phone varies with its phonetic context. Therefore, modeling lip motion with the phonetic context can generate more spatio-temporally aligned lip movement. In this respect, we investigate the phonetic context in generating lip motion for talking face generation. We propose the Context-Aware Lip-Sync framework (CALS), which explicitly leverages phonetic context to generate lip movement of the target face. CALS comprises an Audio-to-Lip module and a Lip-to-Face module. The former is pretrained with masked learning to map each phone to a contextualized lip motion unit. The contextualized lip motion unit then guides the latter in synthesizing a target identity with context-aware lip motion. Through extensive experiments, we verify that simply exploiting the phonetic context in the proposed CALS framework effectively enhances spatio-temporal alignment. We also demonstrate the extent to which the phonetic context assists in lip synchronization and find the effective context window for lip generation to be approximately 1.2 seconds.
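The masked pretraining of the Audio-to-Lip module can be pictured as a masked-prediction objective over short audio windows: hide some frames and force the model to infer their lip motion from the surrounding phonetic context. The PyTorch sketch below is only an illustration of that idea under assumed feature dimensions, frame rate, mask ratio, and loss; it is not the authors' released CALS implementation, and the class name MaskedAudioToLip is hypothetical.

```python
import torch
import torch.nn as nn


class MaskedAudioToLip(nn.Module):
    """Masked-prediction sketch: hide audio frames, predict their lip-motion units."""

    def __init__(self, audio_dim=80, lip_dim=40, d_model=256, n_layers=4, n_heads=4, max_len=64):
        super().__init__()
        self.in_proj = nn.Linear(audio_dim, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))  # learned frame positions
        self.mask_token = nn.Parameter(torch.zeros(d_model))       # placeholder for hidden frames
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, lip_dim)                # one lip-motion unit per frame

    def forward(self, audio_feats, mask):
        # audio_feats: (B, T, audio_dim); mask: (B, T) bool, True = frame hidden from the model
        x = self.in_proj(audio_feats) + self.pos[:, : audio_feats.size(1)]
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        return self.out_proj(self.encoder(x))


# Toy training step. At ~25 fps, a 30-frame window is roughly the 1.2-second
# context the abstract reports as the effective window size.
model = MaskedAudioToLip()
audio = torch.randn(8, 30, 80)       # batch of audio feature windows (placeholder values)
lip_units = torch.randn(8, 30, 40)   # target lip-motion features (placeholder values)
mask = torch.rand(8, 30) < 0.3       # hide ~30% of frames at random
pred = model(audio, mask)
loss = nn.functional.mse_loss(pred[mask], lip_units[mask])  # reconstruct only the hidden frames
loss.backward()
```

In the full framework, the predicted contextualized lip motion units would then guide the Lip-to-Face module when rendering the target identity.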
Related papers
- RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network [63.77823518278202]
RealTalk combines an audio-to-expression transformer with a high-fidelity expression-to-face framework.
In the first component, we consider both identity and intra-personal variation features related to speaking lip movements.
In the second component, we design a lightweight facial identity alignment (FIA) module.
This novel design allows us to generate fine details in real-time, without depending on sophisticated and inefficient feature alignment modules.
arXiv Detail & Related papers (2024-06-26T12:09:59Z) - Towards Accurate Lip-to-Speech Synthesis in-the-Wild [31.289366690147556]
We introduce a novel approach to address the task of synthesizing speech from silent videos of any in-the-wild speaker solely based on lip movements.
The traditional approach of directly generating speech from lip videos faces the challenge of not being able to learn a robust language model from speech alone.
We propose incorporating noisy text supervision using a state-of-the-art lip-to-text network that instills language information into our model.
arXiv Detail & Related papers (2024-03-02T04:07:24Z) - Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained on a video just a few minutes long and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z) - Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert [89.07178484337865]
Talking face generation, also known as speech-to-lip generation, reconstructs lip-region facial motions from coherent speech input.
Previous studies revealed the importance of lip-speech synchronization and visual quality.
We propose using a lip-reading expert to improve the intelligibility of the generated lip regions.
arXiv Detail & Related papers (2023-03-29T07:51:07Z) - Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers [91.00397473678088]
Previous studies have explored generating accurately lip-synced talking faces for arbitrary targets given audio conditions.
We propose the Audio-Visual Context-Aware Transformer (AV-CAT) framework, which produces accurate lip-sync with photo-realistic quality.
Our model can generate high-fidelity lip-synced results for arbitrary subjects.
arXiv Detail & Related papers (2022-12-09T16:32:46Z) - Learning Speaker-specific Lip-to-Speech Generation [28.620557933595585]
This work aims to understand the correlation/mapping between speech and the sequence of lip movements of individual speakers.
We learn temporal synchronization using deep metric learning, which guides the decoder to generate speech in sync with input lip movements.
We have trained our model on the Grid and Lip2Wav Chemistry lecture datasets to evaluate single-speaker natural speech generation.
arXiv Detail & Related papers (2022-06-04T19:40:02Z) - VisualTTS: TTS with Accurate Lip-Speech Synchronization for Automatic Voice Over [68.22776506861872]
We formulate a novel task to synthesize speech in sync with a silent pre-recorded video, denoted as automatic voice over (AVO).
A natural solution to AVO is to condition the speech rendering on the temporal progression of the lip sequence in the video.
We propose a novel text-to-speech model that is conditioned on visual input, named VisualTTS, for accurate lip-speech synchronization.
arXiv Detail & Related papers (2021-10-07T11:25:25Z)