Exploring Phonetic Context-Aware Lip-Sync For Talking Face Generation
- URL: http://arxiv.org/abs/2305.19556v3
- Date: Mon, 1 Apr 2024 04:45:30 GMT
- Title: Exploring Phonetic Context-Aware Lip-Sync For Talking Face Generation
- Authors: Se Jin Park, Minsu Kim, Jeongsoo Choi, Yong Man Ro
- Abstract summary: Context-Aware Lip-Sync framework (CALS)
CALS comprises an Audio-to-Lip module and a Lip-to-Face module.
- Score: 58.72068260933836
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Talking face generation is the challenging task of synthesizing a natural and realistic face whose lip movements are accurately synchronized with given audio. Due to co-articulation, where an isolated phone is influenced by the preceding or following phones, the articulation of a phone varies with its phonetic context. Therefore, modeling lip motion with the phonetic context can generate more spatio-temporally aligned lip movement. In this respect, we investigate the phonetic context in generating lip motion for talking face generation. We propose the Context-Aware Lip-Sync framework (CALS), which explicitly leverages phonetic context to generate lip movement of the target face. CALS comprises an Audio-to-Lip module and a Lip-to-Face module. The former is pretrained with masked learning to map each phone to a contextualized lip motion unit. The contextualized lip motion unit then guides the latter in synthesizing a target identity with context-aware lip motion. Through extensive experiments, we verify that simply exploiting the phonetic context in the proposed CALS framework effectively enhances spatio-temporal alignment. We also demonstrate the extent to which the phonetic context assists in lip synchronization and find the effective context window for lip generation to be approximately 1.2 seconds.
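The masked pretraining of the Audio-to-Lip module can be pictured as a masked-prediction objective over short audio windows: hide some frames and force the model to infer their lip motion from the surrounding phonetic context. The PyTorch sketch below is only an illustration of that idea under assumed feature dimensions, frame rate, mask ratio, and loss; it is not the authors' released CALS implementation, and the class name MaskedAudioToLip is hypothetical.

```python
import torch
import torch.nn as nn


class MaskedAudioToLip(nn.Module):
    """Masked-prediction sketch: hide audio frames, predict their lip-motion units."""

    def __init__(self, audio_dim=80, lip_dim=40, d_model=256, n_layers=4, n_heads=4, max_len=64):
        super().__init__()
        self.in_proj = nn.Linear(audio_dim, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))  # learned frame positions
        self.mask_token = nn.Parameter(torch.zeros(d_model))       # placeholder for hidden frames
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, lip_dim)                # one lip-motion unit per frame

    def forward(self, audio_feats, mask):
        # audio_feats: (B, T, audio_dim); mask: (B, T) bool, True = frame hidden from the model
        x = self.in_proj(audio_feats) + self.pos[:, : audio_feats.size(1)]
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        return self.out_proj(self.encoder(x))


# Toy training step. At ~25 fps, a 30-frame window is roughly the 1.2-second
# context the abstract reports as the effective window size.
model = MaskedAudioToLip()
audio = torch.randn(8, 30, 80)       # batch of audio feature windows (placeholder values)
lip_units = torch.randn(8, 30, 40)   # target lip-motion features (placeholder values)
mask = torch.rand(8, 30) < 0.3       # hide ~30% of frames at random
pred = model(audio, mask)
loss = nn.functional.mse_loss(pred[mask], lip_units[mask])  # reconstruct only the hidden frames
loss.backward()
```

In the full framework, the predicted contextualized lip motion units would then guide the Lip-to-Face module when rendering the target identity.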
Related papers
- RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network [63.77823518278202]
RealTalk combines an audio-to-expression transformer with a high-fidelity expression-to-face framework.
In the first component, we consider both identity and intra-personal variation features related to speaking lip movements.
In the second component, we design a lightweight facial identity alignment (FIA) module.
This novel design allows us to generate fine details in real-time, without depending on sophisticated and inefficient feature alignment modules.
arXiv Detail & Related papers (2024-06-26T12:09:59Z) - Towards Accurate Lip-to-Speech Synthesis in-the-Wild [31.289366690147556]
We introduce a novel approach to address the task of synthesizing speech from silent videos of any in-the-wild speaker solely based on lip movements.
The traditional approach of directly generating speech from lip videos faces the challenge of not being able to learn a robust language model from speech alone.
We propose incorporating noisy text supervision using a state-of-the-art lip-to-text network that instills language information into our model.
arXiv Detail & Related papers (2024-03-02T04:07:24Z) - Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained on a video just a few minutes long and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z) - Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert [89.07178484337865]
Talking face generation, also known as speech-to-lip generation, reconstructs lip-region facial motions from coherent speech input.
Previous studies revealed the importance of lip-speech synchronization and visual quality.
We propose using a lip-reading expert to improve the intelligibility of the generated lip regions.
arXiv Detail & Related papers (2023-03-29T07:51:07Z) - Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers [91.00397473678088]
Previous studies have explored generating accurately lip-synced talking faces for arbitrary targets given audio conditions.
We propose the Audio-Visual Context-Aware Transformer (AV-CAT) framework, which produces accurate lip-sync with photo-realistic quality.
Our model can generate high-fidelity lip-synced results for arbitrary subjects.
arXiv Detail & Related papers (2022-12-09T16:32:46Z) - Learning Speaker-specific Lip-to-Speech Generation [28.620557933595585]
This work aims to understand the correlation/mapping between speech and the sequence of lip movements of individual speakers.
We learn temporal synchronization using deep metric learning, which guides the decoder to generate speech in sync with input lip movements.
We have trained our model on the Grid and Lip2Wav Chemistry lecture datasets to evaluate single-speaker natural speech generation.
arXiv Detail & Related papers (2022-06-04T19:40:02Z) - VisualTTS: TTS with Accurate Lip-Speech Synchronization for Automatic Voice Over [68.22776506861872]
We formulate a novel task to synthesize speech in sync with a silent pre-recorded video, denoted as automatic voice over (AVO).
A natural solution to AVO is to condition the speech rendering on the temporal progression of the lip sequence in the video.
We propose a novel text-to-speech model that is conditioned on visual input, named VisualTTS, for accurate lip-speech synchronization.
arXiv Detail & Related papers (2021-10-07T11:25:25Z)