RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network
- URL: http://arxiv.org/abs/2406.18284v2
- Date: Thu, 8 Aug 2024 12:18:30 GMT
- Title: RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network
- Authors: Xiaozhong Ji, Chuming Lin, Zhonggan Ding, Ying Tai, Junwei Zhu, Xiaobin Hu, Donghao Luo, Yanhao Ge, Chengjie Wang
- Abstract summary: RealTalk is an audio-driven face generation framework consisting of an audio-to-expression transformer and a high-fidelity expression-to-face renderer.
In the first component, we consider both identity and intra-personal variation features related to speaking lip movements.
In the second component, we design a lightweight facial identity alignment (FIA) module.
This novel design allows us to generate fine details in real-time, without depending on sophisticated and inefficient feature alignment modules.
- Score: 48.95833484103569
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Person-generic audio-driven face generation is a challenging task in computer vision. Previous methods have achieved remarkable progress in audio-visual synchronization, but there is still a significant gap between current results and practical applications. The challenges are two-fold: 1) Preserving unique individual traits for achieving high-precision lip synchronization. 2) Generating high-quality facial renderings in real-time performance. In this paper, we propose a novel generalized audio-driven framework RealTalk, which consists of an audio-to-expression transformer and a high-fidelity expression-to-face renderer. In the first component, we consider both identity and intra-personal variation features related to speaking lip movements. By incorporating cross-modal attention on the enriched facial priors, we can effectively align lip movements with audio, thus attaining greater precision in expression prediction. In the second component, we design a lightweight facial identity alignment (FIA) module which includes a lip-shape control structure and a face texture reference structure. This novel design allows us to generate fine details in real-time, without depending on sophisticated and inefficient feature alignment modules. Our experimental results, both quantitative and qualitative, on public datasets demonstrate the clear advantages of our method in terms of lip-speech synchronization and generation quality. Furthermore, our method is efficient and requires fewer computational resources, making it well-suited to meet the needs of practical applications.
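The first component described above predicts expressions by letting audio features attend over enriched facial priors. The sketch below is a minimal PyTorch illustration of that cross-modal attention idea, not the released RealTalk implementation; all module names, feature dimensions, and the two-layer temporal encoder are assumptions made for the example.

```python
# Minimal sketch (not the authors' code): audio features act as queries over
# facial-prior tokens (identity + intra-personal variation) via cross-modal
# attention, then a small temporal encoder produces per-frame expression
# coefficients. Dimensions and layer counts are hypothetical.
import torch
import torch.nn as nn


class AudioToExpression(nn.Module):
    def __init__(self, audio_dim=80, prior_dim=256, d_model=256,
                 n_heads=4, n_expr=64):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)   # audio features -> model space
        self.prior_proj = nn.Linear(prior_dim, d_model)   # facial priors -> model space
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(d_model, n_expr)            # per-frame expression coefficients

    def forward(self, audio_feats, prior_feats):
        # audio_feats: (B, T, audio_dim)  per-frame audio features
        # prior_feats: (B, P, prior_dim)  facial-prior tokens for the identity
        q = self.audio_proj(audio_feats)
        kv = self.prior_proj(prior_feats)
        # Audio queries attend over the facial priors to align lip motion with speech.
        fused, _ = self.cross_attn(q, kv, kv)
        fused = self.temporal(fused)                       # add temporal context
        return self.head(fused)                            # (B, T, n_expr)


if __name__ == "__main__":
    model = AudioToExpression()
    audio = torch.randn(2, 100, 80)     # e.g. 100 frames of mel-spectrogram features
    priors = torch.randn(2, 16, 256)    # 16 facial-prior tokens per identity
    print(model(audio, priors).shape)   # torch.Size([2, 100, 64])
```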
Related papers
- High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model [89.29655924125461]
We propose a novel landmark-based diffusion model for talking face generation.
We first establish the less ambiguous mapping from audio to landmark motion of lip and jaw.
Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks.
arXiv Detail & Related papers (2024-08-10T02:58:28Z) - SwapTalk: Audio-Driven Talking Face Generation with One-Shot Customization in Latent Space [13.59798532129008]
We propose an innovative unified framework, SwapTalk, which accomplishes both face swapping and lip synchronization tasks in the same latent space.
We introduce a novel identity consistency metric to more comprehensively assess the identity consistency over time series in generated facial videos.
Experimental results on the HDTF dataset demonstrate that our method significantly surpasses existing techniques in video quality, lip synchronization accuracy, face swapping fidelity, and identity consistency.
arXiv Detail & Related papers (2024-05-09T09:22:09Z) - FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process of people hearing speech, extracting meaningful cues, and creating dynamically audio-consistent talking faces from a single audio.
Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency.
We introduce the Controllable Coherent Frame generation, which involves the flexible integration of three trainable adapters with frozen Latent Diffusion Models.
arXiv Detail & Related papers (2024-03-04T09:59:48Z) - GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance [83.43852715997596]
GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3D face model.
It can synthesize smooth lip dynamics while preserving the speaker's identity.
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
arXiv Detail & Related papers (2023-12-12T16:00:55Z) - Realistic Speech-to-Face Generation with Speech-Conditioned Latent Diffusion Model with Face Prior [13.198105709331617]
We propose a novel speech-to-face generation framework, which leverages a Speech-Conditioned Latent Diffusion Model, called SCLDM.
This is the first work to harness the exceptional modeling capabilities of diffusion models for speech-to-face generation.
We show that our method can produce more realistic face images while preserving the identity of the speaker better than state-of-the-art methods.
arXiv Detail & Related papers (2023-10-05T07:44:49Z) - SelfTalk: A Self-Supervised Commutative Training Diagram to Comprehend 3D Talking Faces [28.40393487247833]
Speech-driven 3D face animation is a technique whose applications extend to various multimedia fields.
Previous research has generated promising, realistic lip movements and facial expressions from audio signals.
We propose a novel framework, SelfTalk, that incorporates self-supervision into a cross-modal network system to learn 3D talking faces.
arXiv Detail & Related papers (2023-06-19T09:39:10Z) - GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation [71.73912454164834]
A modern talking face generation method is expected to achieve the goals of generalized audio-lip synchronization, good video quality, and high system efficiency.
NeRF has become a popular technique in this field since it can achieve high-fidelity and 3D-consistent talking face generation from a few-minute-long training video.
We propose GeneFace++ to handle these challenges by utilizing the pitch contour as an auxiliary feature and introducing a temporal loss in the facial motion prediction process.
arXiv Detail & Related papers (2023-05-01T12:24:09Z) - Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers [91.00397473678088]
Previous studies have explored generating accurately lip-synced talking faces for arbitrary targets given audio conditions.
We propose the Audio-Visual Context-Aware Transformer (AV-CAT) framework, which produces accurate lip-sync with photo-realistic quality.
Our model can generate high-fidelity lip-synced results for arbitrary subjects.
arXiv Detail & Related papers (2022-12-09T16:32:46Z) - DFA-NeRF: Personalized Talking Head Generation via Disentangled Face Attributes Neural Rendering [69.9557427451339]
We propose a framework based on neural radiance fields to pursue high-fidelity talking head generation.
Specifically, the neural radiance field takes lip movement features and personalized attributes as two disentangled conditions.
We show that our method achieves significantly better results than state-of-the-art methods.
arXiv Detail & Related papers (2022-01-03T18:23:38Z)