See the Speaker: Crafting High-Resolution Talking Faces from Speech with Prior Guidance and Region Refinement
- URL: http://arxiv.org/abs/2510.26819v1
- Date: Tue, 28 Oct 2025 09:46:19 GMT
- Title: See the Speaker: Crafting High-Resolution Talking Faces from Speech with Prior Guidance and Region Refinement
- Authors: Jinting Wang, Jun Wang, Hei Victor Cheng, Li Liu
- Abstract summary: This work proposes a novel approach that directly extracts information from the speech, addressing key challenges in speech-to-talking-face generation. Notably, this is the first method capable of generating high-resolution, high-quality talking face videos exclusively from a single speech input.
- Score: 19.653004988642163
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Unlike existing methods that rely on source images as appearance references and use source speech to generate motion, this work proposes a novel approach that directly extracts information from the speech, addressing key challenges in speech-to-talking-face generation. Specifically, we first employ a speech-to-face portrait generation stage, utilizing a speech-conditioned diffusion model combined with a statistical facial prior and a sample-adaptive weighting module to achieve high-quality portrait generation. In the subsequent speech-driven talking face generation stage, we embed expressive dynamics such as lip movement, facial expressions, and eye movements into the latent space of the diffusion model and further optimize lip synchronization using a region-enhancement module. To generate high-resolution outputs, we integrate a pre-trained Transformer-based discrete codebook with an image rendering network, enhancing video frame details in an end-to-end manner. Experimental results demonstrate that our method outperforms existing approaches on the HDTF, VoxCeleb, and AVSpeech datasets. Notably, this is the first method capable of generating high-resolution, high-quality talking face videos exclusively from a single speech input.
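As a rough illustration of how the stages described in the abstract might fit together, here is a minimal, hypothetical PyTorch sketch. The module names and interfaces (SpeechToPortrait, TalkingFaceGenerator, CodebookRenderer) are assumptions made for this example; their internals only stand in for the paper's speech-conditioned diffusion model, sample-adaptive weighting, region-enhancement module, and codebook-based renderer, and do not reproduce them.

```python
# Hypothetical sketch of the two-stage pipeline described in the abstract:
# stage 1 maps speech to a portrait, stage 2 maps the portrait plus per-frame
# speech features to motion latents, and a final renderer produces frames.
# All module internals below are placeholders, not the paper's architecture.
import torch
import torch.nn as nn


class SpeechToPortrait(nn.Module):
    """Stage 1 (assumed interface): utterance-level speech embedding -> portrait."""
    def __init__(self, speech_dim=512, img_size=64):
        super().__init__()
        self.img_size = img_size
        self.decode = nn.Linear(speech_dim, 3 * img_size * img_size)

    def forward(self, speech_emb):                       # (B, speech_dim)
        return self.decode(speech_emb).view(-1, 3, self.img_size, self.img_size)


class TalkingFaceGenerator(nn.Module):
    """Stage 2 (assumed interface): portrait + per-frame speech -> motion latents."""
    def __init__(self, speech_dim=512, latent_dim=256):
        super().__init__()
        self.portrait_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(latent_dim))
        self.fuse = nn.GRU(speech_dim + latent_dim, latent_dim, batch_first=True)

    def forward(self, portrait, speech_frames):          # (B,3,H,W), (B,T,speech_dim)
        ident = self.portrait_enc(portrait).unsqueeze(1)  # (B, 1, latent_dim)
        ident = ident.expand(-1, speech_frames.size(1), -1)
        latents, _ = self.fuse(torch.cat([speech_frames, ident], dim=-1))
        return latents                                    # (B, T, latent_dim)


class CodebookRenderer(nn.Module):
    """Final rendering (assumed interface): motion latents -> high-resolution frames."""
    def __init__(self, latent_dim=256, out_size=128):
        super().__init__()
        self.out_size = out_size
        self.render = nn.Linear(latent_dim, 3 * out_size * out_size)

    def forward(self, latents):                           # (B, T, latent_dim)
        b, t, _ = latents.shape
        return self.render(latents).view(b, t, 3, self.out_size, self.out_size)


speech_clip = torch.randn(2, 512)       # utterance-level speech embedding
speech_seq = torch.randn(2, 25, 512)    # per-frame speech features (25 video frames)
portrait = SpeechToPortrait()(speech_clip)
video = CodebookRenderer()(TalkingFaceGenerator()(portrait, speech_seq))
print(video.shape)                       # torch.Size([2, 25, 3, 128, 128])
```

The sketch only shows the data flow from a single speech input to a rendered frame sequence; in the actual method each stage is a substantially more involved generative model.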
Related papers
- From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech [26.67378997911053]
The objective of this study is to generate high-quality speech from silent talking face videos. We propose a novel video-to-speech system that bridges the modality gap between silent video and multi-faceted speech. Our method achieves exceptional generation quality comparable to real utterances.
arXiv Detail & Related papers (2025-03-21T09:02:38Z)
- PortraitTalk: Towards Customizable One-Shot Audio-to-Talking Face Generation [48.94486508604052]
We introduce a novel, customizable one-shot audio-driven talking face generation framework, named PortraitTalk. Our proposed method utilizes a latent diffusion framework consisting of two main components: IdentityNet and AnimateNet. A key innovation of PortraitTalk is the incorporation of text prompts through decoupled cross-attention mechanisms.
arXiv Detail & Related papers (2024-12-10T18:51:31Z)
- High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model [89.29655924125461]
We propose a novel landmark-based diffusion model for talking face generation.
We first establish the less ambiguous mapping from audio to landmark motion of lip and jaw.
Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks.
arXiv Detail & Related papers (2024-08-10T02:58:28Z)
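The audio-to-landmark mapping described in the entry above (predicting lip and jaw motion from audio before rendering) can be illustrated with a small, assumed sketch; the feature sizes and the AudioToLandmark module below are placeholders, and the paper's TalkFormer conditioning module is not reproduced.

```python
# Assumed sketch: regress per-frame lip/jaw landmark offsets from an audio
# feature sequence (e.g. mel-spectrogram frames). Dimensions are illustrative.
import torch
import torch.nn as nn


class AudioToLandmark(nn.Module):
    def __init__(self, audio_dim=80, hidden=256, n_landmarks=20):
        super().__init__()
        self.temporal = nn.LSTM(audio_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, n_landmarks * 2)   # (x, y) offset per landmark

    def forward(self, audio_feats):                       # (B, T, audio_dim)
        h, _ = self.temporal(audio_feats)
        offsets = self.head(h)                            # (B, T, n_landmarks * 2)
        return offsets.view(audio_feats.size(0), audio_feats.size(1), -1, 2)


mel = torch.randn(4, 100, 80)               # 100 audio frames per clip
lip_jaw_motion = AudioToLandmark()(mel)     # (4, 100, 20, 2) landmark displacements
```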
- Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625]
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters.
Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
arXiv Detail & Related papers (2024-06-26T04:53:11Z)
- Controllable Talking Face Generation by Implicit Facial Keypoints Editing [6.036277153327655]
We present ControlTalk, a talking face generation method to control face expression deformation based on driven audio.
Our experiments show that our method is superior to state-of-the-art performance on widely used benchmarks, including HDTF and MEAD.
arXiv Detail & Related papers (2024-06-05T02:54:46Z)
- GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance [83.43852715997596]
GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3d face model.
It can synthesize smooth lip dynamics while preserving the speaker's identity.
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
arXiv Detail & Related papers (2023-12-12T16:00:55Z)
- Realistic Speech-to-Face Generation with Speech-Conditioned Latent Diffusion Model with Face Prior [13.198105709331617]
We propose a novel speech-to-face generation framework, which leverages a Speech-Conditioned Latent Diffusion Model, called SCLDM.
This is the first work to harness the exceptional modeling capabilities of diffusion models for speech-to-face generation.
We show that our method can produce more realistic face images while preserving the identity of the speaker better than state-of-the-art methods.
arXiv Detail & Related papers (2023-10-05T07:44:49Z)
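A speech-conditioned denoising loop in the spirit of the latent diffusion approach above can be sketched as follows; the denoiser, noise schedule, and latent size are illustrative assumptions rather than SCLDM itself, and the statistical face prior is only hinted at in a comment.

```python
# Assumed sketch of speech-conditioned diffusion sampling in a latent space.
import torch
import torch.nn as nn

T = 50                                               # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)


class SpeechConditionedDenoiser(nn.Module):
    def __init__(self, latent_dim=256, speech_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + speech_dim + 1, 512), nn.SiLU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, z_t, speech_emb, t):            # predict the noise component
        t_feat = t.float().view(-1, 1) / T
        return self.net(torch.cat([z_t, speech_emb, t_feat], dim=-1))


@torch.no_grad()
def sample(denoiser, speech_emb, latent_dim=256):
    # Start from Gaussian noise; a statistical face prior could be blended in here.
    z = torch.randn(speech_emb.size(0), latent_dim)
    for t in reversed(range(T)):
        eps = denoiser(z, speech_emb, torch.full((z.size(0),), t))
        a, ab = alphas[t], alpha_bars[t]
        z = (z - (1 - a) / torch.sqrt(1 - ab) * eps) / torch.sqrt(a)
        if t > 0:
            z = z + torch.sqrt(betas[t]) * torch.randn_like(z)
    return z                                          # decode with a VAE decoder to obtain the face


portrait_latent = sample(SpeechConditionedDenoiser(), torch.randn(2, 512))
```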
- Identity-Preserving Talking Face Generation with Landmark and Appearance Priors [106.79923577700345]
Existing person-generic methods have difficulty in generating realistic and lip-synced videos.
We propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures.
Our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
arXiv Detail & Related papers (2023-05-15T01:31:32Z)
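One common way to realize the landmark-to-video rendering stage mentioned above is to rasterize the predicted landmarks into a heatmap channel and condition a rendering network on it together with a reference appearance frame; the sketch below assumes this formulation, and the rendering network is a trivial stand-in, not the paper's architecture.

```python
# Assumed sketch: landmarks -> heatmap channel, concatenated with a reference
# frame as conditioning for a rendering network. Names and sizes are illustrative.
import torch
import torch.nn as nn


def landmarks_to_heatmap(landmarks, size=64, sigma=2.0):
    """landmarks: (B, N, 2) in pixel coordinates -> (B, 1, size, size) heatmap."""
    ys = torch.arange(size).view(1, 1, size, 1).float()
    xs = torch.arange(size).view(1, 1, 1, size).float()
    lx = landmarks[..., 0].view(landmarks.size(0), landmarks.size(1), 1, 1)
    ly = landmarks[..., 1].view(landmarks.size(0), landmarks.size(1), 1, 1)
    g = torch.exp(-((xs - lx) ** 2 + (ys - ly) ** 2) / (2 * sigma ** 2))
    return g.max(dim=1, keepdim=True).values          # max over landmarks -> one channel


renderer = nn.Sequential(                              # trivial stand-in for the renderer
    nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)

ref_frame = torch.rand(2, 3, 64, 64)                   # appearance prior (reference frame)
landmarks = torch.rand(2, 68, 2) * 63                  # predicted facial landmarks
cond = torch.cat([ref_frame, landmarks_to_heatmap(landmarks)], dim=1)
frame = renderer(cond)                                  # (2, 3, 64, 64) rendered frame
```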
- DiffVoice: Text-to-Speech with Latent Diffusion [18.150627638754923]
We present DiffVoice, a novel text-to-speech model based on latent diffusion.
Subjective evaluations on LJSpeech and LibriTTS datasets demonstrate that our method beats the best publicly available systems in naturalness.
arXiv Detail & Related papers (2023-04-23T21:05:33Z)
- DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder [55.58582254514431]
We propose DAE-Talker to synthesize full video frames and produce natural head movements that align with the content of speech. We also introduce pose modelling in speech2latent for pose controllability. Our experiments show that DAE-Talker outperforms existing popular methods in lip-sync, video fidelity, and pose naturalness.
arXiv Detail & Related papers (2023-03-30T17:18:31Z)
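The speech2latent idea with explicit pose conditioning mentioned in the DAE-Talker entry above can be sketched, under assumptions, as a sequence model that maps per-frame speech features and a head-pose vector to frame latents; the module, dimensions, and pose representation below are illustrative placeholders, not the paper's implementation.

```python
# Assumed sketch: per-frame speech features + head pose -> per-frame latents,
# which a separately trained image decoder would turn into video frames.
import torch
import torch.nn as nn


class Speech2Latent(nn.Module):
    def __init__(self, speech_dim=512, pose_dim=6, latent_dim=512):
        super().__init__()
        self.proj = nn.Linear(speech_dim + pose_dim, latent_dim)
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, speech_feats, pose_seq):
        # speech_feats: (B, T, speech_dim); pose_seq: (B, T, pose_dim) per-frame head pose
        x = self.proj(torch.cat([speech_feats, pose_seq], dim=-1))
        return self.encoder(x)                         # (B, T, latent_dim) frame latents


frame_latents = Speech2Latent()(torch.randn(1, 50, 512), torch.zeros(1, 50, 6))
```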
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy or quality of the information presented and is not responsible for any consequences arising from its use.