Supervising 3D Talking Head Avatars with Analysis-by-Audio-Synthesis
- URL: http://arxiv.org/abs/2504.13386v1
- Date: Fri, 18 Apr 2025 00:24:52 GMT
- Title: Supervising 3D Talking Head Avatars with Analysis-by-Audio-Synthesis
- Authors: Radek Daněček, Carolin Schmitt, Senya Polikovsky, Michael J. Black
- Abstract summary: We propose THUNDER, a 3D talking head avatar framework that introduces a novel supervision mechanism via differentiable sound production. We show that THUNDER significantly improves the quality of the lip-sync of talking head avatars while still allowing for generation of diverse, high-quality, expressive facial animations.
- Score: 44.503709089687014
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In order to be widely applicable, speech-driven 3D head avatars must articulate their lips in accordance with speech, while also conveying the appropriate emotions with dynamically changing facial expressions. The key problem is that deterministic models produce high-quality lip-sync but without rich expressions, whereas stochastic models generate diverse expressions but with lower lip-sync quality. To get the best of both, we seek a stochastic model with accurate lip-sync. To that end, we develop a new approach based on the following observation: if a method generates realistic 3D lip motions, it should be possible to infer the spoken audio from the lip motion. The inferred speech should match the original input audio, and erroneous predictions create a novel supervision signal for training 3D talking head avatars with accurate lip-sync. To demonstrate this effect, we propose THUNDER (Talking Heads Under Neural Differentiable Elocution Reconstruction), a 3D talking head avatar framework that introduces a novel supervision mechanism via differentiable sound production. First, we train a novel mesh-to-speech model that regresses audio from facial animation. Then, we incorporate this model into a diffusion-based talking avatar framework. During training, the mesh-to-speech model takes the generated animation and produces a sound that is compared to the input speech, creating a differentiable analysis-by-audio-synthesis supervision loop. Our extensive qualitative and quantitative experiments demonstrate that THUNDER significantly improves the quality of the lip-sync of talking head avatars while still allowing for generation of diverse, high-quality, expressive facial animations.
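To make the analysis-by-audio-synthesis loop concrete, below is a minimal PyTorch-style sketch of how such a supervision term could be wired into training. The MeshToSpeech architecture, the audio_synthesis_loss formulation (an L1 loss on mel-spectrograms), and the avatar model's interface are illustrative assumptions, not THUNDER's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MeshToSpeech(nn.Module):
    """Hypothetical mesh-to-speech regressor: per-frame face vertices -> mel-spectrogram frames."""

    def __init__(self, n_vertices=5023, mel_bins=80, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_vertices * 3, hidden),
            nn.ReLU(),
            nn.Linear(hidden, mel_bins),
        )

    def forward(self, verts):
        # verts: (batch, frames, n_vertices, 3) animated face meshes
        b, t = verts.shape[:2]
        return self.net(verts.reshape(b, t, -1))  # (batch, frames, mel_bins)


def audio_synthesis_loss(mesh_to_speech, generated_verts, target_mel):
    """Analysis-by-audio-synthesis term: speech re-synthesized from the generated
    animation should match the mel-spectrogram of the input audio."""
    pred_mel = mesh_to_speech(generated_verts)
    return F.l1_loss(pred_mel, target_mel)


# Inside an (assumed) diffusion training step, the audio term is simply added
# to the usual denoising objective; lambda_audio balances the two losses.
def training_step(avatar_model, mesh_to_speech, audio_features, target_mel, lambda_audio=1.0):
    generated_verts, denoising_loss = avatar_model(audio_features)  # assumed interface
    total_loss = denoising_loss + lambda_audio * audio_synthesis_loss(
        mesh_to_speech, generated_verts, target_mel
    )
    return total_loss
```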
Related papers
- Perceptually Accurate 3D Talking Head Generation: New Definitions, Speech-Mesh Representation, and Evaluation Metrics [14.290468730787772]
We introduce a speech-mesh synchronized representation that captures intricate correspondences between speech signals and 3D face meshes.
Experiments show that training 3D talking head generation models with our perceptual loss significantly improves all three aspects of perceptually accurate lip synchronization.
arXiv Detail & Related papers (2025-03-26T08:18:57Z)
- ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model [41.35209566957009]
Speech-driven 3D facial animation aims to generate realistic lip movements and facial expressions for 3D head models from arbitrary audio clips.
We introduce a novel autoregressive model that achieves real-time generation of highly synchronized lip movements and realistic head poses and eye blinks.
arXiv Detail & Related papers (2025-02-27T17:49:01Z)
- AV-Flow: Transforming Text to Audio-Visual Human-like Interactions [101.31009576033776]
AV-Flow is an audio-visual generative model that animates photo-realistic 4D talking avatars given only text input.
We demonstrate human-like speech synthesis, synchronized lip motion, lively facial expressions and head pose.
arXiv Detail & Related papers (2025-02-18T18:56:18Z)
- GaussianSpeech: Audio-Driven Gaussian Avatars [76.10163891172192]
We introduce GaussianSpeech, a novel approach that synthesizes high-fidelity animation sequences of photo-realistic, personalized 3D human head avatars from spoken audio.
We propose a compact and efficient 3DGS-based avatar representation that generates expression-dependent color and leverages wrinkle- and perceptually-based losses to synthesize facial details.
arXiv Detail & Related papers (2024-11-27T18:54:08Z)
- Enhancing Speech-Driven 3D Facial Animation with Audio-Visual Guidance from Lip Reading Expert [13.60808166889775]
We introduce a method for speech-driven 3D facial animation that generates accurate lip movements.
A loss derived from a lip reading expert guides the speech-driven 3D facial animators to generate plausible lip motions aligned with the spoken transcripts.
We validate the effectiveness of our approach through extensive experiments, showing noticeable improvements in lip synchronization and lip readability.
arXiv Detail & Related papers (2024-07-01T07:39:28Z)
- GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance [83.43852715997596]
GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3D face model.
It can synthesize smooth lip dynamics while preserving the speaker's identity.
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
arXiv Detail & Related papers (2023-12-12T16:00:55Z)
- Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained on a video just a few minutes in length and achieves state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z)
- A Novel Speech-Driven Lip-Sync Model with CNN and LSTM [12.747541089354538]
We present a deep neural network combining one-dimensional convolutions and an LSTM to generate displacements of a 3D template face model from variable-length speech input.
To enhance the robustness of the network to different sound signals, we adapt a trained speech recognition model to extract speech features.
We show that our model is able to generate smooth and natural lip movements synchronized with speech.
arXiv Detail & Related papers (2022-05-02T13:57:50Z)
- DFA-NeRF: Personalized Talking Head Generation via Disentangled Face Attributes Neural Rendering [69.9557427451339]
We propose a framework based on neural radiance fields to pursue high-fidelity talking head generation.
Specifically, the neural radiance field takes lip-movement features and personalized attributes as two disentangled conditions.
We show that our method achieves significantly better results than state-of-the-art methods.
arXiv Detail & Related papers (2022-01-03T18:23:38Z)