VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis
- URL: http://arxiv.org/abs/2507.06060v2
- Date: Mon, 21 Jul 2025 13:19:17 GMT
- Title: VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis
- Authors: Alexandre Symeonidis-Herzig, Özge Mercanoğlu Sincan, Richard Bowden,
- Abstract summary: We propose VisualSpeaker, a novel method that bridges the gap using photorealistic differentiable rendering, supervised by visual speech recognition, for improved 3D facial animation.<n>Our contribution is a perceptual lip-reading loss, derived by passing 3D Gaussian Splatting avatar renders through a pre-trained Visual Automatic Speech Recognition model during training.<n> Evaluation on the MEAD dataset demonstrates that VisualSpeaker improves both the standard Lip Vertex Error metric by 56.1% and the perceptual quality of the generated animations, while retaining the controllability of mesh-driven animation.
- Score: 70.76837748695841
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Realistic, high-fidelity 3D facial animations are crucial for expressive avatar systems in human-computer interaction and accessibility. Although prior methods show promising quality, their reliance on the mesh domain limits their ability to fully leverage the rapid visual innovations seen in 2D computer vision and graphics. We propose VisualSpeaker, a novel method that bridges this gap using photorealistic differentiable rendering, supervised by visual speech recognition, for improved 3D facial animation. Our contribution is a perceptual lip-reading loss, derived by passing photorealistic 3D Gaussian Splatting avatar renders through a pre-trained Visual Automatic Speech Recognition model during training. Evaluation on the MEAD dataset demonstrates that VisualSpeaker improves both the standard Lip Vertex Error metric by 56.1% and the perceptual quality of the generated animations, while retaining the controllability of mesh-driven animation. This perceptual focus naturally supports accurate mouthings, essential cues that disambiguate similar manual signs in sign language avatars.
Related papers
- ScaffoldAvatar: High-Fidelity Gaussian Avatars with Patch Expressions [49.34398022152462]
We propose to couple locally-defined facial expressions with 3D Gaussian splatting to enable creating ultra-high fidelity, expressive and photorealistic 3D head avatars.<n>In particular, we leverage a patch-based geometric 3D face model to extract patch expressions and learn how to translate these into local dynamic skin appearance and motion.<n>We employ color-based densification and progressive training to obtain high-quality results and faster convergence for high resolution 3K training images.
arXiv Detail & Related papers (2025-07-14T17:59:03Z) - SmartAvatar: Text- and Image-Guided Human Avatar Generation with VLM AI Agents [91.26239311240873]
SmartAvatar is a vision-language-agent-driven framework for generating fully rigged, animation-ready 3D human avatars.<n>A key innovation is an autonomous verification loop, where the agent renders draft avatars.<n>The generated avatars are fully rigged and support pose manipulation with consistent identity and appearance.
arXiv Detail & Related papers (2025-06-05T03:49:01Z) - EVA: Expressive Virtual Avatars from Multi-view Videos [51.33851869426057]
We introduce Expressive Virtual Avatars (EVA), an actor-specific, fully controllable, and expressive human avatar framework.<n>EVA achieves high-fidelity, lifelike renderings in real time while enabling independent control of facial expressions, body movements, and hand gestures.<n>This work represents a significant advancement towards fully drivable digital human models.
arXiv Detail & Related papers (2025-05-21T11:22:52Z) - GaussianSpeech: Audio-Driven Gaussian Avatars [76.10163891172192]
We introduce GaussianSpeech, a novel approach that synthesizes high-fidelity animation sequences of photo-realistic, personalized 3D human head avatars from spoken audio.<n>We propose a compact and efficient 3DGS-based avatar representation that generates expression-dependent color and leverages wrinkle- and perceptually-based losses to synthesize facial details.
arXiv Detail & Related papers (2024-11-27T18:54:08Z) - DEGAS: Detailed Expressions on Full-Body Gaussian Avatars [13.683836322899953]
We present DEGAS, the first 3D Gaussian Splatting (3DGS)-based modeling method for full-body avatars with rich facial expressions.<n>We propose to adopt the expression latent space trained solely on 2D portrait images, bridging the gap between 2D talking faces and 3D avatars.
arXiv Detail & Related papers (2024-08-20T06:52:03Z) - Enhancing Speech-Driven 3D Facial Animation with Audio-Visual Guidance from Lip Reading Expert [13.60808166889775]
We introduce a method for speech-driven 3D facial animation to generate accurate lip movements.
This loss provides guidance to train the speech-driven 3D facial animators to generate plausible lip motions aligned with the spoken transcripts.
We validate the effectiveness of our approach through broad experiments, showing noticeable improvements in lip synchronization and lip readability performance.
arXiv Detail & Related papers (2024-07-01T07:39:28Z) - Attention-Based VR Facial Animation with Visual Mouth Camera Guidance
for Immersive Telepresence Avatars [19.70403947793871]
We present a hybrid method that uses both keypoints and direct visual guidance from a mouth camera.
Our method generalizes to unseen operators and requires only a quick enrolment step with capture of two short videos.
We highlight how the facial animation contributed to our victory at the ANA Avatar XPRIZE Finals.
arXiv Detail & Related papers (2023-12-15T12:45:11Z) - FaceXHuBERT: Text-less Speech-driven E(X)pressive 3D Facial Animation
Synthesis Using Self-Supervised Speech Representation Learning [0.0]
FaceXHuBERT is a text-less speech-driven 3D facial animation generation method.
It is very robust to background noise and can handle audio recorded in a variety of situations.
It produces superior results with respect to the realism of the animation 78% of the time.
arXiv Detail & Related papers (2023-03-09T17:05:19Z) - Pixel Codec Avatars [99.36561532588831]
Pixel Codec Avatars (PiCA) is a deep generative model of 3D human faces.
On a single Oculus Quest 2 mobile VR headset, 5 avatars are rendered in realtime in the same scene.
arXiv Detail & Related papers (2021-04-09T23:17:36Z) - Audio- and Gaze-driven Facial Animation of Codec Avatars [149.0094713268313]
We describe the first approach to animate Codec Avatars in real-time using audio and/or eye tracking.
Our goal is to display expressive conversations between individuals that exhibit important social signals.
arXiv Detail & Related papers (2020-08-11T22:28:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.