Audio Driven Real-Time Facial Animation for Social Telepresence
- URL: http://arxiv.org/abs/2510.01176v1
- Date: Wed, 01 Oct 2025 17:57:05 GMT
- Title: Audio Driven Real-Time Facial Animation for Social Telepresence
- Authors: Jiye Lee, Chenghui Li, Linh Tran, Shih-En Wei, Jason Saragih, Alexander Richard, Hanbyul Joo, Shaojie Bai
- Abstract summary: We present an audio-driven real-time system for animating photorealistic 3D facial avatars with minimal latency. Central to our approach is an encoder model that transforms audio signals into latent facial expression sequences in real time. We capture the rich spectrum of facial expressions necessary for natural communication while achieving real-time performance.
- Score: 65.66220599734338
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present an audio-driven real-time system for animating photorealistic 3D facial avatars with minimal latency, designed for social interactions in virtual reality for anyone. Central to our approach is an encoder model that transforms audio signals into latent facial expression sequences in real time, which are then decoded as photorealistic 3D facial avatars. Leveraging the generative capabilities of diffusion models, we capture the rich spectrum of facial expressions necessary for natural communication while achieving real-time performance (<15ms GPU time). Our novel architecture minimizes latency through two key innovations: an online transformer that eliminates dependency on future inputs and a distillation pipeline that accelerates iterative denoising into a single step. We further address critical design challenges in live scenarios for processing continuous audio signals frame-by-frame while maintaining consistent animation quality. The versatility of our framework extends to multimodal applications, including semantic modalities such as emotion conditions and multimodal sensors with head-mounted eye cameras on VR headsets. Experimental results demonstrate significant improvements in facial animation accuracy over existing offline state-of-the-art baselines, achieving 100 to 1000 times faster inference speed. We validate our approach through live VR demonstrations and across various scenarios such as multilingual speeches.
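Below is a minimal sketch of the two latency-oriented ideas the abstract describes: a causal ("online") transformer that never attends to future audio frames, and a distilled one-step denoiser that maps a noise sample plus audio context directly to a latent expression code in a single forward pass. This is not the authors' implementation; the module names, dimensions, and audio feature format are illustrative assumptions in PyTorch style.

```python
import torch
import torch.nn as nn


class OnlineAudioToExpression(nn.Module):
    """Causal transformer + one-step denoising head (illustrative sketch)."""

    def __init__(self, audio_dim=80, d_model=256, latent_dim=128, n_layers=4):
        super().__init__()
        self.in_proj = nn.Linear(audio_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, nhead=4, dim_feedforward=512, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Distilled "one-step" head: consumes a noise sample plus the audio
        # context and outputs a latent expression code without iterative
        # denoising (a stand-in for the distillation step named in the abstract).
        self.one_step_head = nn.Sequential(
            nn.Linear(d_model + latent_dim, d_model),
            nn.GELU(),
            nn.Linear(d_model, latent_dim))
        self.latent_dim = latent_dim

    def forward(self, audio_feats):
        # audio_feats: (batch, frames, audio_dim), e.g. per-frame log-mel bins.
        B, T, _ = audio_feats.shape
        x = self.in_proj(audio_feats)
        # Causal mask: frame t may only attend to frames <= t, so no future
        # audio is required and per-frame latency stays bounded.
        causal = torch.triu(
            torch.full((T, T), float("-inf"), device=x.device), diagonal=1)
        ctx = self.backbone(x, mask=causal)
        noise = torch.randn(B, T, self.latent_dim, device=x.device)
        return self.one_step_head(torch.cat([ctx, noise], dim=-1))


model = OnlineAudioToExpression()
codes = model(torch.randn(1, 30, 80))  # (1, 30, 128) latent expression codes
```

In a live setting the past activations would be cached so each incoming audio frame costs one incremental forward pass rather than re-encoding the whole history; the latent codes would then be decoded by the avatar model into photorealistic frames.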
Related papers
- REFA: Real-time Egocentric Facial Animations for Virtual Reality [56.82169742343143]
We present a novel system for real-time tracking of facial expressions using egocentric views captured from a set of infrared cameras embedded in a virtual reality (VR) headset. Our technology enables any user to accurately drive the facial expressions of virtual characters in a non-intrusive manner.
arXiv Detail & Related papers (2026-01-07T01:41:46Z)
- PersonaLive! Expressive Portrait Image Animation for Live Streaming [53.63615310186964]
PersonaLive is a novel diffusion-based framework for streaming real-time portrait animation. We first adopt hybrid implicit signals, namely implicit facial representations and 3D implicit keypoints, to achieve expressive image-level motion control. Experiments demonstrate that PersonaLive achieves state-of-the-art performance with up to 7-22x speedup over prior diffusion-based portrait animation models.
arXiv Detail & Related papers (2025-12-12T03:24:40Z)
- StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model [73.30619724574642]
Speech-driven 3D facial animation aims to generate realistic and synchronized facial motions driven by speech inputs. Recent methods have employed audio-conditioned diffusion models for 3D facial animation. We propose a novel autoregressive diffusion model that processes audio in a streaming manner.
arXiv Detail & Related papers (2025-11-18T07:55:16Z)
- AvatarSync: Rethinking Talking-Head Animation through Phoneme-Guided Autoregressive Perspective [15.69417162113696]
AvatarSync is an autoregressive framework built on phoneme representations that generates realistic talking-head animations from a single reference image. We show that AvatarSync outperforms existing talking-head animation methods in visual fidelity, temporal consistency, and computational efficiency.
arXiv Detail & Related papers (2025-09-15T15:34:02Z)
- VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis [70.76837748695841]
We propose VisualSpeaker, a novel method that bridges the gap using photorealistic differentiable rendering, supervised by visual speech recognition, for improved 3D facial animation. Our contribution is a perceptual lip-reading loss, derived by passing 3D Gaussian Splatting avatar renders through a pre-trained Visual Automatic Speech Recognition model during training. Evaluation on the MEAD dataset demonstrates that VisualSpeaker improves both the standard Lip Vertex Error metric by 56.1% and the perceptual quality of the generated animations, while retaining the controllability of mesh-driven animation.
arXiv Detail & Related papers (2025-07-08T15:04:17Z)
- Universal Facial Encoding of Codec Avatars from VR Headsets [32.60236093340087]
We present a method that can animate a photorealistic avatar in real time from head-mounted cameras (HMCs) on a consumer VR headset.
We present a lightweight expression calibration mechanism that increases accuracy with minimal additional cost to run-time efficiency.
arXiv Detail & Related papers (2024-07-17T22:08:15Z)
- VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time [35.43018966749148]
We introduce VASA, a framework for generating lifelike talking faces with appealing visual affective skills (VAS) given a single static image and a speech audio clip.
VASA-1 is capable of not only generating lip movements that are exquisitely synchronized with the audio, but also producing a large spectrum of facial nuances and natural head motions.
arXiv Detail & Related papers (2024-04-16T15:43:22Z)
- From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations [107.88375243135579]
Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands.
We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures.
Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods.
arXiv Detail & Related papers (2024-01-03T18:55:16Z)
- CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior [27.989344587876964]
Speech-driven 3D facial animation has been widely studied, yet a gap remains in achieving realism and vividness.
We propose to cast speech-driven facial animation as a code query task in a finite proxy space of the learned codebook.
We demonstrate that our approach outperforms current state-of-the-art methods both qualitatively and quantitatively.
arXiv Detail & Related papers (2023-01-06T05:04:32Z)
- MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement [142.9900055577252]
We propose a generic audio-driven facial animation approach that achieves highly realistic motion synthesis results for the entire face.
Our approach ensures highly accurate lip motion while also producing plausible animation of the parts of the face that are uncorrelated with the audio signal, such as eye blinks and eyebrow motion.
arXiv Detail & Related papers (2021-04-16T17:05:40Z)
- Audio- and Gaze-driven Facial Animation of Codec Avatars [149.0094713268313]
We describe the first approach to animate Codec Avatars in real time using audio and/or eye tracking.
Our goal is to display expressive conversations between individuals that exhibit important social signals.
arXiv Detail & Related papers (2020-08-11T22:28:48Z)