GaussianSpeech: Audio-Driven Gaussian Avatars
- URL: http://arxiv.org/abs/2411.18675v1
- Date: Wed, 27 Nov 2024 18:54:08 GMT
- Title: GaussianSpeech: Audio-Driven Gaussian Avatars
- Authors: Shivangi Aneja, Artem Sevastopolsky, Tobias Kirschstein, Justus Thies, Angela Dai, Matthias Nießner
- Abstract summary: We introduce GaussianSpeech, a novel approach that synthesizes high-fidelity animation sequences of photo-realistic, personalized 3D human head avatars from spoken audio.
We propose a compact and efficient 3DGS-based avatar representation that generates expression-dependent color and leverages wrinkle- and perceptually-based losses to synthesize facial details.
- Score: 76.10163891172192
- License:
- Abstract: We introduce GaussianSpeech, a novel approach that synthesizes high-fidelity animation sequences of photo-realistic, personalized 3D human head avatars from spoken audio. To capture the expressive, detailed nature of human heads, including skin furrowing and finer-scale facial movements, we propose to couple the speech signal with 3D Gaussian splatting to create realistic, temporally coherent motion sequences. We propose a compact and efficient 3DGS-based avatar representation that generates expression-dependent color and leverages wrinkle- and perceptually-based losses to synthesize facial details, including wrinkles that occur with different expressions. To enable sequence modeling of 3D Gaussian splats with audio, we devise an audio-conditioned transformer model capable of extracting lip and expression features directly from audio input. Due to the absence of high-quality datasets of talking humans in correspondence with audio, we captured a new large-scale multi-view dataset of audio-visual sequences of talking humans with native English accents and diverse facial geometry. GaussianSpeech consistently achieves state-of-the-art performance with visually natural motion at real-time rendering rates, while encompassing diverse facial expressions and styles.
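As a rough, non-authoritative illustration of the pipeline described above (frame-level audio features driving per-frame 3D Gaussian parameters through a transformer), the Python sketch below assumes precomputed audio embeddings and a fixed set of Gaussians; every module, dimension, and parameter count is hypothetical and not taken from the paper, and the expression-dependent color head and wrinkle/perceptual losses are omitted.

```python
import torch
import torch.nn as nn

class AudioToGaussianOffsets(nn.Module):
    """Hypothetical sketch: audio frames -> per-frame offsets for N Gaussians.

    Each Gaussian gets 3 (position) + 4 (rotation quaternion) + 3 (scale)
    + 1 (opacity) + 3 (color) = 14 offset values per frame.
    """

    def __init__(self, audio_dim=768, d_model=256, num_gaussians=10000, params_per_gaussian=14):
        super().__init__()
        self.num_gaussians = num_gaussians
        self.params_per_gaussian = params_per_gaussian
        self.audio_proj = nn.Linear(audio_dim, d_model)  # project audio features
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, dim_feedforward=512, batch_first=True
        )
        self.temporal_encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        self.head = nn.Linear(d_model, num_gaussians * params_per_gaussian)

    def forward(self, audio_feats):
        # audio_feats: (batch, frames, audio_dim), e.g. wav2vec-style embeddings (assumption)
        x = self.audio_proj(audio_feats)
        x = self.temporal_encoder(x)      # temporal self-attention over audio frames
        offsets = self.head(x)            # (batch, frames, N * 14)
        return offsets.view(*offsets.shape[:2], self.num_gaussians, self.params_per_gaussian)


if __name__ == "__main__":
    model = AudioToGaussianOffsets(num_gaussians=100)  # small N for the demo
    audio = torch.randn(1, 30, 768)                    # 30 audio frames
    print(model(audio).shape)                          # torch.Size([1, 30, 100, 14])
```

The predicted offsets would then deform a per-person set of 3D Gaussians before rasterization; how the paper actually conditions the transformer on audio and handles the identity-specific splats is not reproduced here.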
Related papers
- Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625]
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters.
Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
arXiv Detail & Related papers (2024-06-26T04:53:11Z)
- GaussianTalker: Speaker-specific Talking Head Synthesis via 3D Gaussian Splatting [27.699313086744237]
GaussianTalker is a novel method for audio-driven talking head synthesis based on 3D Gaussian Splatting.
The Speaker-specific Motion Translator achieves accurate lip movements specific to the target speaker through universalized audio feature extraction.
The Dynamic Gaussian Renderer introduces Speaker-specific BlendShapes to enhance facial detail representation via a latent pose.
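For background only: "blendshapes" generally denote a linear basis of per-vertex offsets combined with scalar weights. The minimal sketch below shows that generic formulation; the array shapes and values are invented and do not reflect GaussianTalker's speaker-specific blendshapes or latent pose.

```python
import numpy as np

def apply_blendshapes(base_vertices, blendshape_basis, weights):
    """Generic linear blendshape model: V = V0 + sum_k w_k * B_k.

    base_vertices:    (V, 3) neutral mesh
    blendshape_basis: (K, V, 3) per-vertex offsets for K blendshapes
    weights:          (K,) coefficients, e.g. predicted from an audio/pose code (assumption)
    """
    return base_vertices + np.tensordot(weights, blendshape_basis, axes=1)

# Toy example with 4 vertices and 2 blendshapes (all values hypothetical).
V0 = np.zeros((4, 3))
B = np.random.randn(2, 4, 3) * 0.01
w = np.array([0.8, 0.2])
print(apply_blendshapes(V0, B, w).shape)  # (4, 3)
```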
arXiv Detail & Related papers (2024-04-22T09:51:43Z)
- FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models [85.16273912625022]
We introduce FaceTalk, a novel generative approach designed for synthesizing high-fidelity 3D motion sequences of talking human heads from audio signal.
To the best of our knowledge, this is the first work to propose a generative approach for realistic and high-quality motion synthesis of human heads.
arXiv Detail & Related papers (2023-12-13T19:01:07Z)
- Pose-Controllable 3D Facial Animation Synthesis using Hierarchical Audio-Vertex Attention [52.63080543011595]
A novel pose-controllable 3D facial animation synthesis method is proposed by utilizing hierarchical audio-vertex attention.
The proposed method can produce more realistic facial expressions and head posture movements.
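As a loose sketch of the audio-vertex attention idea (not the paper's actual hierarchical design), the code below lets per-vertex query tokens attend to a sequence of audio-frame features with standard multi-head cross-attention; all dimensions and module names are assumptions.

```python
import torch
import torch.nn as nn

class AudioVertexCrossAttention(nn.Module):
    """Hypothetical sketch: vertex tokens (queries) attend to audio frames (keys/values)."""

    def __init__(self, vertex_dim=64, audio_dim=128, d_model=128, nhead=4):
        super().__init__()
        self.vertex_proj = nn.Linear(vertex_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=nhead, batch_first=True)
        self.out = nn.Linear(d_model, 3)  # per-vertex 3D displacement

    def forward(self, vertex_tokens, audio_frames):
        # vertex_tokens: (batch, V, vertex_dim); audio_frames: (batch, T, audio_dim)
        q = self.vertex_proj(vertex_tokens)
        kv = self.audio_proj(audio_frames)
        attended, _ = self.attn(query=q, key=kv, value=kv)  # each vertex aggregates audio context
        return self.out(attended)                           # (batch, V, 3) displacements


if __name__ == "__main__":
    model = AudioVertexCrossAttention()
    verts = torch.randn(1, 500, 64)   # 500 vertex tokens
    audio = torch.randn(1, 40, 128)   # 40 audio frames
    print(model(verts, audio).shape)  # torch.Size([1, 500, 3])
```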
arXiv Detail & Related papers (2023-02-24T09:36:31Z)
- Learning Audio-Driven Viseme Dynamics for 3D Face Animation [17.626644507523963]
We present a novel audio-driven facial animation approach that can generate realistic lip-synchronized 3D animations from the input audio.
Our approach learns viseme dynamics from speech videos, produces animator-friendly viseme curves, and supports multilingual speech inputs.
arXiv Detail & Related papers (2023-01-15T09:55:46Z)
- Imitator: Personalized Speech-driven 3D Facial Animation [63.57811510502906]
State-of-the-art methods deform the face topology of the target actor to sync with the input audio, without considering the identity-specific speaking style and facial idiosyncrasies of the target actor.
We present Imitator, a speech-driven facial expression synthesis method, which learns identity-specific details from a short input video.
We show that our approach produces temporally coherent facial expressions from input audio while preserving the speaking style of the target actors.
arXiv Detail & Related papers (2022-12-30T19:00:02Z)
- A Novel Speech-Driven Lip-Sync Model with CNN and LSTM [12.747541089354538]
We present a combined deep neural network of one-dimensional convolutions and LSTM to generate displacement of a 3D template face model from variable-length speech input.
In order to enhance the robustness of the network to different sound signals, we adapt a trained speech recognition model to extract speech features.
We show that our model is able to generate smooth and natural lip movements synchronized with speech.
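A minimal sketch of the general 1-D convolution + LSTM idea, assuming frame-level speech features (e.g., MFCCs) as input and per-vertex displacements of a template mesh as output; layer sizes, feature dimensions, and vertex counts are placeholders rather than the paper's values.

```python
import torch
import torch.nn as nn

class ConvLSTMLipSync(nn.Module):
    """Hypothetical sketch: speech features -> per-frame displacements of a template face."""

    def __init__(self, feat_dim=39, conv_channels=64, hidden=128, num_vertices=468):
        super().__init__()
        # 1-D convolutions over time capture short-range phonetic context.
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, conv_channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(conv_channels, conv_channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # LSTM models longer-range dynamics over the variable-length sequence.
        self.lstm = nn.LSTM(conv_channels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_vertices * 3)

    def forward(self, speech_feats):
        # speech_feats: (batch, frames, feat_dim), e.g. MFCC or ASR-model features (assumption)
        x = self.conv(speech_feats.transpose(1, 2)).transpose(1, 2)  # (batch, frames, C)
        x, _ = self.lstm(x)
        disp = self.head(x)                                          # (batch, frames, V*3)
        return disp.view(disp.shape[0], disp.shape[1], -1, 3)


if __name__ == "__main__":
    model = ConvLSTMLipSync(num_vertices=10)  # tiny template for the demo
    feats = torch.randn(1, 50, 39)            # 50 frames of 39-dim features
    print(model(feats).shape)                 # torch.Size([1, 50, 10, 3])
```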
arXiv Detail & Related papers (2022-05-02T13:57:50Z)
- Audio- and Gaze-driven Facial Animation of Codec Avatars [149.0094713268313]
We describe the first approach to animate Codec Avatars in real-time using audio and/or eye tracking.
Our goal is to display expressive conversations between individuals that exhibit important social signals.
arXiv Detail & Related papers (2020-08-11T22:28:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.