Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks,
Methods, and Applications
- URL: http://arxiv.org/abs/2311.18168v1
- Date: Thu, 30 Nov 2023 01:14:43 GMT
- Title: Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks,
Methods, and Applications
- Authors: Karren D. Yang, Anurag Ranjan, Jen-Hao Rick Chang, Raviteja
Vemulapalli, Oncel Tuzel
- Abstract summary: We consider the task of animating 3D facial geometry from speech signal.
Existing works are primarily deterministic, focusing on learning a one-to-one mapping from speech signal to 3D face meshes on small datasets with limited speakers.
- Score: 20.842799581850617
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the task of animating 3D facial geometry from speech signal.
Existing works are primarily deterministic, focusing on learning a one-to-one
mapping from speech signal to 3D face meshes on small datasets with limited
speakers. While these models can achieve high-quality lip articulation for
speakers in the training set, they are unable to capture the full and diverse
distribution of 3D facial motions that accompany speech in the real world.
Importantly, the relationship between speech and facial motion is one-to-many,
containing both inter-speaker and intra-speaker variations and necessitating a
probabilistic approach. In this paper, we identify and address key challenges
that have so far limited the development of probabilistic models: lack of
datasets and metrics that are suitable for training and evaluating them, as
well as the difficulty of designing a model that generates diverse results
while remaining faithful to a strong conditioning signal as speech. We first
propose large-scale benchmark datasets and metrics suitable for probabilistic
modeling. Then, we demonstrate a probabilistic model that achieves both
diversity and fidelity to speech, outperforming other methods across the
proposed benchmarks. Finally, we showcase useful applications of probabilistic
models trained on these large-scale datasets: we can generate diverse
speech-driven 3D facial motion that matches unseen speaker styles extracted
from reference clips; and our synthetic meshes can be used to improve the
performance of downstream audio-visual models.
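The abstract motivates both diversity-aware metrics and a model that balances diversity with fidelity to speech, but this listing does not spell those metrics out. The snippet below is only a minimal illustrative sketch of that evaluation idea, assuming mesh-sequence samples as arrays: sample-to-sample vertex spread as a diversity score, and lip-region error against a ground-truth capture as a fidelity score. All shapes, the vertex count, and the lip-vertex index range are hypothetical placeholders, not the paper's actual protocol.

```python
import numpy as np

def diversity_score(samples):
    """Mean pairwise vertex distance among K motion samples generated
    from the same speech clip; samples has shape (K, T, V, 3)."""
    K = samples.shape[0]
    dists = [np.linalg.norm(samples[i] - samples[j], axis=-1).mean()
             for i in range(K) for j in range(i + 1, K)]
    return float(np.mean(dists))

def lip_fidelity_error(sample, reference, lip_idx):
    """Mean per-frame error over lip-region vertices against a
    ground-truth capture (lower = more faithful to the speech)."""
    return float(np.linalg.norm(sample[:, lip_idx] - reference[:, lip_idx],
                                axis=-1).mean())

# Toy usage: 4 hypothetical samples of 100 frames over a 5023-vertex mesh.
rng = np.random.default_rng(0)
samples = rng.normal(size=(4, 100, 5023, 3))
reference = rng.normal(size=(100, 5023, 3))
lip_idx = np.arange(3000, 3100)  # placeholder lip-vertex indices
print(diversity_score(samples), lip_fidelity_error(samples[0], reference, lip_idx))
```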
Related papers
- Diverse Code Query Learning for Speech-Driven Facial Animation [2.1779479916071067]
Speech-driven facial animation aims to synthesize lip-synchronized 3D talking faces following the given speech signal.
We propose predicting multiple samples conditioned on the same audio signal and explicitly encouraging sample diversity, so that a single speech clip can drive diverse facial animations.
arXiv Detail & Related papers (2024-09-27T21:15:21Z)
- SAiD: Speech-driven Blendshape Facial Animation with Diffusion [6.4271091365094515]
Speech-driven 3D facial animation is challenging due to the scarcity of large-scale visual-audio datasets.
We propose SAiD, a speech-driven 3D facial animation approach built on a diffusion model: a lightweight Transformer-based U-Net with a cross-modality alignment bias between the audio and visual streams to enhance lip synchronization.
arXiv Detail & Related papers (2023-12-25T04:40:32Z)
- FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models [85.16273912625022]
We introduce FaceTalk, a novel generative approach designed for synthesizing high-fidelity 3D motion sequences of talking human heads from an audio signal.
To the best of our knowledge, this is the first work to propose a generative approach for realistic and high-quality motion synthesis of human heads.
arXiv Detail & Related papers (2023-12-13T19:01:07Z)
- GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance [83.43852715997596]
GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3D face model.
It can synthesize smooth lip dynamics while preserving the speaker's identity.
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
arXiv Detail & Related papers (2023-12-12T16:00:55Z)
- DualTalker: A Cross-Modal Dual Learning Approach for Speech-Driven 3D Facial Animation [10.73030153404956]
We propose a cross-modal dual-learning framework, termed DualTalker, to improve data usage efficiency.
The framework is trained jointly on the primary task (audio-driven facial animation) and its dual task (lip reading), sharing common audio/motion encoder components (a minimal sketch of this shared-encoder idea appears after this list).
Our approach outperforms current state-of-the-art methods both qualitatively and quantitatively.
arXiv Detail & Related papers (2023-11-08T15:39:56Z)
- BodyFormer: Semantics-guided 3D Body Gesture Synthesis with Transformer [42.87095473590205]
We propose a novel framework for automatic 3D body gesture synthesis from speech.
Our system is trained with either the Trinity speech-gesture dataset or the Talking With Hands 16.2M dataset.
The results show that our system can produce more realistic, appropriate, and diverse body gestures compared to existing state-of-the-art approaches.
arXiv Detail & Related papers (2023-09-07T01:11:11Z)
- Parametric Implicit Face Representation for Audio-Driven Facial Reenactment [52.33618333954383]
We propose a novel audio-driven facial reenactment framework that is both controllable and can generate high-quality talking heads.
Specifically, our parametric implicit representation parameterizes the implicit representation with interpretable parameters of 3D face models.
Our method can generate more realistic results than previous methods with greater fidelity to the identities and talking styles of speakers.
arXiv Detail & Related papers (2023-06-13T07:08:22Z)
- Generating Holistic 3D Human Motion from Speech [97.11392166257791]
We build a high-quality dataset of 3D holistic body meshes with synchronous speech.
We then define a novel speech-to-motion generation framework in which the face, body, and hands are modeled separately.
arXiv Detail & Related papers (2022-12-08T17:25:19Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
- Learning Speech-driven 3D Conversational Gestures from Video [106.15628979352738]
We propose the first approach to automatically and jointly synthesize both the synchronous 3D conversational body and hand gestures.
Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures.
We also contribute a new way to create a large corpus of more than 33 hours of annotated body, hand, and face data from in-the-wild videos of talking people.
arXiv Detail & Related papers (2021-02-13T01:05:39Z)
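The DualTalker entry above describes training the primary audio-driven animation task jointly with its dual lip-reading task through shared encoder components, but gives no architectural detail. The sketch below is a hypothetical PyTorch illustration of that weight-sharing idea only: the module sizes, the GRU encoder, the loss choices, and the 0.5 task weighting are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualTaskSketch(nn.Module):
    """Toy cross-modal dual learning: a shared audio encoder feeds the
    animation head, and a lip-reading head decodes phonemes from the
    predicted motion so both tasks supervise the shared weights."""
    def __init__(self, audio_dim=80, motion_dim=15069, vocab=40, hidden=256):
        super().__init__()
        self.audio_encoder = nn.GRU(audio_dim, hidden, batch_first=True)  # shared component
        self.motion_head = nn.Linear(hidden, motion_dim)   # primary: audio -> facial motion
        self.lipread_head = nn.Linear(motion_dim, vocab)   # dual: motion -> phoneme logits

    def forward(self, audio, motion_gt, phoneme_gt):
        feats, _ = self.audio_encoder(audio)               # (B, T, hidden)
        motion_pred = self.motion_head(feats)              # (B, T, motion_dim)
        phoneme_logits = self.lipread_head(motion_pred)    # lip reading on predicted motion
        loss_anim = F.mse_loss(motion_pred, motion_gt)
        loss_lip = F.cross_entropy(phoneme_logits.reshape(-1, phoneme_logits.size(-1)),
                                   phoneme_gt.reshape(-1))
        return loss_anim + 0.5 * loss_lip                  # joint objective (weight assumed)

# Toy usage with random tensors standing in for real features and labels.
model = DualTaskSketch()
audio = torch.randn(2, 50, 80)          # e.g. 80-dim mel features
motion = torch.randn(2, 50, 15069)      # flattened per-frame vertex offsets (assumed)
phonemes = torch.randint(0, 40, (2, 50))
print(model(audio, motion, phonemes).item())
```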