Related papers: EmoVOCA: Speech-Driven Emotional 3D Talking Heads

URL: http://arxiv.org/abs/2403.12886v2
Date: Wed, 11 Sep 2024 14:28:08 GMT
Title: EmoVOCA: Speech-Driven Emotional 3D Talking Heads
Authors: Federico Nocentini, Claudio Ferrari, Stefano Berretti
Abstract summary: We propose an innovative data-driven technique for creating a synthetic dataset, called EmoVOCA. We then designed and trained an emotional 3D talking head generator that accepts a 3D face, an audio file, an emotion label, and an intensity value as inputs, and learns to animate the audio-synchronized lip movements with expressive traits of the face.
Score: 12.161006152509653
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The domain of 3D talking head generation has witnessed significant progress in recent years. A notable challenge in this field consists in blending speech-related motions with expression dynamics, which is primarily caused by the lack of comprehensive 3D datasets that combine diversity in spoken sentences with a variety of facial expressions. Whereas literature works attempted to exploit 2D video data and parametric 3D models as a workaround, these still show limitations when jointly modeling the two motions. In this work, we address this problem from a different perspective, and propose an innovative data-driven technique that we used for creating a synthetic dataset, called EmoVOCA, obtained by combining a collection of inexpressive 3D talking heads and a set of 3D expressive sequences. To demonstrate the advantages of this approach, and the quality of the dataset, we then designed and trained an emotional 3D talking head generator that accepts a 3D face, an audio file, an emotion label, and an intensity value as inputs, and learns to animate the audio-synchronized lip movements with expressive traits of the face. Comprehensive experiments, both quantitative and qualitative, using our data and generator evidence superior ability in synthesizing convincing animations, when compared with the best performing methods in the literature. Our code and pre-trained model will be made available.

Related papers

Learning Disentangled Speech- and Expression-Driven Blendshapes for 3D Talking Face Animation [20.91704034858042]
We model facial animation driven by both speech and emotion as a linear additive problem. We learn a set of blendshapes driven by speech and emotion that can be mapped to the expression and jaw pose parameters of the FLAME model. Our approach achieves superior emotional expressivity compared to existing methods, without compromising lip-sync quality.
arXiv Detail & Related papers (2025-10-29T07:29:21Z)
EmoDiffusion: Enhancing Emotional 3D Facial Animation with Latent Diffusion Models [66.67979602235015]
EmoDiffusion is a novel approach that disentangles different emotions in speech to generate rich 3D emotional facial expressions. We capture facial expressions under the guidance of animation experts using LiveLinkFace on an iPhone.
arXiv Detail & Related papers (2025-03-14T02:54:22Z)
Towards Rich Emotions in 3D Avatars: A Text-to-3D Avatar Generation Benchmark [78.72350264142987]
Producing emotionally dynamic 3D facial avatars with text derived from spoken words (Emo3D) has been a pivotal research topic in 3D avatar generation. This paper reexamines Emo3D generation and draws inspiration from human processes, breaking down Emo3D into two cascading steps: Text-to-3D Expression Mapping (T3DEM) and 3D Avatar Rendering (3DAR). To address these challenges, we introduce a novel benchmark to advance research in Emo3D generation.
arXiv Detail & Related papers (2024-12-03T15:39:05Z)
GaussianSpeech: Audio-Driven Gaussian Avatars [76.10163891172192]
We introduce GaussianSpeech, a novel approach that synthesizes high-fidelity animation sequences of photo-realistic, personalized 3D human head avatars from spoken audio. We propose a compact and efficient 3DGS-based avatar representation that generates expression-dependent color and leverages wrinkle- and perceptually-based losses to synthesize facial details.
arXiv Detail & Related papers (2024-11-27T18:54:08Z)
MMHead: Towards Fine-grained Multi-modal 3D Facial Animation [68.04052669266174]
We construct a large-scale multi-modal 3D facial animation dataset, MMHead. MMHead consists of 49 hours of 3D facial motion sequences, speech audios, and rich hierarchical text annotations. Based on the MMHead dataset, we establish benchmarks for two new tasks: text-induced 3D talking head animation and text-to-3D facial motion generation.
arXiv Detail & Related papers (2024-10-10T09:37:01Z)
Emo3D: Metric and Benchmarking Dataset for 3D Facial Expression Generation from Emotion Description [3.52270271101496]
"Emo3D" is an extensive "Text-Image-Expression dataset" spanning a wide spectrum of human emotions. We generate a diverse array of textual descriptions, facilitating the capture of a broad spectrum of emotional expressions. "Emo3D" has great applications in animation design, virtual reality, and emotional human-computer interaction.
arXiv Detail & Related papers (2024-10-02T21:31:24Z)
Story3D-Agent: Exploring 3D Storytelling Visualization with Large Language Models [57.30913211264333]
We present Story3D-Agent, a pioneering approach that transforms provided narratives into 3D-rendered visualizations. By integrating procedural modeling, our approach enables precise control over multi-character actions and motions, as well as diverse decorative elements. We have thoroughly evaluated our Story3D-Agent to validate its effectiveness, offering a basic framework to advance 3D story representation.
arXiv Detail & Related papers (2024-08-21T17:43:15Z)
EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head [30.138347111341748]
We present a novel approach for synthesizing 3D talking heads with controllable emotion. Our model enables controllable emotion in the generated talking heads and can be rendered in wide-range views. Experiments demonstrate the effectiveness of our approach in generating high-fidelity and emotion-controllable 3D talking heads.
arXiv Detail & Related papers (2024-08-01T05:46:57Z)
FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models [85.16273912625022]
We introduce FaceTalk, a novel generative approach designed for synthesizing high-fidelity 3D motion sequences of talking human heads from audio signal. To the best of our knowledge, this is the first work to propose a generative approach for realistic and high-quality motion synthesis of human heads.
arXiv Detail & Related papers (2023-12-13T19:01:07Z)
Pose-Controllable 3D Facial Animation Synthesis using Hierarchical Audio-Vertex Attention [52.63080543011595]
A novel pose-controllable 3D facial animation synthesis method is proposed by utilizing hierarchical audio-vertex attention. The proposed method can produce more realistic facial expressions and head posture movements.
arXiv Detail & Related papers (2023-02-24T09:36:31Z)
Generating Holistic 3D Human Motion from Speech [97.11392166257791]
We build a high-quality dataset of 3D holistic body meshes with synchronous speech. We then define a novel speech-to-motion generation framework in which the face, body, and hands are modeled separately.
arXiv Detail & Related papers (2022-12-08T17:25:19Z)
SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation [33.651156455111916]
We present SadTalker, which generates 3D motion coefficients (head pose, expression) of the 3DMM from audio. Precisely, we present ExpNet to learn the accurate facial expression from audio by distilling both coefficients and 3D-rendered faces.
arXiv Detail & Related papers (2022-11-22T11:35:07Z)
3D-TalkEmo: Learning to Synthesize 3D Emotional Talking Head [13.305263646852087]
We introduce 3D-TalkEmo, a deep neural network that generates 3D talking head animation with various emotions. We also create a large 3D dataset with synchronized audios and videos, rich corpus, as well as various emotion states of different persons.
arXiv Detail & Related papers (2021-04-25T02:48:19Z)
Learning Speech-driven 3D Conversational Gestures from Video [106.15628979352738]
We propose the first approach to automatically and jointly synthesize both the synchronous 3D conversational body and hand gestures. Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures. We also contribute a new way to create a large corpus of more than 33 hours of annotated body, hand, and face data from in-the-wild videos of talking people.
arXiv Detail & Related papers (2021-02-13T01:05:39Z)

This list is automatically generated from the titles and abstracts of the papers on this site.