Emotional Speech-Driven Animation with Content-Emotion Disentanglement
- URL: http://arxiv.org/abs/2306.08990v2
- Date: Tue, 26 Sep 2023 16:25:03 GMT
- Title: Emotional Speech-Driven Animation with Content-Emotion Disentanglement
- Authors: Radek Dan\v{e}\v{c}ek, Kiran Chhatre, Shashank Tripathi, Yandong Wen,
Michael J. Black, Timo Bolkart
- Abstract summary: We propose EMOTE, which generates 3D talking-head avatars that maintain lip-sync from speech while enabling explicit control over the expression of emotion.
EMOTE produces speech-driven facial animations with better lip-sync than state-of-the-art methods trained on the same data.
- Score: 51.34635009347183
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To be widely adopted, 3D facial avatars must be animated easily,
realistically, and directly from speech signals. While the best recent methods
generate 3D animations that are synchronized with the input audio, they largely
ignore the impact of emotions on facial expressions. Realistic facial animation
requires lip-sync together with the natural expression of emotion. To that end,
we propose EMOTE (Expressive Model Optimized for Talking with Emotion), which
generates 3D talking-head avatars that maintain lip-sync from speech while
enabling explicit control over the expression of emotion. To achieve this, we
supervise EMOTE with decoupled losses for speech (i.e., lip-sync) and emotion.
These losses are based on two key observations: (1) deformations of the face
due to speech are spatially localized around the mouth and have high temporal
frequency, whereas (2) facial expressions may deform the whole face and occur
over longer intervals. Thus, we train EMOTE with a per-frame lip-reading loss
to preserve the speech-dependent content, while supervising emotion at the
sequence level. Furthermore, we employ a content-emotion exchange mechanism in
order to supervise different emotions on the same audio, while maintaining the
lip motion synchronized with the speech. To employ deep perceptual losses
without getting undesirable artifacts, we devise a motion prior in the form of
a temporal VAE. Due to the absence of high-quality aligned emotional 3D face
datasets with speech, EMOTE is trained with 3D pseudo-ground-truth extracted
from an emotional video dataset (i.e., MEAD). Extensive qualitative and
perceptual evaluations demonstrate that EMOTE produces speech-driven facial
animations with better lip-sync than state-of-the-art methods trained on the
same data, while offering additional, high-quality emotional control.
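To make the decoupled supervision concrete, below is a minimal PyTorch-style sketch of how a per-frame lip-reading loss, a sequence-level emotion loss, and the content-emotion exchange could be wired together. This is an illustrative sketch, not the authors' implementation: the `lip_reader` and `emotion_net` feature extractors, the tensor shapes, and the loss weights `w_lip` and `w_emo` are assumptions made here for clarity, since the abstract describes these components only at the level of deep perceptual losses.

```python
import torch.nn.functional as F


def decoupled_losses(pred_verts, gt_verts, lip_reader, emotion_net,
                     w_lip=1.0, w_emo=0.5):
    """Sketch of EMOTE-style decoupled supervision (shapes and weights assumed).

    pred_verts, gt_verts: (T, V, 3) per-frame 3D face meshes.
    lip_reader:  frozen perceptual network returning one feature vector per frame
                 (captures high-frequency, mouth-localized speech content).
    emotion_net: frozen perceptual network returning one embedding per sequence
                 (captures low-frequency, whole-face emotional expression).
    """
    # Per-frame lip-reading loss: preserves speech-dependent mouth motion.
    loss_lip = F.mse_loss(lip_reader(pred_verts), lip_reader(gt_verts))

    # Sequence-level emotion loss: supervises emotion without constraining
    # individual frames.
    emo_pred = emotion_net(pred_verts.unsqueeze(0))  # (1, D_emo)
    emo_gt = emotion_net(gt_verts.unsqueeze(0))      # (1, D_emo)
    loss_emo = 1.0 - F.cosine_similarity(emo_pred, emo_gt).mean()

    return w_lip * loss_lip + w_emo * loss_emo


def exchange_losses(pred_exchanged, gt_content, gt_emotion,
                    lip_reader, emotion_net, w_lip=1.0, w_emo=0.5):
    """Content-emotion exchange (sketch): `pred_exchanged` is generated from the
    audio of `gt_content` but conditioned on the emotion of `gt_emotion`.
    Lip content is supervised against the former and emotion against the latter,
    so the same audio can be animated with different emotions."""
    loss_lip = F.mse_loss(lip_reader(pred_exchanged), lip_reader(gt_content))
    emo_pred = emotion_net(pred_exchanged.unsqueeze(0))
    emo_tgt = emotion_net(gt_emotion.unsqueeze(0))
    loss_emo = 1.0 - F.cosine_similarity(emo_pred, emo_tgt).mean()
    return w_lip * loss_lip + w_emo * loss_emo
```

The temporal VAE motion prior that EMOTE devises to keep these perceptual losses from introducing artifacts is omitted from this sketch for brevity.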
Related papers
- EMOdiffhead: Continuously Emotional Control in Talking Head Generation via Diffusion [5.954758598327494]
EMOdiffhead is a novel method for emotional talking head video generation.
It enables fine-grained control of emotion categories and intensities.
It achieves state-of-the-art performance compared to other emotion portrait animation methods.
arXiv Detail & Related papers (2024-09-11T13:23:22Z)
- DEEPTalk: Dynamic Emotion Embedding for Probabilistic Speech-Driven 3D Face Animation [14.07086606183356]
Speech-driven 3D facial animation has attracted considerable attention thanks to its broad range of applications.
Current methods fail to capture the nuanced emotional undertones conveyed through speech and produce monotonous facial motion.
We introduce DEEPTalk, a novel approach that generates diverse and emotionally rich 3D facial expressions directly from speech inputs.
arXiv Detail & Related papers (2024-08-12T08:56:49Z)
- EmoFace: Audio-driven Emotional 3D Face Animation [3.573880705052592]
EmoFace is a novel audio-driven methodology for creating facial animations with vivid emotional dynamics.
Our approach can generate facial expressions with multiple emotions and can also produce random yet natural blinks and eye movements.
The proposed method can be applied to producing dialogue animations for non-playable characters in video games and to driving avatars in virtual reality environments.
arXiv Detail & Related papers (2024-07-17T11:32:16Z)
- EmoSpeaker: One-shot Fine-grained Emotion-Controlled Talking Face Generation [34.5592743467339]
We propose a visual attribute-guided audio decoupler to generate fine-grained facial animations.
To achieve more precise emotional expression, we introduce a fine-grained emotion coefficient prediction module.
Our proposed method, EmoSpeaker, outperforms existing emotional talking face generation methods in terms of expression variation and lip synchronization.
arXiv Detail & Related papers (2024-02-02T14:04:18Z)
- Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate speech according to a given emotion while preserving its non-emotional components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z)
- DREAM-Talk: Diffusion-based Realistic Emotional Audio-driven Method for Single Image Talking Face Generation [75.90730434449874]
We introduce DREAM-Talk, a two-stage diffusion-based audio-driven framework, tailored for generating diverse expressions and accurate lip-sync concurrently.
Given the strong correlation between lip motion and audio, we then refine the dynamics with enhanced lip-sync accuracy using audio features and emotion style.
Both quantitatively and qualitatively, DREAM-Talk outperforms state-of-the-art methods in terms of expressiveness, lip-sync accuracy and perceptual quality.
arXiv Detail & Related papers (2023-12-21T05:03:18Z)
- EmoTalk: Speech-Driven Emotional Disentanglement for 3D Face Animation [28.964917860664492]
Speech-driven 3D face animation aims to generate realistic facial expressions that match the speech content and emotion.
This paper proposes an end-to-end neural network to disentangle different emotions in speech so as to generate rich 3D facial expressions.
Our approach outperforms state-of-the-art methods and exhibits more diverse facial movements.
arXiv Detail & Related papers (2023-03-20T13:22:04Z)
- Learning to Dub Movies via Hierarchical Prosody Models [167.6465354313349]
Given a piece of text, a video clip, and a reference audio, the movie dubbing task (also known as visual voice cloning, V2C) aims to generate speech that matches the speaker's emotion shown in the video, using the desired speaker's voice as reference.
We propose a novel movie dubbing architecture that tackles these problems via hierarchical prosody modelling, which bridges the visual information to the corresponding speech prosody at three levels: lip, face, and scene.
arXiv Detail & Related papers (2022-12-08T03:29:04Z)
- EMOCA: Emotion Driven Monocular Face Capture and Animation [59.15004328155593]
We introduce a novel deep perceptual emotion consistency loss during training, which helps ensure that the reconstructed 3D expression matches the expression depicted in the input image.
On the task of in-the-wild emotion recognition, our purely geometric approach is on par with the best image-based methods, highlighting the value of 3D geometry in analyzing human behavior.
arXiv Detail & Related papers (2022-04-24T15:58:35Z)
- Audio-Driven Emotional Video Portraits [79.95687903497354]
We present Emotional Video Portraits (EVP), a system for synthesizing high-quality video portraits with vivid emotional dynamics driven by audio.
Specifically, we propose the Cross-Reconstructed Emotion Disentanglement technique to decompose speech into two decoupled spaces.
With the disentangled features, dynamic 2D emotional facial landmarks can be deduced.
Then we propose the Target-Adaptive Face Synthesis technique to generate the final high-quality video portraits.
arXiv Detail & Related papers (2021-04-15T13:37:13Z)