High-fidelity Generalized Emotional Talking Face Generation with
Multi-modal Emotion Space Learning
- URL: http://arxiv.org/abs/2305.02572v2
- Date: Wed, 31 May 2023 03:41:12 GMT
- Title: High-fidelity Generalized Emotional Talking Face Generation with
Multi-modal Emotion Space Learning
- Authors: Chao Xu, Junwei Zhu, Jiangning Zhang, Yue Han, Wenqing Chu, Ying Tai,
Chengjie Wang, Zhifeng Xie, Yong Liu
- Abstract summary: We propose a more flexible and generalized framework for talking face generation.
Specifically, we additionally express the emotion style through text prompts and use an Aligned Multi-modal Emotion encoder to embed the text, image, and audio emotion modalities into a unified space.
An Emotion-aware Audio-to-3DMM Convertor is proposed to map the emotion condition and the audio sequence to a structural representation.
- Score: 43.09015109281053
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, emotional talking face generation has received considerable
attention. However, existing methods adopt only one-hot coding, an image, or
audio as the emotion condition, so they lack flexible control in practical
applications and fail to handle unseen emotion styles due to limited semantics.
They also neglect either the one-shot setting or the quality of the generated
faces. In this paper, we propose a more flexible and generalized framework.
Specifically, we additionally express the emotion style through text prompts
and use an Aligned Multi-modal Emotion encoder to embed the text, image, and
audio emotion modalities into a unified space, which inherits a rich semantic
prior from CLIP. Consequently, effective multi-modal emotion space learning
enables our method to accept an arbitrary emotion modality during testing and
to generalize to unseen emotion styles. Besides, an Emotion-aware
Audio-to-3DMM Convertor is proposed to map the emotion condition and the audio
sequence to a structural representation. A subsequent style-based
High-fidelity Emotional Face generator then synthesizes realistic,
high-resolution faces for arbitrary identities. Our texture generator
hierarchically learns flow fields and animated faces in a residual manner.
Extensive experiments demonstrate the flexibility and generalization of our
method in emotion control and the effectiveness of high-quality face synthesis.
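As a rough illustration of the multi-modal emotion space described above, the following Python sketch projects text, image, and audio emotion features into one shared embedding space and aligns paired embeddings with a CLIP-style symmetric contrastive loss. The module names, feature dimensions, and toy linear encoders are illustrative assumptions, not the authors' implementation; the paper's encoder inherits its semantic prior from CLIP, whereas plain linear layers stand in here.

```python
# Minimal sketch (not the authors' code): modality-specific emotion encoders
# projected into one shared space and aligned with a symmetric contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AlignedMultiModalEmotionEncoder(nn.Module):
    def __init__(self, text_dim=512, image_dim=768, audio_dim=128, emb_dim=256):
        super().__init__()
        # In the paper these would wrap pretrained (CLIP-derived) backbones;
        # here simple linear projections stand in for them.
        self.text_proj = nn.Linear(text_dim, emb_dim)
        self.image_proj = nn.Linear(image_dim, emb_dim)
        self.audio_proj = nn.Linear(audio_dim, emb_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07)

    def encode(self, feats, modality):
        proj = {"text": self.text_proj,
                "image": self.image_proj,
                "audio": self.audio_proj}[modality]
        return F.normalize(proj(feats), dim=-1)

    def alignment_loss(self, emb_a, emb_b):
        # Symmetric contrastive loss pulling paired emotion embeddings together.
        logits = self.logit_scale.exp() * emb_a @ emb_b.t()
        targets = torch.arange(emb_a.size(0), device=emb_a.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    enc = AlignedMultiModalEmotionEncoder()
    text_feat = torch.randn(4, 512)   # e.g. text features for "a happy face"
    audio_feat = torch.randn(4, 128)  # e.g. pooled emotional-audio features
    z_text = enc.encode(text_feat, "text")
    z_audio = enc.encode(audio_feat, "audio")
    print(enc.alignment_loss(z_text, z_audio).item())
```

Because all three modalities land in the same space, any one of them can serve as the emotion condition at test time, which is what underlies the flexibility and generalization to unseen emotion styles claimed in the abstract.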
Related papers
- Emotional Face-to-Speech [13.725558939494407]
Existing face-to-speech methods offer great promise in capturing identity characteristics but struggle to generate diverse vocal styles with emotional expression.
We introduce DEmoFace, a novel generative framework that leverages a discrete diffusion transformer (DiT) with curriculum learning.
We develop an enhanced predictor-free guidance to handle diverse conditioning scenarios, enabling multi-conditional generation and disentangling complex attributes effectively.
arXiv Detail & Related papers (2025-02-03T04:48:50Z) - MoEE: Mixture of Emotion Experts for Audio-Driven Portrait Animation [39.30784838378127]
The generation of talking avatars has achieved significant advancements in precise audio synchronization.
Current methods face fundamental challenges, including the lack of frameworks for modeling single basic emotional expressions.
We propose the Mixture of Emotion Experts (MoEE) model, which decouples six fundamental emotions to enable the precise synthesis of both singular and compound emotional states.
In conjunction with the DH-FaceEmoVid-150 dataset, we demonstrate that the MoEE framework excels in generating complex emotional expressions and nuanced facial details.
arXiv Detail & Related papers (2025-01-03T13:43:21Z) - EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector [26.656512860918262]
- EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector [26.656512860918262]
EmoSphere++ is an emotion-controllable zero-shot TTS model that can control emotional style and intensity to resemble natural human speech.
We introduce a novel emotion-adaptive spherical vector that models emotional style and intensity without human annotation.
We employ a conditional flow matching-based decoder to achieve high-quality and expressive emotional TTS in a few sampling steps.
arXiv Detail & Related papers (2024-11-04T21:33:56Z) - EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control [7.596581158724187]
- EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control [7.596581158724187]
EmoKnob is a framework that allows fine-grained emotion control in speech synthesis with few-shot demonstrative samples of arbitrary emotion.
We show that our emotion control framework effectively embeds emotions into speech and surpasses the emotion expressiveness of commercial TTS services.
arXiv Detail & Related papers (2024-10-01T01:29:54Z) - EmoSpeaker: One-shot Fine-grained Emotion-Controlled Talking Face
Generation [34.5592743467339]
We propose a visual attribute-guided audio decoupler to generate fine-grained facial animations.
To achieve more precise emotional expression, we introduce a fine-grained emotion coefficient prediction module.
Our proposed method, EmoSpeaker, outperforms existing emotional talking face generation methods in terms of expression variation and lip synchronization.
arXiv Detail & Related papers (2024-02-02T14:04:18Z) - Emotion Rendering for Conversational Speech Synthesis with Heterogeneous
Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity.
Our model outperforms the baseline models in understanding and rendering emotions.
arXiv Detail & Related papers (2023-12-19T08:47:50Z) - Emotionally Enhanced Talking Face Generation [52.07451348895041]
We build a talking face generation framework conditioned on a categorical emotion to generate videos with appropriate expressions.
We show that our model can adapt to arbitrary identities, emotions, and languages.
Our proposed framework is equipped with a user-friendly web interface with a real-time experience for talking face generation with emotions.
arXiv Detail & Related papers (2023-03-21T02:33:27Z) - Emotion Intensity and its Control for Emotional Voice Conversion [77.05097999561298]
Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity.
In this paper, we aim to explicitly characterize and control the intensity of emotion.
We propose to disentangle the speaker style from the linguistic content and encode the speaker style into a style embedding in a continuous space that forms the prototype of the emotion embedding.
arXiv Detail & Related papers (2022-01-10T02:11:25Z) - Emotion Recognition from Multiple Modalities: Fundamentals and
Methodologies [106.62835060095532]
We discuss several key aspects of multi-modal emotion recognition (MER).
We begin with a brief introduction on widely used emotion representation models and affective modalities.
We then summarize existing emotion annotation strategies and corresponding computational tasks.
Finally, we outline several real-world applications and discuss some future directions.
arXiv Detail & Related papers (2021-08-18T21:55:20Z) - Enhancing Cognitive Models of Emotions with Representation Learning [58.2386408470585]
We present a novel deep learning-based framework to generate embedding representations of fine-grained emotions.
Our framework integrates a contextualized embedding encoder with a multi-head probing model.
Our model is evaluated on the Empathetic Dialogue dataset and achieves state-of-the-art results in classifying 32 emotions.
arXiv Detail & Related papers (2021-04-20T16:55:15Z)