Learning Frame-Wise Emotion Intensity for Audio-Driven Talking-Head Generation
- URL: http://arxiv.org/abs/2409.19501v1
- Date: Sun, 29 Sep 2024 01:02:01 GMT
- Title: Learning Frame-Wise Emotion Intensity for Audio-Driven Talking-Head Generation
- Authors: Jingyi Xu, Hieu Le, Zhixin Shu, Yang Wang, Yi-Hsuan Tsai, Dimitris Samaras
- Abstract summary: We propose a method for capturing and generating subtle shifts in emotion intensity for talking-head generation.
We develop a talking-head framework that is capable of generating a variety of emotions with precise control over intensity levels.
Experiments and analyses validate the effectiveness of our proposed method.
- Score: 59.81482518924723
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human emotional expression is inherently dynamic, complex, and fluid, characterized by smooth transitions in intensity throughout verbal communication. However, the modeling of such intensity fluctuations has been largely overlooked by previous audio-driven talking-head generation methods, which often results in static emotional outputs. In this paper, we explore how emotion intensity fluctuates during speech, proposing a method for capturing and generating these subtle shifts for talking-head generation. Specifically, we develop a talking-head framework that is capable of generating a variety of emotions with precise control over intensity levels. This is achieved by learning a continuous emotion latent space, where emotion types are encoded in latent orientations and emotion intensity is reflected in latent norms. In addition, to capture the dynamic intensity fluctuations, we adopt an audio-to-intensity predictor that considers the speaking tone, which reflects the intensity. The training signals for this predictor are obtained through our emotion-agnostic intensity pseudo-labeling method, without the need for frame-wise intensity labels. Extensive experiments and analyses validate the effectiveness of our proposed method in accurately capturing and reproducing emotion intensity fluctuations in talking-head generation, thereby significantly enhancing the expressiveness and realism of the generated outputs.
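To make the latent-space idea concrete, below is a minimal sketch (not the authors' code) of how frame-wise emotion latents could be assembled when the emotion type is encoded as a latent direction and the intensity as the latent norm. The `emotion_directions` table, the 3-dimensional latent size, and the sine-shaped intensity curve are all illustrative assumptions standing in for learned components such as the paper's audio-to-intensity predictor.

```python
# Minimal sketch: direction of the latent selects the emotion type,
# its norm sets the per-frame intensity. Names and shapes are illustrative.
import numpy as np

# Hypothetical learned unit-norm directions, one per emotion category.
emotion_directions = {
    "happy": np.array([1.0, 0.0, 0.0]),
    "sad":   np.array([0.0, 1.0, 0.0]),
    "angry": np.array([0.0, 0.0, 1.0]),
}

def emotion_latents(emotion: str, intensities: np.ndarray) -> np.ndarray:
    """Build frame-wise emotion latents.

    intensities: shape (num_frames,), values in [0, 1], e.g. the output of
    a frame-wise audio-to-intensity predictor.
    Returns an array of shape (num_frames, latent_dim) whose rows all point
    along the chosen emotion direction, scaled by the intensity.
    """
    direction = emotion_directions[emotion]
    direction = direction / np.linalg.norm(direction)      # keep unit norm
    return intensities[:, None] * direction[None, :]

# Example: a smoothly rising-then-falling intensity curve over 100 frames,
# standing in for a predicted intensity trajectory on a speech clip.
t = np.linspace(0.0, np.pi, 100)
intensity_curve = np.sin(t)                                # peaks mid-utterance
latents = emotion_latents("happy", intensity_curve)
print(latents.shape)                                       # (100, 3)
print(np.linalg.norm(latents, axis=1)[:5])                 # norms track the curve
```

Scaling only the norm leaves the emotion type fixed while sweeping from neutral (norm 0) to full intensity, which is the kind of continuous control the abstract describes.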
Related papers
- EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech [34.03787613163788]
EmoSphere-TTS synthesizes expressive emotional speech by using a spherical emotion vector to control the emotional style and intensity of the synthetic speech.
We propose a dual conditional adversarial network to improve the quality of generated speech by reflecting the multi-aspect characteristics.
arXiv Detail & Related papers (2024-06-12T01:40:29Z)
- Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate speech according to a given emotion while preserving non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z)
- AffectEcho: Speaker Independent and Language-Agnostic Emotion and Affect Transfer for Speech Synthesis [13.918119853846838]
Affect is an emotional characteristic encompassing valence, arousal, and intensity, and is a crucial attribute for enabling authentic conversations.
We propose AffectEcho, an emotion translation model that uses a Vector Quantized codebook to model emotions within a quantized space.
We demonstrate the effectiveness of our approach in controlling the emotions of generated speech while preserving identity, style, and emotional cadence unique to each speaker.
arXiv Detail & Related papers (2023-08-16T06:28:29Z)
- In-the-wild Speech Emotion Conversion Using Disentangled Self-Supervised Representations and Neural Vocoder-based Resynthesis [15.16865739526702]
We introduce a methodology that uses self-supervised networks to disentangle the lexical, speaker, and emotional content of the utterance.
We then use a HiFiGAN vocoder to resynthesise the disentangled representations to a speech signal of the targeted emotion.
Results reveal that the proposed approach is aptly conditioned on the emotional content of input speech and is capable of synthesising natural-sounding speech for a target emotion.
arXiv Detail & Related papers (2023-06-02T21:02:51Z)
- Semi-supervised learning for continuous emotional intensity controllable speech synthesis with disentangled representations [16.524515747017787]
We propose a novel method to control the continuous intensity of emotions using semi-supervised learning.
The experimental results showed that the proposed method was superior in controllability and naturalness.
arXiv Detail & Related papers (2022-11-11T12:28:07Z)
- Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning [70.30713251031052]
We propose a data-driven deep learning model, i.e. StrengthNet, to improve the generalization of emotion strength assessment for seen and unseen speech.
Experiments show that the predicted emotion strength of the proposed StrengthNet is highly correlated with ground truth scores for both seen and unseen speech.
arXiv Detail & Related papers (2022-06-15T01:25:32Z)
- Emotion Intensity and its Control for Emotional Voice Conversion [77.05097999561298]
Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity.
In this paper, we aim to explicitly characterize and control the intensity of emotion.
We propose to disentangle the speaker style from linguistic content and encode the speaker style into a style embedding in a continuous space that forms the prototype of emotion embedding.
arXiv Detail & Related papers (2022-01-10T02:11:25Z)
- Emotion-aware Chat Machine: Automatic Emotional Response Generation for Human-like Emotional Interaction [55.47134146639492]
This article proposes a unified end-to-end neural architecture, which is capable of simultaneously encoding the semantics and the emotions in a post.
Experiments on real-world data demonstrate that the proposed method outperforms the state-of-the-art methods in terms of both content coherence and emotion appropriateness.
arXiv Detail & Related papers (2021-06-06T06:26:15Z)
- Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset [84.53659233967225]
Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity.
We propose a novel framework based on a variational auto-encoding Wasserstein generative adversarial network (VAW-GAN).
We show that the proposed framework achieves remarkable performance by consistently outperforming the baseline framework.
arXiv Detail & Related papers (2020-10-28T07:16:18Z)