Human Feedback Driven Dynamic Speech Emotion Recognition
- URL: http://arxiv.org/abs/2508.14920v1
- Date: Mon, 18 Aug 2025 17:25:27 GMT
- Title: Human Feedback Driven Dynamic Speech Emotion Recognition
- Authors: Ilya Fedorov, Dmitry Korobchenko
- Abstract summary: The study particularly focuses on the animation of emotional 3D avatars. We propose a multi-stage method that includes the training of a classical speech emotion recognition model. We introduce a novel approach to modeling emotional mixtures based on the Dirichlet distribution.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work proposes to explore a new area of dynamic speech emotion recognition. Unlike traditional methods, we assume that each audio track is associated with a sequence of emotions active at different moments in time. The study particularly focuses on the animation of emotional 3D avatars. We propose a multi-stage method that includes the training of a classical speech emotion recognition model, synthetic generation of emotional sequences, and further model improvement based on human feedback. Additionally, we introduce a novel approach to modeling emotional mixtures based on the Dirichlet distribution. The models are evaluated based on ground-truth emotions extracted from a dataset of 3D facial animations. We compare our models against the sliding window approach. Our experimental results show the effectiveness of the Dirichlet-based approach in modeling emotional mixtures. Incorporating human feedback further improves the model quality while providing a simplified annotation procedure.
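A minimal sketch of the Dirichlet-based mixture idea described in the abstract, not the authors' code: the emotion set, feature dimension, and module name below are assumptions. A per-frame head predicts Dirichlet concentration parameters over emotion categories, and the expected value of that distribution gives the emotion mixture for each frame.

```python
import torch
import torch.nn as nn

EMOTIONS = ["neutral", "happy", "sad", "angry", "surprised"]  # assumed label set


class DirichletMixtureHead(nn.Module):
    """Maps per-frame audio features to a Dirichlet distribution over emotion mixtures."""

    def __init__(self, feat_dim: int, num_emotions: int = len(EMOTIONS)):
        super().__init__()
        self.proj = nn.Linear(feat_dim, num_emotions)

    def forward(self, frame_feats: torch.Tensor) -> torch.distributions.Dirichlet:
        # Softplus keeps the concentration parameters strictly positive, as a Dirichlet requires.
        alpha = nn.functional.softplus(self.proj(frame_feats)) + 1e-4
        return torch.distributions.Dirichlet(alpha)


# Usage: 100 frames of 256-dimensional audio features for one track (synthetic here).
head = DirichletMixtureHead(feat_dim=256)
frames = torch.randn(100, 256)
dist = head(frames)
mixture = dist.mean  # (100, 5): expected per-frame emotion mixture
# Training could minimize the negative log-likelihood of annotated mixtures:
target = torch.full((100, len(EMOTIONS)), 1.0 / len(EMOTIONS))  # placeholder targets
loss = -dist.log_prob(target).mean()
```

Treating the per-frame mixture as a distribution rather than a point estimate permits likelihood-based training against annotated mixtures, which is one plausible way such a head could be fit; the exact losses and features used in the paper are not specified here.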
Related papers
- A Unified Spoken Language Model with Injected Emotional-Attribution Thinking for Human-like Interaction [50.05919688888947]
This paper presents a unified spoken language model for emotional intelligence, enhanced by a novel data construction strategy termed Injected Emotional-Attribution Thinking (IEAT). IEAT incorporates user emotional states and their underlying causes into the model's internal reasoning process, enabling emotion-aware reasoning to be internalized rather than treated as explicit supervision. Experiments on the Human-like Spoken Dialogue Systems Challenge (HumDial) Emotional Intelligence benchmark demonstrate that the proposed approach achieves top-ranked performance across emotional trajectory modeling, emotional reasoning, and empathetic response generation.
arXiv Detail & Related papers (2026-01-08T14:07:30Z)
- Evaluation of Generative Models for Emotional 3D Animation Generation in VR [14.647008099740356]
We evaluate emotional 3D animation generative models within a Virtual Reality (VR) environment. We examine perceived emotional quality for three state-of-the-art speech-driven 3D animation methods across two emotions: happiness (high arousal) and neutral (mid arousal). Our results demonstrate that methods explicitly modeling emotions lead to higher recognition accuracy compared to those focusing solely on speech-driven synchrony.
arXiv Detail & Related papers (2025-12-18T01:56:22Z)
- PESTalk: Speech-Driven 3D Facial Animation with Personalized Emotional Styles [28.64103504776712]
PESTalk is a novel method for generating 3D facial animations with personalized emotional styles directly from speech. It overcomes key limitations of existing approaches by introducing a Dual-Stream Emotion Extractor (DSEE) that captures both time- and frequency-domain audio features. It also introduces an Emotional Style Modeling Module (ESMM) that models individual expression patterns based on voiceprint characteristics.
arXiv Detail & Related papers (2025-10-13T13:21:38Z)
- Taming Transformer for Emotion-Controllable Talking Face Generation [61.835295250047196]
We propose a novel method to tackle the emotion-controllable talking face generation task discretely. Specifically, we employ two pre-training strategies to disentangle audio into independent components and quantize videos into combinations of visual tokens. We conduct experiments on the MEAD dataset, controlling the emotion of the generated videos conditioned on multiple emotional audio inputs.
arXiv Detail & Related papers (2025-08-20T02:16:52Z)
- EmoDiffusion: Enhancing Emotional 3D Facial Animation with Latent Diffusion Models [66.67979602235015]
EmoDiffusion is a novel approach that disentangles different emotions in speech to generate rich 3D emotional facial expressions. We capture facial expressions under the guidance of animation experts using LiveLinkFace on an iPhone.
arXiv Detail & Related papers (2025-03-14T02:54:22Z)
- ESARM: 3D Emotional Speech-to-Animation via Reward Model from Automatically-Ranked Demonstrations [22.85503397110192]
This paper proposes a novel 3D speech-to-animation (STA) generation framework to address the shortcomings of existing models.
We introduce a novel STA model coupled with a reward model. This combination enables the decoupling of emotion and content under audio conditions.
We conduct extensive empirical experiments on a benchmark dataset, and the results validate the effectiveness of our proposed framework.
arXiv Detail & Related papers (2024-11-20T07:37:37Z)
- CSTalk: Correlation Supervised Speech-driven 3D Emotional Facial Animation Generation [13.27632316528572]
Speech-driven 3D facial animation technology has been developed for years, but its practical application still falls short of expectations.
Main challenges lie in data limitations, lip alignment, and the naturalness of facial expressions.
This paper proposes a method called CSTalk that models the correlations among different regions of facial movements and supervises the training of the generative model to generate realistic expressions.
arXiv Detail & Related papers (2024-04-29T11:19:15Z)
- GMTalker: Gaussian Mixture-based Audio-Driven Emotional Talking Video Portraits [60.05683966405544]
We present GMTalker, a Gaussian mixture-based emotional talking portrait generation framework. Specifically, we propose a continuous and disentangled latent space, achieving more flexible emotion manipulation. We also introduce a normalizing flow-based motion generator pretrained on a large dataset to generate diverse head poses, blinks, and eyeball movements.
arXiv Detail & Related papers (2023-12-12T19:03:04Z)
- Speech Synthesis with Mixed Emotions [77.05097999561298]
We propose a novel formulation that measures the relative difference between the speech samples of different emotions.
We then incorporate our formulation into a sequence-to-sequence emotional text-to-speech framework.
At run-time, we control the model to produce the desired emotion mixture by manually defining an emotion attribute vector (see the illustrative sketch after the related-papers list below).
arXiv Detail & Related papers (2022-08-11T15:45:58Z)
- EAMM: One-Shot Emotional Talking Face via Audio-Based Emotion-Aware Motion Model [32.19539143308341]
We propose the Emotion-Aware Motion Model (EAMM) to generate one-shot emotional talking faces.
By incorporating the results from both modules, our method can generate satisfactory talking face results on arbitrary subjects.
arXiv Detail & Related papers (2022-05-30T17:39:45Z)
- Enhancing Cognitive Models of Emotions with Representation Learning [58.2386408470585]
We present a novel deep learning-based framework to generate embedding representations of fine-grained emotions.
Our framework integrates a contextualized embedding encoder with a multi-head probing model.
Our model is evaluated on the Empathetic Dialogue dataset and achieves state-of-the-art results for classifying 32 emotions.
arXiv Detail & Related papers (2021-04-20T16:55:15Z)
- Affect2MM: Affective Analysis of Multimedia Content Using Emotion Causality [84.69595956853908]
We present Affect2MM, a learning method for time-series emotion prediction for multimedia content.
Our goal is to automatically capture the varying emotions depicted by characters in real-life human-centric situations and behaviors.
arXiv Detail & Related papers (2021-03-11T09:07:25Z)
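The emotion attribute vector mentioned in the Speech Synthesis with Mixed Emotions entry above can be illustrated with a minimal sketch; this is not that paper's API, and the embedding table, dimensions, and weights below are assumptions used only to show the run-time mixing idea.

```python
import torch

# Assumed: one learned style embedding per emotion category (random stand-ins here).
emotions = ["neutral", "happy", "sad", "angry"]
style_table = torch.randn(len(emotions), 128)

# A manually defined emotion attribute vector, e.g. 70% happy and 30% sad.
attr = torch.tensor([0.0, 0.7, 0.3, 0.0])
attr = attr / attr.sum()          # normalize so the weights form a valid mixture
mixed_style = attr @ style_table  # (128,) conditioning vector for a TTS decoder
```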
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.