Attention-based Region of Interest (ROI) Detection for Speech Emotion
Recognition
- URL: http://arxiv.org/abs/2203.03428v1
- Date: Thu, 3 Mar 2022 22:01:48 GMT
- Title: Attention-based Region of Interest (ROI) Detection for Speech Emotion
Recognition
- Authors: Jay Desai, Houwei Cao, Ravi Shah
- Abstract summary: We propose to use an attention mechanism in deep recurrent neural networks to detect the Regions-of-Interest (ROI) that are more emotionally salient in human emotional speech/video.
We compare the performance of the proposed attention networks with the state-of-the-art LSTM models on a multi-class classification task of recognizing six basic human emotions.
- Score: 4.610756199751138
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic emotion recognition for real-life applications is a challenging
task. Human emotion expressions are subtle, and can be conveyed by a combination
of several emotions. In most existing emotion recognition studies, each
audio utterance/video clip is labelled/classified in its entirety.
However, utterance/clip-level labelling and classification can be too coarse to
capture the subtle intra-utterance/clip temporal dynamics. For example, an
utterance/video clip usually contains only a few emotion-salient regions and
many emotionless regions. In this study, we propose to use an attention mechanism
in deep recurrent neural networks to detect the Regions-of-Interest (ROI)
that are more emotionally salient in human emotional speech/video, and further
estimate the temporal emotion dynamics by aggregating those emotionally
salient regions-of-interest. We compare the ROI from audio and video and analyse
them. We compare the performance of the proposed attention networks with
the state-of-the-art LSTM models on a multi-class classification task
of recognizing six basic human emotions, and the proposed attention models
exhibit significantly better performance. Furthermore, the attention weight
distribution can be used to interpret how an utterance can be expressed as a
mixture of possible emotions.
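The idea in the abstract, scoring each time step, turning the scores into attention weights, and aggregating the frames by those weights, can be illustrated with a small sketch. This is not the authors' model: the frame features stand in for recurrent hidden states, the `query` vector stands in for learned attention parameters, and the rule "a frame is a region of interest if its weight exceeds the uniform level 1/T" is an assumption made here for illustration only.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    frames: List[List[float]], query: List[float]
) -> Tuple[List[float], List[float]]:
    """Return (attention weights, attention-weighted sum of the frames).

    frames: T frame-level feature vectors (stand-ins for recurrent hidden states).
    query:  scoring vector (hypothetical; it would be learned in a real model).
    """
    # Dot-product score per frame, then normalise the scores over time.
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, frames)) for d in range(dim)]
    return weights, pooled


def regions_of_interest(weights: List[float]) -> List[int]:
    """Frame indices whose weight exceeds the uniform level 1/T (assumed rule)."""
    uniform = 1.0 / len(weights)
    return [t for t, w in enumerate(weights) if w > uniform]


# Toy utterance: five frames with two-dimensional features (made-up values).
frames = [[0.1, 0.0], [2.0, 1.5], [1.8, 2.0], [0.0, 0.2], [0.1, 0.1]]
query = [1.0, 1.0]
weights, pooled = attention_pool(frames, query)
roi = regions_of_interest(weights)
```

In this toy input, the second and third frames receive most of the attention mass, so they are the ones flagged as regions of interest, and `pooled` is dominated by their features. In a trained network the pooled vector would feed an emotion classifier.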
Related papers
- ECR-Chain: Advancing Generative Language Models to Better Emotion-Cause Reasoners through Reasoning Chains [61.50113532215864]
Causal Emotion Entailment (CEE) aims to identify the causal utterances in a conversation that stimulate the emotions expressed in a target utterance.
Current works in CEE mainly focus on modeling semantic and emotional interactions in conversations.
We introduce a step-by-step reasoning method, Emotion-Cause Reasoning Chain (ECR-Chain), to infer the stimulus from the target emotional expressions in conversations.
arXiv Detail & Related papers (2024-05-17T15:45:08Z)
- Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate a speech according to a given emotion while preserving non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z)
- Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity.
Our model outperforms the baseline models in understanding and rendering emotions.
arXiv Detail & Related papers (2023-12-19T08:47:50Z)
- In-the-wild Speech Emotion Conversion Using Disentangled Self-Supervised Representations and Neural Vocoder-based Resynthesis [15.16865739526702]
We introduce a methodology that uses self-supervised networks to disentangle the lexical, speaker, and emotional content of the utterance.
We then use a HiFiGAN vocoder to resynthesise the disentangled representations to a speech signal of the targeted emotion.
Results reveal that the proposed approach is aptly conditioned on the emotional content of input speech and is capable of synthesising natural-sounding speech for a target emotion.
arXiv Detail & Related papers (2023-06-02T21:02:51Z)
- Speech Synthesis with Mixed Emotions [77.05097999561298]
We propose a novel formulation that measures the relative difference between the speech samples of different emotions.
We then incorporate our formulation into a sequence-to-sequence emotional text-to-speech framework.
At run-time, we control the model to produce the desired emotion mixture by manually defining an emotion attribute vector.
arXiv Detail & Related papers (2022-08-11T15:45:58Z)
- Emotion Intensity and its Control for Emotional Voice Conversion [77.05097999561298]
Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity.
In this paper, we aim to explicitly characterize and control the intensity of emotion.
We propose to disentangle the speaker style from linguistic content and encode the speaker style into a style embedding in a continuous space that forms the prototype of emotion embedding.
arXiv Detail & Related papers (2022-01-10T02:11:25Z)
- Multi-Cue Adaptive Emotion Recognition Network [4.570705738465714]
We propose a new deep learning approach for emotion recognition based on adaptive multi-cues.
We compare the proposed approach with state-of-the-art approaches on the CAER-S dataset.
arXiv Detail & Related papers (2021-11-03T15:08:55Z)
- Emotion Recognition under Consideration of the Emotion Component Process Model [9.595357496779394]
We use the emotion component process model (CPM) by Scherer (2005) to explain emotion communication.
CPM states that emotions are a coordinated process of various subcomponents, in reaction to an event, namely the subjective feeling, the cognitive appraisal, the expression, a physiological bodily reaction, and a motivational action tendency.
We find that emotions on Twitter are predominantly expressed by event descriptions or subjective reports of the feeling, while in literature, authors prefer to describe what characters do, and leave the interpretation to the reader.
arXiv Detail & Related papers (2021-07-27T15:53:25Z)
- A Circular-Structured Representation for Visual Emotion Distribution Learning [82.89776298753661]
We propose a well-grounded circular-structured representation to utilize the prior knowledge for visual emotion distribution learning.
To be specific, we first construct an Emotion Circle to unify any emotional state within it.
On the proposed Emotion Circle, each emotion distribution is represented with an emotion vector, which is defined with three attributes.
arXiv Detail & Related papers (2021-06-23T14:53:27Z)
- Detecting Emotion Primitives from Speech and their use in discerning Categorical Emotions [16.886826928295203]
Emotion plays an essential role in human-to-human communication, enabling us to convey feelings such as happiness, frustration, and sincerity.
This work investigated how emotion primitives can be used to detect categorical emotions such as happiness, disgust, contempt, anger, and surprise from neutral speech.
Results indicated that arousal, followed by dominance was a better detector of such emotions.
arXiv Detail & Related papers (2020-01-31T03:11:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.