Mouth Articulation-Based Anchoring for Improved Cross-Corpus Speech Emotion Recognition
- URL: http://arxiv.org/abs/2412.19909v1
- Date: Fri, 27 Dec 2024 20:00:45 GMT
- Title: Mouth Articulation-Based Anchoring for Improved Cross-Corpus Speech Emotion Recognition
- Authors: Shreya G. Upadhyay, Ali N. Salman, Carlos Busso, Chi-Chun Lee
- Abstract summary: Cross-corpus speech emotion recognition (SER) plays a vital role in numerous practical applications.
Traditional approaches to cross-corpus emotion transfer often concentrate on adapting acoustic features to align with different corpora, domains, or labels.
This study adopts a novel contrastive approach by focusing on emotion-specific articulatory gestures as the core elements for analysis.
- Score: 37.57745459245874
- Abstract: Cross-corpus speech emotion recognition (SER) plays a vital role in numerous practical applications. Traditional approaches to cross-corpus emotion transfer often concentrate on adapting acoustic features to align with different corpora, domains, or labels. However, acoustic features are inherently variable and error-prone due to factors like speaker differences, domain shifts, and recording conditions. To address these challenges, this study adopts a novel contrastive approach that focuses on emotion-specific articulatory gestures as the core elements for analysis. By shifting the emphasis to these more stable and consistent articulatory gestures, we aim to enhance emotion transfer learning in SER tasks. Our research leverages the CREMA-D and MSP-IMPROV corpora as benchmarks, revealing valuable insights into the commonality and reliability of these articulatory gestures. The findings highlight the potential of mouth articulatory gestures as a better constraint for improving emotion recognition across different settings or domains.
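To make the anchoring idea concrete, below is a minimal sketch of how emotion-specific articulatory anchoring could be framed as a supervised contrastive objective, with mouth-articulation embeddings from one corpus acting as anchors for same-emotion samples from another. The function name, embedding dimensions, and loss form are illustrative assumptions, not the authors' exact formulation.

```python
# Hedged sketch only: encoders, dimensions, and loss form are assumptions.
import torch
import torch.nn.functional as F

def articulatory_anchor_loss(anchor_emb, sample_emb, anchor_labels, sample_labels,
                             temperature=0.1):
    """Pull cross-corpus samples toward mouth-articulation anchors that share
    the same emotion label (supervised contrastive style)."""
    anchor_emb = F.normalize(anchor_emb, dim=-1)   # (Na, d) anchors, e.g. from CREMA-D
    sample_emb = F.normalize(sample_emb, dim=-1)   # (Ns, d) samples, e.g. from MSP-IMPROV
    logits = sample_emb @ anchor_emb.t() / temperature            # (Ns, Na) similarities
    pos_mask = (sample_labels[:, None] == anchor_labels[None, :]).float()
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # average log-probability over each sample's same-emotion anchors
    loss = -(pos_mask * log_prob).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return loss.mean()
```

In such a setup, anchor_emb might come from a mouth-articulation encoder (e.g., lip-landmark or articulatory-inversion features) and sample_emb from an acoustic SER encoder on the other corpus; both choices are assumptions here.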
Related papers
- Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate speech according to a given emotion while preserving non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z)
- MSAC: Multiple Speech Attribute Control Method for Reliable Speech Emotion Recognition [7.81011775615268]
We introduce MSAC-SERNet, a novel unified SER framework capable of simultaneously handling both single-corpus and cross-corpus SER.
Considering information overlap between various speech attributes, we propose a novel learning paradigm based on correlations of different speech attributes.
Experiments on both single-corpus and cross-corpus SER scenarios indicate that MSAC-SERNet achieves superior performance compared to state-of-the-art SER approaches.
arXiv Detail & Related papers (2023-08-08T03:43:24Z)
- Empirical Interpretation of the Relationship Between Speech Acoustic Context and Emotion Recognition [28.114873457383354]
Speech emotion recognition (SER) is vital for obtaining emotional intelligence and understanding the contextual meaning of speech.
In practice, speech emotions are treated as single labels over an acoustic segment for a given time duration.
This research explores the implication of acoustic context and phone boundaries on local markers for SER using an attention-based approach.
arXiv Detail & Related papers (2023-06-30T09:21:48Z)
- Attention-based Region of Interest (ROI) Detection for Speech Emotion Recognition [4.610756199751138]
We propose to use an attention mechanism in deep recurrent neural networks to detect the Regions-of-Interest (ROI) that are more emotionally salient in human emotional speech/video.
We compare the performance of the proposed attention networks with state-of-the-art LSTM models on the multi-class classification task of recognizing six basic human emotions (a minimal sketch of this frame-level attention pooling follows the list below).
arXiv Detail & Related papers (2022-03-03T22:01:48Z)
- Emotion Intensity and its Control for Emotional Voice Conversion [77.05097999561298]
Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity.
In this paper, we aim to explicitly characterize and control the intensity of emotion.
We propose to disentangle the speaker style from linguistic content and encode the speaker style into a style embedding in a continuous space that forms the prototype of emotion embedding.
arXiv Detail & Related papers (2022-01-10T02:11:25Z)
- Acted vs. Improvised: Domain Adaptation for Elicitation Approaches in Audio-Visual Emotion Recognition [29.916609743097215]
Key challenges in developing generalized automatic emotion recognition systems include scarcity of labeled data and lack of gold-standard references.
In this work, we regard the emotion elicitation approach as domain knowledge, and explore domain transfer learning techniques on emotional utterances.
arXiv Detail & Related papers (2021-04-05T15:59:31Z)
- Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability [82.39099867188547]
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years.
We propose a new interactive training paradigm for ETTS, denoted as i-ETTS.
We formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization.
arXiv Detail & Related papers (2021-04-03T13:52:47Z)
- Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset [84.53659233967225]
Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity.
We propose a novel framework based on variational auto-encoding Wasserstein generative adversarial network (VAW-GAN)
We show that the proposed framework achieves remarkable performance by consistently outperforming the baseline framework.
arXiv Detail & Related papers (2020-10-28T07:16:18Z)
- COSMIC: COmmonSense knowledge for eMotion Identification in Conversations [95.71018134363976]
We propose COSMIC, a new framework that incorporates different elements of commonsense such as mental states, events, and causal relations.
We show that COSMIC achieves new state-of-the-art results for emotion recognition on four different benchmark conversational datasets.
arXiv Detail & Related papers (2020-10-06T15:09:38Z)
- Detecting Emotion Primitives from Speech and their use in discerning Categorical Emotions [16.886826928295203]
Emotion plays an essential role in human-to-human communication, enabling us to convey feelings such as happiness, frustration, and sincerity.
This work investigated how emotion primitives can be used to detect categorical emotions such as happiness, disgust, contempt, anger, and surprise from neutral speech.
Results indicated that arousal, followed by dominance, was a better detector of such emotions (a sketch of this two-stage primitives-to-categories idea follows the list below).
arXiv Detail & Related papers (2020-01-31T03:11:24Z)
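As referenced in the attention-based ROI detection entry above, the following is a minimal sketch of frame-level attention pooling over a recurrent encoder, which weights emotionally salient frames before classification. The layer sizes, feature dimension, and six-class head are assumptions for illustration, not that paper's exact architecture.

```python
# Illustrative sketch of attention pooling over LSTM outputs; sizes are assumptions.
import torch
import torch.nn as nn

class AttentiveEmotionClassifier(nn.Module):
    def __init__(self, feat_dim=40, hidden=128, num_emotions=6):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)           # scores each frame's salience
        self.head = nn.Linear(2 * hidden, num_emotions)

    def forward(self, x):                               # x: (batch, frames, feat_dim)
        h, _ = self.lstm(x)                              # (batch, frames, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)     # frame-level attention weights
        pooled = (weights * h).sum(dim=1)                # attended summary of salient frames
        return self.head(pooled), weights.squeeze(-1)    # emotion logits, per-frame salience
```

The returned per-frame weights can be inspected to see which regions of the utterance the model treats as emotionally salient.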
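As referenced in the emotion-primitives entry above, here is a hedged sketch of the two-stage idea of first regressing dimensional primitives (arousal, valence, dominance) from utterance-level features and then detecting categorical emotions from those primitives. The module shapes and dimensions are illustrative assumptions.

```python
# Hedged two-stage sketch: primitives first, categories second; shapes are assumptions.
import torch
import torch.nn as nn

class PrimitiveRegressor(nn.Module):
    """Maps utterance-level acoustic features to arousal, valence, and dominance."""
    def __init__(self, feat_dim=88):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, x):
        return self.net(x)                    # (batch, 3) primitive estimates

class CategoricalFromPrimitives(nn.Module):
    """Detects categorical emotions (e.g., happiness, anger) from the 3 primitives."""
    def __init__(self, num_emotions=5):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, num_emotions))

    def forward(self, primitives):
        return self.net(primitives)           # logits over categorical emotions
```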