Emotion-Disentangled Embedding Alignment for Noise-Robust and Cross-Corpus Speech Emotion Recognition
- URL: http://arxiv.org/abs/2510.09072v1
- Date: Fri, 10 Oct 2025 07:17:07 GMT
- Title: Emotion-Disentangled Embedding Alignment for Noise-Robust and Cross-Corpus Speech Emotion Recognition
- Authors: Upasana Tiwari, Rupayan Chakraborty, Sunil Kumar Kopparapu,
- Abstract summary: This paper introduces a two-step approach to enhance robustness and generalization of speech emotion recognition models.<n> improved performance under challenging conditions demonstrates the effectiveness of the proposed method.
- Score: 7.243987016933212
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Effectiveness of speech emotion recognition in real-world scenarios is often hindered by noisy environments and variability across datasets. This paper introduces a two-step approach to enhance the robustness and generalization of speech emotion recognition models through improved representation learning. First, our model employs EDRL (Emotion-Disentangled Representation Learning) to extract class-specific discriminative features while preserving shared similarities across emotion categories. Next, MEA (Multiblock Embedding Alignment) refines these representations by projecting them into a joint discriminative latent subspace that maximizes covariance with the original speech input. The learned EDRL-MEA embeddings are subsequently used to train an emotion classifier using clean samples from publicly available datasets, and are evaluated on unseen noisy and cross-corpus speech samples. Improved performance under these challenging conditions demonstrates the effectiveness of the proposed method.
Related papers
- EmotionRankCLAP: Bridging Natural Language Speaking Styles and Ordinal Speech Emotion via Rank-N-Contrast [24.24520233247336]
We introduce EmotionRankCLAP, a supervised contrastive learning approach that uses dimensional attributes of emotional speech and natural language prompts to jointly capture fine-grained emotion variations.<n>EmotionRankCLAP outperforms existing emotion-CLAP methods in modeling emotion ordinality across modalities, measured via a cross-modal retrieval task.
arXiv Detail & Related papers (2025-05-29T17:56:55Z) - VAEmo: Efficient Representation Learning for Visual-Audio Emotion with Knowledge Injection [50.57849622045192]
We propose VAEmo, an efficient framework for emotion-centric joint VA representation learning with external knowledge injection.<n>VAEmo achieves state-of-the-art performance with a compact design, highlighting the benefit of unified cross-modal encoding and emotion-aware semantic guidance.
arXiv Detail & Related papers (2025-05-05T03:00:51Z) - A Cross-Corpus Speech Emotion Recognition Method Based on Supervised Contrastive Learning [0.0]
This paper proposes a cross-corpus speech emotion recognition method based on supervised contrast learning.<n>The method employs a two-stage fine-tuning process: first, the self-supervised speech representation model is fine-tuned using supervised contrastive learning on multiple speech emotion datasets.<n>The experimental results show that the WavLM-based model achieved unweighted accuracy (UA) of 77.41% on the IEMOCAP dataset and 96.49% on the CASIA dataset.
arXiv Detail & Related papers (2024-11-25T07:03:31Z) - Layer-Wise Analysis of Self-Supervised Acoustic Word Embeddings: A Study
on Speech Emotion Recognition [54.952250732643115]
We study Acoustic Word Embeddings (AWEs), a fixed-length feature derived from continuous representations, to explore their advantages in specific tasks.
AWEs have previously shown utility in capturing acoustic discriminability.
Our findings underscore the acoustic context conveyed by AWEs and showcase the highly competitive Speech Emotion Recognition accuracies.
arXiv Detail & Related papers (2024-02-04T21:24:54Z) - Revealing Emotional Clusters in Speaker Embeddings: A Contrastive
Learning Strategy for Speech Emotion Recognition [27.098672790099304]
It has been assumed that emotion information is indirectly embedded within speaker embeddings, leading to their under-utilization.
Our study reveals a direct and useful link between emotion and state-of-the-art speaker embeddings in the form of intra-speaker clusters.
We introduce a novel contrastive pretraining approach applied to emotion-unlabeled data for speech emotion recognition.
arXiv Detail & Related papers (2024-01-19T20:31:53Z) - Unifying the Discrete and Continuous Emotion labels for Speech Emotion
Recognition [28.881092401807894]
In paralinguistic analysis for emotion detection from speech, emotions have been identified with discrete or dimensional (continuous-valued) labels.
We propose a model to jointly predict continuous and discrete emotional attributes.
arXiv Detail & Related papers (2022-10-29T16:12:31Z) - Multimodal Emotion Recognition using Transfer Learning from Speaker
Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z) - An Attribute-Aligned Strategy for Learning Speech Representation [57.891727280493015]
We propose an attribute-aligned learning strategy to derive speech representation that can flexibly address these issues by attribute-selection mechanism.
Specifically, we propose a layered-representation variational autoencoder (LR-VAE), which factorizes speech representation into attribute-sensitive nodes.
Our proposed method achieves competitive performances on identity-free SER and a better performance on emotionless SV.
arXiv Detail & Related papers (2021-06-05T06:19:14Z) - Unsupervised low-rank representations for speech emotion recognition [78.38221758430244]
We examine the use of linear and non-linear dimensionality reduction algorithms for extracting low-rank feature representations for speech emotion recognition.
We report speech emotion recognition (SER) results for learned representations on two databases using different classification methods.
arXiv Detail & Related papers (2021-04-14T18:30:58Z) - An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and
Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
Deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.