A Cross-Corpus Speech Emotion Recognition Method Based on Supervised Contrastive Learning
- URL: http://arxiv.org/abs/2411.19803v1
- Date: Mon, 25 Nov 2024 07:03:31 GMT
- Title: A Cross-Corpus Speech Emotion Recognition Method Based on Supervised Contrastive Learning
- Authors: Xiang Minjie
- Abstract summary: This paper proposes a cross-corpus speech emotion recognition method based on supervised contrastive learning.
The method employs a two-stage fine-tuning process: first, the self-supervised speech representation model is fine-tuned using supervised contrastive learning on multiple speech emotion datasets; then, the classifier is fine-tuned on the target dataset.
The experimental results show that the WavLM-based model achieved an unweighted accuracy (UA) of 77.41% on the IEMOCAP dataset and 96.49% on the CASIA dataset.
- Abstract: Research on Speech Emotion Recognition (SER) often faces challenges such as the lack of large-scale public datasets and limited generalization capability when dealing with data from different distributions. To address this problem, this paper proposes a cross-corpus speech emotion recognition method based on supervised contrastive learning. The method employs a two-stage fine-tuning process: first, the self-supervised speech representation model is fine-tuned using supervised contrastive learning on multiple speech emotion datasets; then, the classifier is fine-tuned on the target dataset. The experimental results show that the WavLM-based model achieved an unweighted accuracy (UA) of 77.41% on the IEMOCAP dataset and 96.49% on the CASIA dataset, outperforming state-of-the-art results on both datasets.
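The core of the first stage is a supervised contrastive (SupCon) objective computed over pooled utterance embeddings, pulling together utterances that share an emotion label across corpora. The paper does not include code, so the following is only a minimal PyTorch sketch of such a loss under common assumptions (cosine similarity, a temperature of 0.07, one pooled embedding per utterance); the second-stage classifier fine-tuning on the target dataset would then use an ordinary cross-entropy loss on top of the adapted encoder.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings: torch.Tensor,
                                labels: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """SupCon-style loss over a batch of pooled utterance embeddings (sketch).

    embeddings: (N, D) pooled speech representations, e.g. mean-pooled WavLM states.
    labels:     (N,) integer emotion labels drawn from the mixed training corpora.
    """
    z = F.normalize(embeddings, dim=1)                     # work in cosine-similarity space
    sim = z @ z.T / temperature                            # (N, N) pairwise similarities

    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask

    sim = sim.masked_fill(self_mask, float("-inf"))        # exclude self-comparisons
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)        # avoid -inf * 0 on the diagonal

    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0                                  # anchors with at least one positive
    loss = -(log_prob * pos_mask.float()).sum(dim=1)[valid] / pos_counts[valid]
    return loss.mean()
```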
Related papers
- EMOVOME: A Dataset for Emotion Recognition in Spontaneous Real-Life Speech [2.1455880234227624]
Spontaneous datasets for Speech Emotion Recognition (SER) are scarce and frequently derived from laboratory environments or staged scenarios.
We developed and publicly released the Emotional Voice Messages (EMOVOME) dataset, including 999 voice messages from real conversations of 100 Spanish speakers on a messaging app.
We evaluated speaker-independent SER models, using acoustic features as a baseline as well as transformer-based models.
arXiv Detail & Related papers (2024-03-04T16:13:39Z) - A Comparative Study of Data Augmentation Techniques for Deep Learning Based Emotion Recognition [11.928873764689458]
We conduct a comprehensive evaluation of popular deep learning approaches for emotion recognition.
We show that long-range dependencies in the speech signal are critical for emotion recognition.
Speed/rate augmentation offers the most robust performance gain across models.
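As an illustration of the speed/rate augmentation that the study found most robust, a common recipe is Kaldi-style speed perturbation: resample the waveform and keep treating it as if it were at the original sample rate, so both tempo and pitch shift slightly. A hedged sketch using librosa follows; the file name and perturbation factors are placeholders, not values from the paper.

```python
import numpy as np
import librosa

def speed_perturb(y: np.ndarray, sr: int, factor: float) -> np.ndarray:
    """Resample by `factor` and keep interpreting the result as sampled at `sr`,
    so playback is `factor` times faster (pitch scales accordingly)."""
    return librosa.resample(y, orig_sr=sr, target_sr=int(round(sr / factor)))

# typical training-time usage: draw a factor near 1.0 for each utterance
y, sr = librosa.load("utterance.wav", sr=16000)      # placeholder file
factor = float(np.random.choice([0.9, 1.0, 1.1]))
y_aug = y if factor == 1.0 else speed_perturb(y, sr, factor)
```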
arXiv Detail & Related papers (2022-11-09T17:27:03Z) - Cluster-level pseudo-labelling for source-free cross-domain facial expression recognition [94.56304526014875]
We propose the first Source-Free Unsupervised Domain Adaptation (SFUDA) method for Facial Expression Recognition (FER).
Our method exploits self-supervised pretraining to learn good feature representations from the target data.
We validate the effectiveness of our method in four adaptation setups, proving that it consistently outperforms existing SFUDA methods when applied to FER.
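"Cluster-level pseudo-labelling" suggests assigning one pseudo-label per cluster of target-domain features rather than per sample. The following is only a rough scikit-learn sketch of that idea, not the paper's actual procedure; the choice of k-means and of labelling each cluster by the classifier's mean confidence are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_pseudo_labels(features: np.ndarray,
                          classifier_probs: np.ndarray,
                          n_classes: int) -> np.ndarray:
    """Cluster target-domain embeddings and give every member of a cluster the
    same pseudo-label (sketch of cluster-level pseudo-labelling).

    features:         (N, D) target embeddings from the self-supervised backbone.
    classifier_probs: (N, C) softmax outputs of the source-trained classifier.
    """
    clusters = KMeans(n_clusters=n_classes, n_init=10).fit_predict(features)
    pseudo = np.empty(len(features), dtype=np.int64)
    for c in range(n_classes):
        members = clusters == c
        if members.any():
            # label the whole cluster with the class of highest average confidence
            pseudo[members] = int(classifier_probs[members].mean(axis=0).argmax())
    return pseudo
```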
arXiv Detail & Related papers (2022-10-11T08:24:50Z) - M2R2: Missing-Modality Robust emotion Recognition framework with iterative data augmentation [6.962213869946514]
We propose Missing-Modality Robust emotion Recognition (M2R2), which trains an emotion recognition model with iterative data augmentation based on a learned common representation.
A Party Attentive Network (PANet) is designed to classify emotions; it tracks all speakers' states and context.
arXiv Detail & Related papers (2022-05-05T09:16:31Z) - Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z) - Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
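Spectrogram augmentation in this context typically means SpecAugment-style time and frequency masking on log-mel inputs. A brief torchaudio sketch of that style of masking is shown below; the mask widths and input shape are illustrative, not taken from the paper.

```python
import torch
import torchaudio.transforms as T

# stand-in log-mel input: (batch, mel bins, frames)
spec = torch.randn(4, 80, 300)

augment = torch.nn.Sequential(
    T.FrequencyMasking(freq_mask_param=15),   # zero out up to 15 consecutive mel bins
    T.TimeMasking(time_mask_param=35),        # zero out up to 35 consecutive frames
)
spec_aug = augment(spec)                      # fresh random masks are drawn on each call
```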
arXiv Detail & Related papers (2021-08-05T10:39:39Z) - Towards Unbiased Visual Emotion Recognition via Causal Intervention [63.74095927462]
We propose a novel Interventional Emotion Recognition Network (IERN) to alleviate the negative effects of dataset bias.
A series of designed tests validate the effectiveness of IERN, and experiments on three emotion benchmarks demonstrate that IERN outperforms other state-of-the-art approaches.
arXiv Detail & Related papers (2021-07-26T10:40:59Z) - MixSpeech: Data Augmentation for Low-resource Automatic Speech Recognition [54.84624870942339]
MixSpeech is a simple yet effective data augmentation method based on mixup for automatic speech recognition (ASR).
We apply MixSpeech on two popular end-to-end speech recognition models including LAS (Listen, Attend and Spell) and Transformer.
Experimental results show that MixSpeech achieves better accuracy than the baseline models without data augmentation.
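MixSpeech applies mixup to the input features of an ASR model: two utterances' features are linearly combined, and the recognition losses against their two transcriptions are combined with the same weight. A hedged sketch of that idea follows; the Beta parameter and the loss/function names are placeholders, not the authors' implementation.

```python
import numpy as np
import torch

def mix_features(feats_a: torch.Tensor, feats_b: torch.Tensor, alpha: float = 0.5):
    """Linearly mix two padded feature tensors of identical shape; return the mix
    and the mixing weight lam used later to combine the two recognition losses."""
    lam = float(np.random.beta(alpha, alpha))
    return lam * feats_a + (1.0 - lam) * feats_b, lam

# sketch of a training step:
#   mixed, lam = mix_features(feats_a, feats_b)
#   loss = lam * asr_loss(model(mixed), text_a) + (1 - lam) * asr_loss(model(mixed), text_b)
```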
arXiv Detail & Related papers (2021-02-25T03:40:43Z) - Contrastive Unsupervised Learning for Speech Emotion Recognition [22.004507213531102]
Speech emotion recognition (SER) is a key technology to enable more natural human-machine communication.
We show that the contrastive predictive coding (CPC) method can learn salient representations from unlabeled datasets.
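Contrastive predictive coding trains an encoder and an autoregressive context network so that the context at time t can identify its own future latent among negatives via an InfoNCE loss. Below is a compact sketch for a single prediction offset using in-batch negatives; the function and variable names are illustrative, not from the cited paper.

```python
import torch
import torch.nn.functional as F

def cpc_infonce_step(context: torch.Tensor,
                     future_z: torch.Tensor,
                     proj: torch.nn.Linear) -> torch.Tensor:
    """InfoNCE for one prediction offset k: each context vector must score its own
    future latent higher than the other utterances' latents in the batch.

    context:  (B, D) autoregressive context vectors c_t.
    future_z: (B, D) encoder latents z_{t+k} aligned with those contexts.
    proj:     linear map playing the role of the step-specific matrix W_k.
    """
    pred = proj(context)                                  # predicted future latents
    logits = pred @ future_z.T                            # (B, B); diagonal entries are positives
    targets = torch.arange(context.size(0), device=context.device)
    return F.cross_entropy(logits, targets)
```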
arXiv Detail & Related papers (2021-02-12T06:06:02Z) - Deep F-measure Maximization for End-to-End Speech Understanding [52.36496114728355]
We propose a differentiable approximation to the F-measure and train the network with this objective using standard backpropagation.
We perform experiments on two standard fairness datasets (Adult and Communities and Crime), as well as on speech-to-intent detection on the ATIS dataset and speech-to-image concept classification on the Speech-COCO dataset.
In all four tasks, the F-measure objective yields improved micro-F1 scores, with absolute gains of up to 8% compared to models trained with the cross-entropy loss.
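One common way to make the F-measure differentiable is to replace hard counts of true/false positives with their expected values under the predicted probabilities and minimise one minus the resulting score. The paper's exact approximation may differ; this is a generic soft-F1 sketch.

```python
import torch

def soft_f1_loss(probs: torch.Tensor, targets: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Differentiable micro-F1 surrogate: expected TP/FP/FN from probabilities.

    probs:   (N, C) predicted class probabilities (e.g. softmax outputs).
    targets: (N, C) one-hot (or multi-hot) ground-truth labels.
    """
    tp = (probs * targets).sum()
    fp = (probs * (1.0 - targets)).sum()
    fn = ((1.0 - probs) * targets).sum()
    f1 = 2.0 * tp / (2.0 * tp + fp + fn + eps)
    return 1.0 - f1                                       # minimise to maximise soft F1
```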
arXiv Detail & Related papers (2020-08-08T03:02:27Z) - A Transfer Learning Method for Speech Emotion Recognition from Automatic Speech Recognition [0.0]
We present a transfer learning method for speech emotion recognition based on a Time-Delay Neural Network architecture.
Using five-fold cross-validation, we achieve significantly higher accuracy than the state of the art.
arXiv Detail & Related papers (2020-08-06T20:37:22Z)