Dawn of the transformer era in speech emotion recognition: closing the valence gap
- URL: http://arxiv.org/abs/2203.07378v4
- Date: Thu, 7 Sep 2023 18:53:43 GMT
- Title: Dawn of the transformer era in speech emotion recognition: closing the valence gap
- Authors: Johannes Wagner, Andreas Triantafyllopoulos, Hagen Wierstorf,
Maximilian Schmitt, Felix Burkhardt, Florian Eyben, Björn W. Schuller
- Abstract summary: We investigate the influence of model size and pre-training data on downstream performance.
We fine-tune several pre-trained variants of wav2vec 2.0 and HuBERT and test cross-corpus generalisation.
Our investigations reveal that transformer-based architectures are more robust to small perturbations compared to a CNN-based baseline.
- Score: 9.514396745161793
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advances in transformer-based architectures pre-trained in a
self-supervised manner have shown great promise in several machine learning
tasks. In the audio domain, such architectures have also been successfully
utilised in the field of speech emotion recognition (SER). However, existing
works have not evaluated the influence of model size and pre-training data on
downstream performance, and have shown limited attention to generalisation,
robustness, fairness, and efficiency. The present contribution conducts a
thorough analysis of these aspects on several pre-trained variants of wav2vec
2.0 and HuBERT that we fine-tuned on the dimensions arousal, dominance, and
valence of MSP-Podcast, while additionally using IEMOCAP and MOSI to test
cross-corpus generalisation. To the best of our knowledge, we obtain the top
performance for valence prediction without use of explicit linguistic
information, with a concordance correlation coefficient (CCC) of .638 on
MSP-Podcast. Furthermore, our investigations reveal that transformer-based
architectures are more robust to small perturbations compared to a CNN-based
baseline and fair with respect to biological sex groups, but not towards
individual speakers. Finally, we are the first to show that their extraordinary
success on valence is based on implicit linguistic information learnt during
fine-tuning of the transformer layers, which explains why they perform on-par
with recent multimodal approaches that explicitly utilise textual information.
Our findings collectively paint the following picture: transformer-based
architectures constitute the new state-of-the-art in SER, but further advances
are needed to mitigate remaining robustness and individual speaker issues. To
make our findings reproducible, we release the best performing model to the
community.
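The CCC reported above can be computed directly from its definition, which combines correlation with agreement in mean and variance. The following is a minimal sketch (the function name and example arrays are illustrative, not taken from the released model):

```python
import numpy as np

def concordance_cc(y_true, y_pred):
    """Concordance correlation coefficient (CCC): 2*cov / (var_t + var_p + (mean_t - mean_p)^2).

    Unlike Pearson correlation, CCC also penalises shifts in mean
    and differences in scale between predictions and targets.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = ((y_true - mean_t) * (y_pred - mean_p)).mean()
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)
```

A perfect prediction yields a CCC of 1, while a constant offset between predictions and targets lowers the score even when the Pearson correlation stays at 1.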
Related papers
- Unveiling and Mitigating Bias in Audio Visual Segmentation [9.427676046134374]
Community researchers have developed a range of advanced audio-visual segmentation models to improve the quality of sounding objects' masks.
While masks created by these models may initially appear plausible, they occasionally exhibit anomalies with incorrect grounding logic.
We attribute this to the models exploiting inherent real-world preferences and distributions, which provide a simpler learning signal than the complex audio-visual grounding itself.
arXiv Detail & Related papers (2024-07-23T16:55:04Z)
- A Layer-Anchoring Strategy for Enhancing Cross-Lingual Speech Emotion Recognition [41.05066959632938]
Cross-lingual speech emotion recognition (SER) is important for a wide range of everyday applications.
We propose a novel strategy called a layer-anchoring mechanism to facilitate emotion transfer in SER tasks.
Our approach is evaluated using two distinct language affective corpora.
arXiv Detail & Related papers (2024-07-06T05:56:55Z)
- Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
- DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification [55.306583814017046]
We present a novel difficulty-aware semantic augmentation (DASA) approach for speaker verification.
DASA generates diversified training samples in speaker embedding space with negligible extra computing cost.
The best result achieves a 14.6% relative reduction in EER metric on CN-Celeb evaluation set.
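A relative reduction in an error metric such as EER is measured against a baseline system; the following minimal sketch uses hypothetical numbers that are illustrative only, not taken from the paper:

```python
def relative_reduction(baseline, improved):
    """Relative reduction of an error metric such as EER,
    expressed as a fraction of the baseline value."""
    return (baseline - improved) / baseline

# a hypothetical baseline EER of 10.0% improved to 8.54%
# corresponds to a 14.6% relative reduction
```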
arXiv Detail & Related papers (2023-10-18T17:07:05Z)
- A Comprehensive Survey on Applications of Transformers for Deep Learning Tasks [60.38369406877899]
Transformer is a deep neural network that employs a self-attention mechanism to comprehend the contextual relationships within sequential data.
Transformer models excel in handling long-range dependencies between input sequence elements and enable parallel processing.
Our survey encompasses the identification of the top five application domains for transformer-based models.
arXiv Detail & Related papers (2023-06-11T23:13:51Z)
- Probing Speech Emotion Recognition Transformers for Linguistic Knowledge [7.81884995637243]
We investigate the extent to which linguistic information is exploited during speech emotion recognition fine-tuning.
We synthesise prosodically neutral speech utterances while varying the sentiment of the text.
Valence predictions of the transformer model are very reactive to positive and negative sentiment content, as well as negations, but not to intensifiers or reducers.
arXiv Detail & Related papers (2022-04-01T12:47:45Z)
- Transformer Uncertainty Estimation with Hierarchical Stochastic Attention [8.95459272947319]
We propose a novel way to enable transformers to have the capability of uncertainty estimation.
This is achieved by learning a hierarchical self-attention that attends to values and a set of learnable centroids.
We empirically evaluate our model on two text classification tasks with both in-domain (ID) and out-of-domain (OOD) datasets.
arXiv Detail & Related papers (2021-12-27T16:43:31Z)
- Multistage linguistic conditioning of convolutional layers for speech emotion recognition [7.482371204083917]
We investigate the effectiveness of deep fusion of text and audio features for categorical and dimensional speech emotion recognition (SER).
We propose a novel, multistage fusion method where the two information streams are integrated in several layers of a deep neural network (DNN).
Experiments on the widely used IEMOCAP and MSP-Podcast databases demonstrate that the two fusion methods clearly outperform a shallow (late) fusion baseline.
arXiv Detail & Related papers (2021-10-13T11:28:04Z)
- An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
We focus on general applications of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z)
- TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech [63.03318307254081]
TERA stands for Transformer Encoder Representations from Alteration.
We use alteration along three axes to pre-train Transformers on a large amount of unlabeled speech.
TERA can be used for speech representations extraction or fine-tuning with downstream models.
arXiv Detail & Related papers (2020-07-12T16:19:00Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled, noisy conditions is among the most challenging and most in-demand tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.