Decoding Emotions: A comprehensive Multilingual Study of Speech Models
for Speech Emotion Recognition
- URL: http://arxiv.org/abs/2308.08713v1
- Date: Thu, 17 Aug 2023 00:30:56 GMT
- Title: Decoding Emotions: A comprehensive Multilingual Study of Speech Models
for Speech Emotion Recognition
- Authors: Anant Singh and Akshat Gupta
- Abstract summary: This article presents a comprehensive benchmark for speech emotion recognition with eight speech representation models and six different languages.
We find that using features from a single optimal layer of a speech model reduces the error rate by 32% on average across seven datasets.
Our probing results indicate that the middle layers of speech models capture the most important emotional information for speech emotion recognition.
- Score: 3.4111723103928173
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in transformer-based speech representation models have
greatly transformed speech processing. However, there has been limited research
conducted on evaluating these models for speech emotion recognition (SER)
across multiple languages and examining their internal representations. This
article addresses these gaps by presenting a comprehensive benchmark for SER
with eight speech representation models and six different languages. We
conducted probing experiments to gain insights into the inner workings of these
models for SER. We find that using features from a single optimal layer of a
speech model reduces the error rate by 32% on average across seven datasets
when compared to systems where features from all layers of speech models are
used. We also achieve state-of-the-art results for German and Persian
languages. Our probing results indicate that the middle layers of speech models
capture the most important emotional information for speech emotion
recognition.
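The single-optimal-layer idea is easy to reproduce in outline. Below is a minimal layer-wise probing sketch, assuming a HuggingFace wav2vec 2.0 checkpoint, mean pooling over time, and a logistic-regression probe per layer; the checkpoint name, pooling, and classifier are illustrative assumptions, not the authors' exact setup.

```python
# Sketch: layer-wise probing of a pre-trained speech model for SER.
# Assumptions (not the paper's exact recipe): wav2vec 2.0 base checkpoint,
# mean pooling over time, one logistic-regression probe per layer.
import torch
import numpy as np
from transformers import AutoFeatureExtractor, AutoModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = AutoModel.from_pretrained("facebook/wav2vec2-base", output_hidden_states=True)
model.eval()

def layerwise_embeddings(waveform_16khz):
    """Return one mean-pooled embedding per transformer layer (plus the CNN output)."""
    inputs = extractor(waveform_16khz, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**inputs).hidden_states  # tuple: (num_layers + 1) x (1, T, D)
    return [h.mean(dim=1).squeeze(0).numpy() for h in hidden_states]

def probe_each_layer(waveforms, labels, train_idx, test_idx):
    """Fit a linear probe on every layer and report per-layer test accuracy."""
    per_utt = [layerwise_embeddings(w) for w in waveforms]
    scores = []
    for layer in range(len(per_utt[0])):
        X = np.stack([feats[layer] for feats in per_utt])
        y = np.asarray(labels)
        clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        scores.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))
    return scores  # the argmax identifies the single "optimal" layer
```

Consistent with the probing finding reported in the abstract, per-layer scores from this kind of probe would be expected to peak around the middle layers rather than at the top of the network.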
Related papers
- Enhancing Indonesian Automatic Speech Recognition: Evaluating Multilingual Models with Diverse Speech Variabilities [9.473861847584843]
We present our research on state-of-the-art speech recognition models, namely Massively Multilingual Speech (MMS) and Whisper.
We investigate the models' predictive ability to transcribe Indonesian speech data across different variability groups.
arXiv Detail & Related papers (2024-10-11T14:07:07Z)
- Adapting WavLM for Speech Emotion Recognition [0.0]
We explore the fine-tuning strategies of the WavLM Large model for the speech emotion recognition task on the MSP Podcast Corpus.
We then sum up our findings and describe the final model we used for submission to Speech Emotion Recognition Challenge 2024.
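For orientation, here is a hedged sketch of one straightforward fine-tuning route using the HuggingFace audio-classification head; the checkpoint name, the eight-class label set, the frozen convolutional front-end, and the learning rate are assumptions for illustration, not the submission's actual strategy.

```python
# Sketch: fine-tuning a WavLM checkpoint for utterance-level emotion classification.
# Assumptions: microsoft/wavlm-large checkpoint, 8 emotion classes, frozen CNN
# feature encoder; illustrative choices only.
import torch
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-large")
model = AutoModelForAudioClassification.from_pretrained("microsoft/wavlm-large", num_labels=8)
model.freeze_feature_encoder()  # common choice: keep the convolutional front-end fixed

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def training_step(batch_waveforms, batch_labels):
    """One optimisation step on a batch of raw 16 kHz waveforms and integer labels."""
    inputs = extractor(batch_waveforms, sampling_rate=16000,
                       return_tensors="pt", padding=True)
    out = model(**inputs, labels=torch.tensor(batch_labels))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```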
arXiv Detail & Related papers (2024-05-07T16:53:42Z)
- M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval [56.49878599920353]
This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.
For non-English image-speech retrieval, we surpass the current state of the art by a wide margin, both when training separate models for each language and with a single model that processes speech in all three languages.
arXiv Detail & Related papers (2022-11-02T14:54:45Z)
- Multilingual Speech Emotion Recognition With Multi-Gating Mechanism and Neural Architecture Search [15.51730246937201]
Speech emotion recognition (SER) classifies audio into emotion categories such as Happy, Angry, Fear, Disgust and Neutral.
This paper proposes a language-specific model that extracts emotional information from multiple pre-trained speech models.
Our model raises the state-of-the-art accuracy by 3% for German and 14.3% for French.
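The gating idea can be sketched as a small fusion module. The PyTorch snippet below is an assumed, minimal version in which learned per-encoder gates weight utterance-level embeddings from several pre-trained speech models before classification; the dimensions, two-encoder setup, and head are illustrative, not the paper's NAS-derived architecture.

```python
# Sketch: gated fusion of utterance embeddings from multiple pre-trained speech models.
# Illustrative only; the paper's multi-gating mechanism is more involved than this.
import torch
import torch.nn as nn

class GatedFusionSER(nn.Module):
    def __init__(self, feat_dims, num_emotions, hidden=256):
        super().__init__()
        # Project each encoder's features to a shared size.
        self.proj = nn.ModuleList([nn.Linear(d, hidden) for d in feat_dims])
        # One scalar gate per encoder, computed from the concatenated features.
        self.gate = nn.Sequential(nn.Linear(sum(feat_dims), len(feat_dims)), nn.Softmax(dim=-1))
        self.classifier = nn.Linear(hidden, num_emotions)

    def forward(self, feats):                                    # feats: list of (batch, dim_i)
        weights = self.gate(torch.cat(feats, dim=-1))            # (batch, num_encoders)
        projected = torch.stack([p(f) for p, f in zip(self.proj, feats)], dim=1)
        fused = (weights.unsqueeze(-1) * projected).sum(dim=1)   # gate-weighted sum
        return self.classifier(fused)

# Example: fuse a 768-d and a 1024-d embedding into 5 emotion classes.
model = GatedFusionSER([768, 1024], num_emotions=5)
logits = model([torch.randn(4, 768), torch.randn(4, 1024)])
```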
arXiv Detail & Related papers (2022-10-31T19:55:33Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Textless Speech Emotion Conversion using Decomposed and Discrete Representations [49.55101900501656]
We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion.
First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units.
Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder.
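A minimal sketch of that three-stage pipeline is shown below; every component is a hypothetical stand-in passed in as a callable (real systems use a speech-unit encoder such as HuBERT units, a unit translation model, prosody predictors, and a neural vocoder), so the interfaces are assumptions rather than the paper's implementation.

```python
# Sketch of the textless emotion-conversion pipeline: decompose, translate units,
# re-predict prosody, then vocode. All components are hypothetical stand-ins.
from dataclasses import dataclass
import torch

@dataclass
class DecomposedSpeech:
    content_units: torch.Tensor   # discrete pseudo-phonetic units
    f0: torch.Tensor              # frame-level pitch contour
    speaker: torch.Tensor         # speaker embedding
    emotion: torch.Tensor         # emotion embedding

def convert_emotion(speech: DecomposedSpeech, target_emotion: torch.Tensor,
                    translate_units, predict_prosody, vocoder) -> torch.Tensor:
    """Translate content units toward the target emotion, re-predict prosody,
    then synthesise a waveform with a neural vocoder."""
    # 1) Translate the content units so unit-level cues match the target emotion.
    new_units = translate_units(speech.content_units, target_emotion)
    # 2) Predict prosodic features (e.g. F0) from the new units,
    #    conditioned on speaker and target emotion.
    new_f0 = predict_prosody(new_units, speech.speaker, target_emotion)
    # 3) The vocoder maps units + prosody + speaker + emotion back to a waveform.
    return vocoder(new_units, new_f0, speech.speaker, target_emotion)
```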
arXiv Detail & Related papers (2021-11-14T18:16:42Z)
- Multimodal Emotion Recognition with High-level Speech and Text Features [8.141157362639182]
We propose a novel cross-representation speech model to perform emotion recognition on wav2vec 2.0 speech features.
We also train a CNN-based model to recognize emotions from text features extracted with Transformer-based models.
Our method is evaluated on the IEMOCAP dataset in a 4-class classification problem.
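As a rough illustration of combining speech and text representations for 4-class emotion recognition, the sketch below late-fuses mean-pooled wav2vec 2.0 features with a BERT [CLS] embedding; the checkpoints, pooling, and linear fusion head are assumptions and differ from the paper's cross-representation and CNN models.

```python
# Sketch: late fusion of speech (wav2vec 2.0) and text (Transformer) features
# for 4-class emotion recognition. Illustrative choices throughout.
import torch
import torch.nn as nn
from transformers import AutoFeatureExtractor, AutoModel, AutoTokenizer

speech_fe = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
speech_enc = AutoModel.from_pretrained("facebook/wav2vec2-base")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_enc = AutoModel.from_pretrained("bert-base-uncased")
fusion_head = nn.Sequential(nn.Linear(768 + 768, 256), nn.ReLU(), nn.Linear(256, 4))

def emotion_logits(waveform_16khz, transcript):
    """Return 4-class emotion logits from a raw waveform and its transcript."""
    with torch.no_grad():
        s_in = speech_fe(waveform_16khz, sampling_rate=16000, return_tensors="pt")
        s_emb = speech_enc(**s_in).last_hidden_state.mean(dim=1)   # (1, 768)
        t_in = tokenizer(transcript, return_tensors="pt")
        t_emb = text_enc(**t_in).last_hidden_state[:, 0]           # [CLS] token, (1, 768)
    return fusion_head(torch.cat([s_emb, t_emb], dim=-1))          # (1, 4) class logits
```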
arXiv Detail & Related papers (2021-09-29T07:08:40Z)
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset of 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that need additional reference audio as input, our model can predict emotion labels directly from the input text and generate more expressive speech conditioned on the emotion embedding.
In the experiments, we first validate the effectiveness of our dataset through an emotion classification task. We then train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
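The publicly released XLSR checkpoint can be used directly as a language-agnostic feature extractor, as in the short sketch below; the checkpoint name and mean pooling are assumptions, and the sketch shows downstream feature extraction rather than the contrastive pretraining itself.

```python
# Sketch: using the cross-lingual wav2vec 2.0 (XLSR) checkpoint as a
# language-agnostic speech feature extractor.
import torch
from transformers import AutoFeatureExtractor, AutoModel

extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-large-xlsr-53")
xlsr = AutoModel.from_pretrained("facebook/wav2vec2-large-xlsr-53")
xlsr.eval()

def xlsr_embedding(waveform_16khz):
    """Mean-pooled XLSR representation; the same model serves any input language."""
    inputs = extractor(waveform_16khz, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        return xlsr(**inputs).last_hidden_state.mean(dim=1)   # (1, 1024)
```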
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
- "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and human perception.
arXiv Detail & Related papers (2020-06-12T06:51:55Z)