Persian Speech Emotion Recognition by Fine-Tuning Transformers
- URL: http://arxiv.org/abs/2402.07326v1
- Date: Sun, 11 Feb 2024 23:23:31 GMT
- Title: Persian Speech Emotion Recognition by Fine-Tuning Transformers
- Authors: Minoo Shayaninasab, Bagher Babaali
- Abstract summary: We present two models, one based on spectrograms and the other on the audio itself, fine-tuned using the shEMO dataset.
These models significantly enhance the accuracy of previous systems, increasing it from approximately 65% to 80%.
To investigate the effect of multilinguality on the fine-tuning process, these same models are fine-tuned twice.
- Score: 1.0152838128195467
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Given the significance of speech emotion recognition, numerous methods have
been developed in recent years to create effective and efficient systems in
this domain. One of these methods involves the use of pretrained transformers,
fine-tuned to address this specific problem, resulting in high accuracy.
Despite extensive discussions and global-scale efforts to enhance these
systems, the application of this innovative and effective approach has received
less attention in the context of Persian speech emotion recognition. In this
article, we review the field of speech emotion recognition and its background,
with an emphasis on the importance of employing transformers in this context.
We present two models, one based on spectrograms and the other on the audio
itself, fine-tuned using the shEMO dataset. These models significantly enhance
the accuracy of previous systems, increasing it from approximately 65% to 80%
on the mentioned dataset. Subsequently, to investigate the effect of
multilinguality on the fine-tuning process, these same models are fine-tuned
twice. First, they are fine-tuned using the English IEMOCAP dataset, and then
they are fine-tuned with the Persian shEMO dataset. This results in an improved
accuracy of 82% for the Persian emotion recognition system.
Keywords: Persian Speech Emotion Recognition, shEMO, Self-Supervised Learning
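The two-stage schedule described in the abstract (fine-tune on a source-language corpus first, then continue on the target-language corpus) can be illustrated with a toy sketch. This is not the authors' implementation: it swaps the pretrained transformers for a tiny logistic-regression classifier and uses synthetic data in place of IEMOCAP and shEMO. All names and values (`train`, `make_data`, the shift, learning rates, and epoch counts) are assumptions chosen for illustration.

```python
import math
import random


def train(weights, data, epochs, lr):
    """Continue SGD on the logistic loss from the given weights; return new weights."""
    w = list(weights)
    for _ in range(epochs):
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for i, xi in enumerate(x):
                w[i] -= lr * (p - y) * xi
    return w


def accuracy(w, data):
    """Fraction of examples whose predicted label matches the true label."""
    correct = 0
    for x, y in data:
        z = sum(wi * xi for wi, xi in zip(w, x))
        correct += int((z > 0) == (y == 1))
    return correct / len(data)


def make_data(rng, n, shift):
    """Synthetic two-class data; `shift` mimics a domain (language) mismatch."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        centre = (2.0 if y else -2.0) + shift
        x = [rng.gauss(centre, 0.5), rng.gauss(centre, 0.5), 1.0]  # last entry is the bias input
        data.append((x, y))
    return data


rng = random.Random(0)
source = make_data(rng, 200, shift=0.0)  # hypothetical stand-in for the source-language corpus
target = make_data(rng, 50, shift=0.5)   # hypothetical stand-in for the smaller target corpus

w_init = [0.0, 0.0, 0.0]
w_stage1 = train(w_init, source, epochs=5, lr=0.1)     # stage 1: fine-tune on source data
w_stage2 = train(w_stage1, target, epochs=5, lr=0.05)  # stage 2: continue from stage-1 weights

print("target accuracy after stage 2:", accuracy(w_stage2, target))
```

The point of the sketch is only the ordering: stage 2 starts from the weights produced by stage 1 rather than from the initial weights, which is what distinguishes sequential fine-tuning from training on the target corpus alone.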
Related papers
- Speech and Text-Based Emotion Recognizer [0.9168634432094885]
We build a balanced corpus from publicly available datasets for speech emotion recognition.
Our best system, a multi-modal speech and text-based model, achieves a combined UA (Unweighted Accuracy) + WA (Weighted Accuracy) of 157.57, compared to the baseline algorithm's 119.66.
arXiv Detail & Related papers (2023-12-10T05:17:39Z)
- SememeASR: Boosting Performance of End-to-End Speech Recognition against Domain and Long-Tailed Data Shift with Sememe Semantic Knowledge [58.979490858061745]
We introduce sememe-based semantic knowledge information to speech recognition.
Our experiments show that sememe information can improve the effectiveness of speech recognition.
In addition, our further experiments show that sememe knowledge can improve the model's recognition of long-tailed data.
arXiv Detail & Related papers (2023-09-04T08:35:05Z)
- Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation [22.38340990398735]
We propose a novel data augmentation method by applying the text-based speech editing model.
The experimental results on code-switching and NER tasks show that our proposed method can significantly outperform the audio splicing and neural TTS based data augmentation systems.
arXiv Detail & Related papers (2023-06-14T15:50:13Z)
- Language-agnostic Code-Switching in Sequence-To-Sequence Speech Recognition [62.997667081978825]
Code-Switching (CS) refers to the phenomenon of alternately using words and phrases from different languages.
We propose a simple yet effective data augmentation in which audio and corresponding labels of different source languages are transcribed.
We show that this augmentation can even improve the model's performance on inter-sentential language switches not seen during training by 5.03% WER.
arXiv Detail & Related papers (2022-10-17T12:15:57Z)
- Emotion Recognition In Persian Speech Using Deep Neural Networks [0.0]
Speech Emotion Recognition (SER) is of great importance in Human-Computer Interaction (HCI).
In this article, we examine various deep learning techniques on the ShEMO dataset.
arXiv Detail & Related papers (2022-04-28T16:02:05Z)
- Probing Speech Emotion Recognition Transformers for Linguistic Knowledge [7.81884995637243]
We investigate the extent to which linguistic information is exploited during speech emotion recognition fine-tuning.
We synthesise prosodically neutral speech utterances while varying the sentiment of the text.
Valence predictions of the transformer model are very reactive to positive and negative sentiment content, as well as negations, but not to intensifiers or reducers.
arXiv Detail & Related papers (2022-04-01T12:47:45Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
- Knowledge Distillation from BERT Transformer to Speech Transformer for Intent Classification [66.62686601948455]
We exploit the scope of the transformer distillation method that is specifically designed for knowledge distillation from a transformer based language model to a transformer based speech model.
We achieve an intent classification accuracy of 99.10% and 88.79% for Fluent speech corpus and ATIS database, respectively.
arXiv Detail & Related papers (2021-08-05T13:08:13Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer achieved a competitive result, with a 22.2% character error rate (CER) and a 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
- MixSpeech: Data Augmentation for Low-resource Automatic Speech Recognition [54.84624870942339]
MixSpeech is a simple yet effective data augmentation method based on mixup for automatic speech recognition (ASR).
We apply MixSpeech on two popular end-to-end speech recognition models including LAS (Listen, Attend and Spell) and Transformer.
Experimental results show that MixSpeech achieves better accuracy than the baseline models without data augmentation.
arXiv Detail & Related papers (2021-02-25T03:40:43Z)
- Towards Emotion Recognition in Hindi-English Code-Mixed Data: A Transformer Based Approach [0.0]
We present a Hinglish dataset labelled for emotion detection.
We highlight a deep learning based approach for detecting emotions in Hindi-English code mixed tweets.
arXiv Detail & Related papers (2021-02-19T14:07:20Z)
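One of the entries above reports results as UA and WA. As a minimal sketch, assuming the definitions common in speech emotion recognition (WA is the overall accuracy across all utterances; UA is the mean of the per-class recalls, so every emotion counts equally regardless of how many utterances it has), the two metrics can be computed as follows. The function names and the toy labels are illustrative and are not taken from any of the listed papers.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all utterances."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean per-class recall: each class contributes equally."""
    total = defaultdict(int)
    hit = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hit[t] += int(t == p)
    return sum(hit[c] / total[c] for c in total) / len(total)


# Toy, imbalanced example: 8 neutral utterances, 2 angry ones.
y_true = ["neutral"] * 8 + ["anger"] * 2
y_pred = ["neutral"] * 8 + ["neutral", "anger"]

wa = weighted_accuracy(y_true, y_pred)    # 9 of 10 correct
ua = unweighted_accuracy(y_true, y_pred)  # mean of 8/8 and 1/2
print(wa, ua)
```

On imbalanced data the two numbers diverge: here the majority class dominates WA, while UA is pulled down by the poorly recognised minority class.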