A Change of Heart: Improving Speech Emotion Recognition through
Speech-to-Text Modality Conversion
- URL: http://arxiv.org/abs/2307.11584v1
- Date: Fri, 21 Jul 2023 13:48:11 GMT
- Title: A Change of Heart: Improving Speech Emotion Recognition through
Speech-to-Text Modality Conversion
- Authors: Zeinab Sadat Taghavi, Ali Satvaty, Hossein Sameti
- Abstract summary: We introduce a modality conversion concept aimed at enhancing emotion recognition performance on the MELD dataset.
We assess our approach through two experiments: first, a method named Modality-Conversion that employs automatic speech recognition (ASR) systems followed by a text classifier; second, a method named Modality-Conversion++ that assumes perfect ASR output.
Our findings indicate that the first method yields substantial results, while the second method outperforms state-of-the-art (SOTA) speech-based approaches in terms of SER weighted-F1 (WF1) score on the MELD dataset.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech Emotion Recognition (SER) is a challenging task. In this paper, we
introduce a modality conversion concept aimed at enhancing emotion recognition
performance on the MELD dataset. We assess our approach through two
experiments: first, a method named Modality-Conversion that employs automatic
speech recognition (ASR) systems, followed by a text classifier; second, we
assume perfect ASR output and investigate the impact of modality conversion on
SER; this method is called Modality-Conversion++. Our findings indicate that
the first method yields substantial results, while the second method
outperforms state-of-the-art (SOTA) speech-based approaches in terms of SER
weighted-F1 (WF1) score on the MELD dataset. This research highlights the
potential of modality conversion for tasks that can be conducted in alternative
modalities.
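As a rough illustration of the Modality-Conversion idea, a minimal sketch of the speech-to-text pipeline might look like the following. The checkpoints named here (openai/whisper-base for ASR and j-hartmann/emotion-english-distilroberta-base for text emotion classification) are off-the-shelf stand-ins, not the models evaluated in the paper.

```python
# Minimal sketch of Modality-Conversion: speech -> ASR -> text emotion
# classifier. Model choices are illustrative stand-ins, not the paper's.
from transformers import pipeline

# Stage 1: automatic speech recognition (any reasonable ASR model would do).
asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")

# Stage 2: an off-the-shelf text emotion classifier.
classifier = pipeline(
    "text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",
)

def recognize_emotion(wav_path: str) -> str:
    """Convert speech to text, then classify the emotion of the transcript."""
    transcript = asr(wav_path)["text"]
    return classifier(transcript)[0]["label"]

print(recognize_emotion("utterance.wav"))
```

Because the emotion decision is made entirely on the transcript, any improvement in ASR quality propagates directly to the recognizer, which is what the Modality-Conversion++ experiment isolates by assuming perfect ASR output.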
Related papers
- Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech Recognition [1.0690007351232649]
We propose a layer-adapted fusion (LAF) model, called Qifusion-Net, which does not require any prior knowledge about the target accent.
Experimental results demonstrate that our proposed methods outperform the baseline with relative reductions of 22.1% and 17.2% in character error rate (CER) across multi-accent test datasets.
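The summary does not specify how the layer-adapted fusion works internally; as a generic sketch of the underlying idea, learning per-layer fusion weights over a shared encoder's hidden states might look like this (all shapes and names hypothetical):

```python
import torch
import torch.nn as nn

class LayerFusion(nn.Module):
    """Fuse hidden states from several encoder layers with learned weights.

    A generic stand-in for the layer-adapted fusion (LAF) idea: the model
    learns how much each encoder layer should contribute, with no prior
    knowledge of the target accent.
    """

    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_states: list[torch.Tensor]) -> torch.Tensor:
        # layer_states: one (batch, time, dim) tensor per encoder layer.
        w = torch.softmax(self.weights, dim=0)
        stacked = torch.stack(layer_states, dim=0)  # (layers, B, T, D)
        return (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)

fusion = LayerFusion(num_layers=12)
states = [torch.randn(2, 50, 256) for _ in range(12)]
fused = fusion(states)  # (2, 50, 256)
```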
arXiv Detail & Related papers (2024-07-03T11:35:52Z)
- AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features.
Experiments with the AIMDiT framework on the public benchmark dataset MELD reveal improvements of 2.34% and 2.87% in the Acc-7 and w-F1 metrics, respectively.
arXiv Detail & Related papers (2024-04-12T11:31:18Z)
- Audio-Visual Compound Expression Recognition Method based on Late Modality Fusion and Rule-based Decision [9.436107335675473]
This paper presents the results of the SUN team for the Compound Expressions Recognition Challenge of the 6th ABAW Competition.
We propose a novel audio-visual method for compound expression recognition.
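As a toy illustration of late modality fusion followed by a rule-based decision (the actual rules and label set in the paper may differ):

```python
import numpy as np

EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]

def predict_compound(audio_probs: np.ndarray, video_probs: np.ndarray) -> str:
    """Equal-weight late fusion of per-modality probabilities, then a simple
    rule: the two strongest basic emotions form the compound label.
    (Illustrative only; the paper's decision rules may differ.)"""
    fused = 0.5 * audio_probs + 0.5 * video_probs
    second, first = np.argsort(fused)[-2:]
    return f"{EMOTIONS[first]} + {EMOTIONS[second]}"

audio = np.array([0.05, 0.05, 0.40, 0.05, 0.05, 0.40])
video = np.array([0.10, 0.05, 0.30, 0.05, 0.10, 0.40])
print(predict_compound(audio, video))  # -> "surprise + fear"
```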
arXiv Detail & Related papers (2024-03-19T12:45:52Z)
- Contextual-Utterance Training for Automatic Speech Recognition [65.4571135368178]
We propose a contextual-utterance training technique which makes use of the previous and future contextual utterances.
Also, we propose a dual-mode contextual-utterance training technique for streaming automatic speech recognition (ASR) systems.
The proposed technique reduces the WER by more than 6% relative and the average last-token emission latency by more than 40 ms.
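A minimal sketch of the basic idea, pairing each utterance with its previous and future neighbours before training (the context sizes and separator token are illustrative, not the paper's exact recipe):

```python
def make_contextual_inputs(utterances, left=1, right=1, sep=" <sep> "):
    """Pair each utterance with its previous and future neighbours -- a rough
    stand-in for contextual-utterance training (context sizes illustrative)."""
    samples = []
    for i, utt in enumerate(utterances):
        prev = utterances[max(0, i - left):i]
        nxt = utterances[i + 1:i + 1 + right]
        samples.append(sep.join(prev + [utt] + nxt))
    return samples

print(make_contextual_inputs(["hi there", "how are you", "fine thanks"]))
# ['hi there <sep> how are you',
#  'hi there <sep> how are you <sep> fine thanks',
#  'how are you <sep> fine thanks']
```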
arXiv Detail & Related papers (2022-10-27T08:10:44Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
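One simple way to induce such a pseudo language is to quantize frame-level features with k-means and collapse repeated tokens; the sketch below assumes that recipe (the actual Wav2Seq pipeline additionally applies subword modeling on top of the cluster ids):

```python
import numpy as np
from sklearn.cluster import KMeans

def induce_pseudo_tokens(features: np.ndarray, n_clusters: int = 32):
    """Quantize frame-level features into discrete cluster ids and collapse
    consecutive repeats -- one simple way to build a 'pseudo language'.
    (Illustrative; Wav2Seq additionally applies subword modeling.)"""
    ids = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)
    return [int(ids[0])] + [int(t) for prev, t in zip(ids, ids[1:]) if t != prev]

frames = np.random.randn(200, 39)  # stand-in for 200 frames of speech features
print(induce_pseudo_tokens(frames)[:20])
```

The resulting token sequences serve as targets for a self-supervised pseudo speech recognition task, so both the encoder and the decoder receive a training signal before any labeled data is seen.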
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset.
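A minimal sketch of late fusion at the probability level, assuming each modality's model outputs per-class probabilities (the mixing weight is illustrative):

```python
import numpy as np

def late_fuse(speech_probs: np.ndarray, text_probs: np.ndarray,
              alpha: float = 0.5) -> int:
    """Combine per-class probabilities from a speech model and a text model,
    then take the argmax. `alpha` is a tunable mixing weight (illustrative)."""
    return int(np.argmax(alpha * speech_probs + (1 - alpha) * text_probs))

speech = np.array([0.1, 0.7, 0.2])  # e.g., P(neutral, happy, angry) from audio
text = np.array([0.2, 0.5, 0.3])    # same classes from a BERT-style text model
print(late_fuse(speech, text))      # -> 1 (happy)
```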
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
- On Prosody Modeling for ASR+TTS based Voice Conversion [82.65378387724641]
In voice conversion, an approach showing promising results in the latest voice conversion challenge (VCC) 2020 is to first use an automatic speech recognition (ASR) model to transcribe the source speech into the underlying linguistic contents.
Such a paradigm, referred to as ASR+TTS, overlooks the modeling of prosody, which plays an important role in speech naturalness and conversion similarity.
We propose to directly predict prosody from the linguistic representation in a target-speaker-dependent manner, referred to as target text prediction (TTP).
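As a hypothetical sketch of the TTP idea, a small network could regress a prosody contour from linguistic hidden states conditioned on a target-speaker embedding (all dimensions and names here are illustrative, not the paper's architecture):

```python
import torch
import torch.nn as nn

class TargetTextProsody(nn.Module):
    """Toy stand-in for target text prediction (TTP): regress a prosody
    value (e.g., log-F0 per token) from linguistic hidden states,
    conditioned on a target-speaker embedding."""

    def __init__(self, text_dim=256, spk_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + spk_dim, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, text_hidden, spk_emb):
        # text_hidden: (B, T, text_dim); spk_emb: (B, spk_dim)
        spk = spk_emb.unsqueeze(1).expand(-1, text_hidden.size(1), -1)
        return self.net(torch.cat([text_hidden, spk], dim=-1)).squeeze(-1)

model = TargetTextProsody()
f0 = model(torch.randn(2, 10, 256), torch.randn(2, 64))  # (2, 10)
```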
arXiv Detail & Related papers (2021-07-20T13:30:23Z)
- Gated Recurrent Fusion with Joint Training Framework for Robust End-to-End Speech Recognition [64.9317368575585]
This paper proposes a gated recurrent fusion (GRF) method with a joint training framework for robust end-to-end ASR.
The GRF algorithm is used to dynamically combine the noisy and enhanced features.
The proposed method achieves a relative character error rate (CER) reduction of 10.04% over the conventional joint enhancement and transformer method.
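A simplified, non-recurrent sketch of the gating idea behind GRF, where a learned sigmoid gate mixes noisy and enhanced features per dimension (the paper's actual module is recurrent and jointly trained):

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Minimal gated fusion: a learned sigmoid gate decides, per dimension,
    how much to trust the noisy vs. the enhanced features. A simplified
    stand-in for the paper's gated recurrent fusion (GRF)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, noisy: torch.Tensor, enhanced: torch.Tensor):
        g = torch.sigmoid(self.gate(torch.cat([noisy, enhanced], dim=-1)))
        return g * noisy + (1 - g) * enhanced

fuse = GatedFusion(dim=80)
out = fuse(torch.randn(4, 100, 80), torch.randn(4, 100, 80))  # (4, 100, 80)
```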
arXiv Detail & Related papers (2020-11-09T08:52:05Z)
- Modulated Fusion using Transformer for Linguistic-Acoustic Emotion Recognition [7.799182201815763]
This paper presents a lightweight yet powerful solution for emotion recognition and sentiment analysis.
We propose two Transformer-based architectures that use modulation to combine linguistic and acoustic inputs, challenging, and sometimes surpassing, the state of the art on a wide range of datasets.
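The exact modulation mechanism is not given in the summary; a FiLM-style sketch in which a pooled acoustic vector scales and shifts the linguistic features conveys the flavor (all names and dimensions hypothetical):

```python
import torch
import torch.nn as nn

class AcousticModulation(nn.Module):
    """FiLM-style modulation as a rough analogue of 'modulated fusion':
    a pooled acoustic vector produces a scale and shift that are applied
    to the linguistic features. The paper's mechanism may differ."""

    def __init__(self, text_dim=256, audio_dim=128):
        super().__init__()
        self.to_scale = nn.Linear(audio_dim, text_dim)
        self.to_shift = nn.Linear(audio_dim, text_dim)

    def forward(self, text_feats, audio_vec):
        # text_feats: (B, T, text_dim); audio_vec: (B, audio_dim)
        scale = self.to_scale(audio_vec).unsqueeze(1)
        shift = self.to_shift(audio_vec).unsqueeze(1)
        return text_feats * (1 + scale) + shift

mod = AcousticModulation()
out = mod(torch.randn(2, 12, 256), torch.randn(2, 128))  # (2, 12, 256)
```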
arXiv Detail & Related papers (2020-10-05T14:46:20Z)
- A Transfer Learning Method for Speech Emotion Recognition from Automatic Speech Recognition [0.0]
We present a transfer learning method for speech emotion recognition based on a Time-Delay Neural Network architecture.
We achieve the highest accuracy, significantly higher than the state of the art, using five-fold cross-validation.
arXiv Detail & Related papers (2020-08-06T20:37:22Z)
- Constrained Variational Autoencoder for improving EEG based Speech Recognition Systems [3.5786621294068377]
We introduce a recurrent neural network (RNN) based variational autoencoder (VAE) model with a new constrained loss function.
We demonstrate improved performance for both continuous and isolated speech recognition systems trained and tested using EEG features generated from raw EEG signals.
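As a generic sketch of a VAE objective augmented with a constraint term (the paper's specific constrained loss for EEG features is not detailed in the summary; the latent-norm penalty below is purely illustrative):

```python
import torch
import torch.nn.functional as F

def constrained_vae_loss(recon, target, mu, logvar, z, beta=1.0, lam=0.1):
    """Standard VAE objective (reconstruction + KL) plus an extra constraint
    term. The constraint here (a latent-norm penalty) is illustrative only;
    the paper's actual constrained loss is not specified in the summary."""
    recon_loss = F.mse_loss(recon, target)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    constraint = z.norm(dim=-1).mean()
    return recon_loss + beta * kl + lam * constraint

z = torch.randn(8, 16)
loss = constrained_vae_loss(torch.randn(8, 32), torch.randn(8, 32),
                            torch.zeros(8, 16), torch.zeros(8, 16), z)
```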
arXiv Detail & Related papers (2020-06-01T06:03:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.