Emotional Speech Recognition with Pre-trained Deep Visual Models
- URL: http://arxiv.org/abs/2204.03561v1
- Date: Wed, 6 Apr 2022 11:27:59 GMT
- Title: Emotional Speech Recognition with Pre-trained Deep Visual Models
- Authors: Waleed Ragheb, Mehdi Mirzapour, Ali Delfardi, Hélène Jacquenet,
Lawrence Carbon
- Abstract summary: We employ the transfer learning capabilities of pre-trained computer vision deep models and adapt them to the speech emotion recognition task.
In our experiments, we use the pre-trained VGG-16 model and test the overall methodology on the Berlin EMO-DB dataset for speaker-independent emotion recognition.
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: In this paper, we propose a new methodology for emotional speech
recognition using visual deep neural network models. We employ the transfer
learning capabilities of pre-trained computer vision deep models and adapt
them to the speech emotion recognition task. To achieve this, we propose a
composite set of acoustic features and a procedure to convert them into
images. In addition, we present a training paradigm for these models that
takes into account the differences between acoustic-based images and regular
ones. In our experiments, we use the pre-trained VGG-16 model and test the
overall methodology on the Berlin EMO-DB dataset for speaker-independent
emotion recognition. We evaluate the proposed model on the full list of
seven emotions, and the results set a new state-of-the-art.
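As a minimal sketch of this pipeline, the snippet below converts a waveform into a 3-channel "acoustic image" and attaches a 7-way head to a pre-trained VGG-16. A plain log-mel spectrogram stands in for the paper's composite feature set, and the head swap is an assumed fine-tuning setup, not the authors' exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio
import torchvision

def speech_to_image(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    """Turn a mono waveform (1, num_samples) into a (1, 3, 224, 224) tensor."""
    # Assumption: log-mel spectrogram instead of the paper's composite features.
    mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=128)(waveform)
    log_mel = torchaudio.transforms.AmplitudeToDB()(mel)
    # Min-max normalize so the values behave like pixel intensities.
    img = (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-8)
    img = F.interpolate(img.unsqueeze(0), size=(224, 224), mode="bilinear", align_corners=False)
    return img.repeat(1, 3, 1, 1)  # replicate the single channel into RGB for VGG

# Pre-trained VGG-16 with its classifier head swapped for EMO-DB's 7 emotions.
model = torchvision.models.vgg16(weights="IMAGENET1K_V1")
model.classifier[6] = nn.Linear(4096, 7)

logits = model(speech_to_image(torch.randn(1, 16000)))  # (1, 7)
```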
Related papers
- Robust Audiovisual Speech Recognition Models with Mixture-of-Experts [67.75334989582709]
We introduce EVA, which leverages a mixture-of-experts design for audiovisual ASR to perform robust speech recognition on "in-the-wild" videos.
We first encode visual information into a sequence of visual tokens and map them into the speech space with a lightweight projection (sketched after this entry).
Experiments show our model achieves state-of-the-art results on three benchmarks.
arXiv Detail & Related papers (2024-09-19T00:08:28Z)
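A rough illustration of the two ideas above: a lightweight projection into the speech space followed by a mixture-of-experts block. The dimensions and the soft (rather than top-k) routing are assumptions, not EVA's actual configuration.

```python
import torch
import torch.nn as nn

class VisualToSpeech(nn.Module):
    """Project visual tokens into the speech embedding space, then refine
    them with a soft mixture-of-experts block (illustrative only)."""
    def __init__(self, visual_dim=1024, speech_dim=768, num_experts=4):
        super().__init__()
        self.proj = nn.Linear(visual_dim, speech_dim)   # lightweight projection
        self.gate = nn.Linear(speech_dim, num_experts)  # routing network
        self.experts = nn.ModuleList(
            nn.Linear(speech_dim, speech_dim) for _ in range(num_experts)
        )

    def forward(self, visual_tokens):                   # (batch, seq, visual_dim)
        x = self.proj(visual_tokens)                    # map into speech space
        weights = self.gate(x).softmax(dim=-1)          # (batch, seq, experts)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (b, s, d, E)
        return (outs * weights.unsqueeze(2)).sum(dim=-1)  # expert-weighted sum

speech_like = VisualToSpeech()(torch.randn(2, 49, 1024))  # (2, 49, 768)
```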
- Alleviating Catastrophic Forgetting in Facial Expression Recognition with Emotion-Centered Models [49.3179290313959]
The proposed method, emotion-centered generative replay (ECgr), tackles catastrophic forgetting by integrating synthetic images from generative adversarial networks (sketched after this entry).
ECgr incorporates a quality assurance algorithm to ensure the fidelity of generated images.
The experimental results on four diverse facial expression datasets demonstrate that incorporating images generated by our pseudo-rehearsal method enhances training on both the target and the source datasets.
arXiv Detail & Related papers (2024-04-18T15:28:34Z)
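A schematic of the pseudo-rehearsal step described above; the generator/classifier interfaces and the confidence threshold standing in for ECgr's quality-assurance algorithm are assumptions.

```python
import torch

def replay_batch(real_images, real_labels, generator, classifier,
                 latent_dim=128, num_classes=7, threshold=0.9):
    """Mix GAN-generated 'memories' of earlier emotions into the current batch.
    The confidence filter below is a stand-in for ECgr's quality-assurance
    step; the interfaces and threshold are illustrative assumptions."""
    n = real_images.size(0)
    z = torch.randn(n, latent_dim)
    target = torch.randint(0, num_classes, (n,))
    fake = generator(z, target)                        # class-conditional samples
    with torch.no_grad():
        conf = classifier(fake).softmax(dim=-1)
        conf = conf.gather(1, target.unsqueeze(1)).squeeze(1)
    keep = conf >= threshold                           # drop low-fidelity images
    images = torch.cat([real_images, fake[keep]])
    labels = torch.cat([real_labels, target[keep]])
    return images, labels
```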
- Music Recommendation Based on Facial Emotion Recognition [0.0]
This paper presents a comprehensive approach to enhancing the user experience through the integration of emotion recognition, music recommendation, and explainable AI using Grad-CAM (sketched after this entry).
The proposed methodology utilizes a ResNet50 model trained on the Facial Expression Recognition dataset, consisting of real images of individuals expressing various emotions.
arXiv Detail & Related papers (2024-04-06T15:14:25Z)
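A compact Grad-CAM sketch of the kind used for the explainability component above; the ImageNet checkpoint and the hooked layer are placeholders rather than the paper's trained ResNet50.

```python
import torch
import torchvision

# Placeholder checkpoint: the paper trains on a facial-expression dataset.
model = torchvision.models.resnet50(weights="IMAGENET1K_V2").eval()

store = {}
def save_act(module, inputs, output):
    store["act"] = output                 # activations of the hooked stage

def save_grad(module, grad_in, grad_out):
    store["grad"] = grad_out[0]           # gradients flowing back into it

model.layer4.register_forward_hook(save_act)
model.layer4.register_full_backward_hook(save_grad)

x = torch.randn(1, 3, 224, 224)           # a face crop in the real pipeline
scores = model(x)
scores[0, scores[0].argmax()].backward()  # gradient of the predicted class

weights = store["grad"].mean(dim=(2, 3), keepdim=True)           # channel weights
cam = torch.relu((weights * store["act"]).sum(dim=1)).squeeze()  # activation map
```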
- Unsupervised Representations Improve Supervised Learning in Speech Emotion Recognition [1.3812010983144798]
This study proposes an innovative approach that integrates self-supervised feature extraction with supervised classification for emotion recognition from small audio segments.
In the preprocessing step, we employed a self-supervised feature extractor, based on the Wav2Vec model, to capture acoustic features from audio data.
Then, the output feature maps of the preprocessing step are fed to a custom-designed Convolutional Neural Network (CNN) to perform emotion classification, as sketched after this entry.
arXiv Detail & Related papers (2023-09-22T08:54:06Z)
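A minimal version of the two-stage setup above: a frozen wav2vec 2.0 extractor feeding a small CNN classifier. The checkpoint and the CNN architecture are assumptions.

```python
import torch
import torch.nn as nn
import torchaudio

# Pre-trained wav2vec 2.0 used as a frozen feature extractor (assumed checkpoint).
bundle = torchaudio.pipelines.WAV2VEC2_BASE
wav2vec = bundle.get_model().eval()

class EmotionCNN(nn.Module):
    """Small CNN head over wav2vec feature maps (illustrative architecture)."""
    def __init__(self, feat_dim=768, num_emotions=7):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),            # pool over time
        )
        self.fc = nn.Linear(128, num_emotions)

    def forward(self, feats):                   # (batch, time, feat_dim)
        h = self.conv(feats.transpose(1, 2)).squeeze(-1)
        return self.fc(h)

waveform = torch.randn(1, 16000)                # 1 second of 16 kHz audio
with torch.no_grad():
    feats, _ = wav2vec.extract_features(waveform)
logits = EmotionCNN()(feats[-1])                # use the last transformer layer
```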
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose VATLM (Visual-Audio-Text Language Model), a unified cross-modal representation learning framework.
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task over unified tokens (a toy version follows this entry).
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
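A toy version of the masked unified-token objective described above; the sizes, backbone depth, and masking scheme are assumptions.

```python
import torch
import torch.nn as nn

class UnifiedMaskedPrediction(nn.Module):
    """Modality embeddings are partially masked, and a shared backbone
    predicts the discrete 'unified token' at each masked position."""
    def __init__(self, dim=256, vocab=500):
        super().__init__()
        self.mask_emb = nn.Parameter(torch.zeros(dim))
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(dim, vocab)        # predicts unified token ids

    def forward(self, embeddings, mask, targets):
        x = embeddings.clone()
        x[mask] = self.mask_emb                  # replace masked frames
        logits = self.head(self.backbone(x))
        return nn.functional.cross_entropy(logits[mask], targets[mask])

emb = torch.randn(2, 50, 256)                    # fused modality embeddings
mask = torch.rand(2, 50) < 0.15                  # 15% of positions masked
targets = torch.randint(0, 500, (2, 50))         # unified token ids
loss = UnifiedMaskedPrediction()(emb, mask, targets)
```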
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
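A minimal late-fusion sketch in the spirit of the framework above, assuming speech and text embeddings have already been extracted by the transfer-learned models; the dimensions are placeholders.

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Classify each modality separately and average the logits; the
    per-modality heads and embedding sizes are illustrative assumptions."""
    def __init__(self, speech_dim=512, text_dim=768, num_emotions=4):
        super().__init__()
        self.speech_head = nn.Linear(speech_dim, num_emotions)
        self.text_head = nn.Linear(text_dim, num_emotions)

    def forward(self, speech_emb, text_emb):
        # Late fusion: combine modality decisions, not modality features.
        return (self.speech_head(speech_emb) + self.text_head(text_emb)) / 2

logits = LateFusion()(torch.randn(8, 512), torch.randn(8, 768))  # (8, 4)
```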
- MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition [118.73025093045652]
We propose a pre-training model, MEmoBERT, for multimodal emotion recognition.
Unlike the conventional "pre-train, finetune" paradigm, we propose a prompt-based method that reformulates the downstream emotion classification task as a masked text prediction.
Our proposed MEmoBERT significantly enhances emotion recognition performance.
arXiv Detail & Related papers (2021-10-27T09:57:00Z)
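An illustration of the prompt-based reformulation above: emotion classification becomes masked-word prediction over a small set of label words. The template, label words, and the bert-base-uncased backbone are assumptions, not MEmoBERT's.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

label_words = ["happy", "sad", "angry", "neutral"]  # hypothetical label set
utterance = "i am so glad you came by today"        # e.g. an ASR transcript
prompt = f"{utterance} . the emotion is {tok.mask_token} ."

inputs = tok(prompt, return_tensors="pt")
mask_pos = (inputs.input_ids == tok.mask_token_id).nonzero()[0, 1]
logits = mlm(**inputs).logits[0, mask_pos]          # scores over the vocabulary

# Score only the label words; the best-scoring one is the predicted emotion.
ids = tok.convert_tokens_to_ids(label_words)
pred = label_words[int(logits[ids].argmax())]
```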
- An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
We focus on the general application of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z)
- Emotion Recognition from Speech Using Wav2vec 2.0 Embeddings [16.829474982595837]
We propose a transfer learning method for speech emotion recognition.
We combine the output of several layers from the pre-trained model using trainable weights which are learned jointly with the downstream model.
We evaluate our proposed approaches on two standard emotion databases, IEMOCAP and RAVDESS, showing superior performance compared to results in the literature.
arXiv Detail & Related papers (2021-04-08T04:31:58Z)
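The layer-combination idea above reduces to a softmax-weighted sum of hidden states; a minimal sketch follows (the layer count and dimensions are assumed):

```python
import torch
import torch.nn as nn

class WeightedLayerSum(nn.Module):
    """Combine hidden states from several pre-trained-model layers with
    trainable scalar weights, learned jointly with the downstream model."""
    def __init__(self, num_layers=12):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # softmax-normalized

    def forward(self, layer_outputs):       # list of (batch, time, dim) tensors
        w = self.weights.softmax(dim=0)
        return sum(wi * h for wi, h in zip(w, layer_outputs))

# e.g. the 12 transformer-layer outputs of a wav2vec 2.0 base model
combined = WeightedLayerSum()([torch.randn(2, 100, 768) for _ in range(12)])
```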
- Leveraging Recent Advances in Deep Learning for Audio-Visual Emotion Recognition [2.1485350418225244]
Spontaneous multi-modal emotion recognition has been extensively studied for human behavior analysis.
We propose a new deep learning-based approach for audio-visual emotion recognition.
arXiv Detail & Related papers (2021-03-16T15:49:15Z)