Multimodal Emotion Recognition using Transfer Learning from Speaker
Recognition and BERT-based models
- URL: http://arxiv.org/abs/2202.08974v1
- Date: Wed, 16 Feb 2022 00:23:42 GMT
- Title: Multimodal Emotion Recognition using Transfer Learning from Speaker
Recognition and BERT-based models
- Authors: Sarala Padi, Seyed Omid Sadjadi, Dinesh Manocha and Ram D. Sriram
- Abstract summary: We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture dataset.
- Score: 53.31917090073727
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic emotion recognition plays a key role in computer-human interaction
as it has the potential to enrich the next-generation artificial intelligence
with emotional intelligence. It finds applications in customer and/or
representative behavior analysis in call centers, gaming, personal assistants,
and social robots, to mention a few. Therefore, there has been an increasing
demand to develop robust automatic methods to analyze and recognize the various
emotions. In this paper, we propose a neural network-based emotion recognition
framework that uses a late fusion of transfer-learned and fine-tuned models
from speech and text modalities. More specifically, we i) adapt a residual
network (ResNet) based model trained on a large-scale speaker recognition task
using transfer learning along with a spectrogram augmentation approach to
recognize emotions from speech, and ii) use a fine-tuned bidirectional encoder
representations from transformers (BERT) based model to represent and recognize
emotions from the text. The proposed system then combines the ResNet and
BERT-based model scores using a late fusion strategy to further improve the
emotion recognition performance. The proposed multimodal solution addresses the
data scarcity limitation in emotion recognition using transfer learning, data
augmentation, and fine-tuning, thereby improving the generalization performance
of the emotion recognition models. We evaluate the effectiveness of our
proposed multimodal approach on the interactive emotional dyadic motion capture
(IEMOCAP) dataset. Experimental results indicate that both audio and text-based
models improve the emotion recognition performance and that the proposed
multimodal solution achieves state-of-the-art results on the IEMOCAP benchmark.
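To make the described pipeline concrete, below is a minimal Python/NumPy sketch of the two ingredients the abstract mentions: SpecAugment-style spectrogram masking for the speech branch and score-level (late) fusion of the speech and text model posteriors. This is a sketch under stated assumptions, not the authors' implementation; the emotion label set, mask sizes, and fusion weight are hypothetical.

```python
# Minimal sketch only -- not the authors' code. Emotion labels, mask sizes,
# and the fusion weight alpha are illustrative assumptions.
import numpy as np

EMOTIONS = ["angry", "happy", "neutral", "sad"]  # assumed 4-class setup


def spec_augment(spectrogram, max_freq_mask=8, max_time_mask=20, rng=None):
    """Zero out one random frequency band and one random time span."""
    rng = rng or np.random.default_rng()
    spec = spectrogram.copy()
    n_freq, n_time = spec.shape

    f = int(rng.integers(0, max_freq_mask + 1))
    f0 = int(rng.integers(0, max(1, n_freq - f)))
    spec[f0:f0 + f, :] = 0.0          # frequency mask

    t = int(rng.integers(0, max_time_mask + 1))
    t0 = int(rng.integers(0, max(1, n_time - t)))
    spec[:, t0:t0 + t] = 0.0          # time mask
    return spec


def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()


def late_fusion(speech_logits, text_logits, alpha=0.5):
    """Score-level fusion: weighted average of per-modality posteriors."""
    return alpha * softmax(speech_logits) + (1.0 - alpha) * softmax(text_logits)


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Stand-ins for the outputs of the two unimodal classifiers; in the
    # paper these would come from the transfer-learned ResNet (speech)
    # and the fine-tuned BERT model (text).
    speech_logits = rng.normal(size=len(EMOTIONS))
    text_logits = rng.normal(size=len(EMOTIONS))

    fused = late_fusion(speech_logits, text_logits, alpha=0.6)
    print("predicted emotion:", EMOTIONS[int(fused.argmax())])

    # Spectrogram augmentation as applied to a toy mel-spectrogram
    # (64 mel bins x 200 frames) during speech-model training.
    mel_spec = rng.random((64, 200))
    augmented = spec_augment(mel_spec, rng=rng)
    print("masked fraction:", float((augmented == 0).mean()))
```

Random logits stand in here for the ResNet and BERT classifier scores; only the fusion and augmentation logic is illustrated.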
Related papers
- AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features.
Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics.
arXiv Detail & Related papers (2024-04-12T11:31:18Z)
- Deep Imbalanced Learning for Multimodal Emotion Recognition in Conversations [15.705757672984662]
Multimodal Emotion Recognition in Conversations (MERC) is a significant development direction for machine intelligence.
MERC data naturally exhibit an imbalanced distribution of emotion categories, yet researchers often ignore the negative impact of this imbalance on emotion recognition.
We propose the Class Boundary Enhanced Representation Learning (CBERL) model to address the imbalanced distribution of emotion categories in raw data.
We have conducted extensive experiments on the IEMOCAP and MELD benchmark datasets, and the results show that CBERL improves emotion recognition performance.
arXiv Detail & Related papers (2023-12-11T12:35:17Z)
- A Contextualized Real-Time Multimodal Emotion Recognition for Conversational Agents using Graph Convolutional Networks in Reinforcement Learning [0.800062359410795]
We present a novel paradigm for contextualized emotion recognition using a Graph Convolutional Network with Reinforcement Learning (conER-GRL).
Conversations are partitioned into smaller groups of utterances for effective extraction of contextual information.
The system uses Gated Recurrent Units (GRU) to extract multimodal features from these groups of utterances.
arXiv Detail & Related papers (2023-10-24T14:31:17Z)
- EmoDiarize: Speaker Diarization and Emotion Identification from Speech Signals using Convolutional Neural Networks [0.0]
This research explores the integration of deep learning techniques in speech emotion recognition.
It introduces a framework that combines a pre-existing speaker diarization pipeline and an emotion identification model built on a Convolutional Neural Network (CNN).
The proposed model yields an unweighted accuracy of 63%, demonstrating remarkable efficiency in accurately identifying emotional states within speech signals.
arXiv Detail & Related papers (2023-10-19T16:02:53Z)
- A Comparative Study of Data Augmentation Techniques for Deep Learning Based Emotion Recognition [11.928873764689458]
We conduct a comprehensive evaluation of popular deep learning approaches for emotion recognition.
We show that long-range dependencies in the speech signal are critical for emotion recognition.
Speed/rate augmentation offers the most robust performance gain across models.
arXiv Detail & Related papers (2022-11-09T17:27:03Z)
- Interpretability for Multimodal Emotion Recognition using Concept Activation Vectors [0.0]
We address the issue of interpretability for neural networks in the context of emotion recognition using Concept Activation Vectors (CAVs).
We define human-understandable concepts specific to Emotion AI and map them to the widely-used IEMOCAP multimodal database.
We then evaluate the influence of our proposed concepts at multiple layers of the Bi-directional Contextual LSTM (BC-LSTM) network.
arXiv Detail & Related papers (2022-02-02T15:02:42Z)
- MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition [118.73025093045652]
We propose MEmoBERT, a pre-training model for multimodal emotion recognition.
Unlike the conventional "pre-train, finetune" paradigm, we propose a prompt-based method that reformulates the downstream emotion classification task as masked text prediction.
Our proposed MEmoBERT significantly enhances emotion recognition performance.
arXiv Detail & Related papers (2021-10-27T09:57:00Z)
- Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
arXiv Detail & Related papers (2021-08-05T10:39:39Z)
- An Attribute-Aligned Strategy for Learning Speech Representation [57.891727280493015]
We propose an attribute-aligned learning strategy to derive speech representations that can flexibly address these issues via an attribute-selection mechanism.
Specifically, we propose a layered-representation variational autoencoder (LR-VAE), which factorizes speech representation into attribute-sensitive nodes.
Our proposed method achieves competitive performance on identity-free SER and better performance on emotionless SV.
arXiv Detail & Related papers (2021-06-05T06:19:14Z)
- Continuous Emotion Recognition via Deep Convolutional Autoencoder and Support Vector Regressor [70.2226417364135]
It is crucial that the machine be able to recognize the user's emotional state with high accuracy.
Deep neural networks have been used with great success in recognizing emotions.
We present a new model for continuous emotion recognition based on facial expression recognition.
arXiv Detail & Related papers (2020-01-31T17:47:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.