Interpretable Multimodal Emotion Recognition using Hybrid Fusion of
Speech and Image Data
- URL: http://arxiv.org/abs/2208.11868v1
- Date: Thu, 25 Aug 2022 04:43:34 GMT
- Title: Interpretable Multimodal Emotion Recognition using Hybrid Fusion of
Speech and Image Data
- Authors: Puneet Kumar, Sarthak Malik and Balasubramanian Raman
- Abstract summary: A new interpretability technique has been developed to identify the important speech & image features leading to the prediction of particular emotion classes.
The proposed system has achieved 83.29% accuracy for emotion recognition.
- Score: 15.676632465869346
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper proposes a multimodal emotion recognition system based on hybrid
fusion that classifies the emotions depicted by speech utterances and
corresponding images into discrete classes. A new interpretability technique
has been developed to identify the important speech & image features leading to
the prediction of particular emotion classes. The proposed system's
architecture has been determined through intensive ablation studies. It fuses
the speech & image features and then combines speech, image, and intermediate
fusion outputs. The proposed interpretability technique incorporates the divide
& conquer approach to compute Shapley values denoting each speech & image
feature's importance. We have also constructed a large-scale dataset (IIT-R
SIER dataset), consisting of speech utterances, corresponding images, and class
labels, i.e., 'anger,' 'happy,' 'hate,' and 'sad.' The proposed system has
achieved 83.29% accuracy for emotion recognition. The enhanced performance of
the proposed system advocates the importance of utilizing complementary
information from multiple modalities for emotion recognition.
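The abstract describes a hybrid fusion design: speech and image features are first fused into an intermediate representation, and the final classifier then combines the speech, image, and intermediate fusion outputs. The following is a minimal PyTorch sketch of that wiring only; the encoder stand-ins, layer sizes, and the name HybridFusionNet are illustrative assumptions rather than the authors' architecture.

```python
# Minimal sketch of hybrid fusion: intermediate fusion of speech and image
# features, followed by a head that combines speech, image, and fused outputs.
# Feature dimensions and layer sizes are assumptions, not the paper's design.
import torch
import torch.nn as nn

class HybridFusionNet(nn.Module):
    def __init__(self, speech_dim=128, image_dim=512, hidden=256, num_classes=4):
        super().__init__()
        self.speech_proj = nn.Sequential(nn.Linear(speech_dim, hidden), nn.ReLU())
        self.image_proj = nn.Sequential(nn.Linear(image_dim, hidden), nn.ReLU())
        # Intermediate fusion of the two modality representations.
        self.intermediate = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU())
        # Final head sees speech, image, and intermediate-fusion outputs together.
        self.classifier = nn.Linear(3 * hidden, num_classes)

    def forward(self, speech_feat, image_feat):
        s = self.speech_proj(speech_feat)
        v = self.image_proj(image_feat)
        fused = self.intermediate(torch.cat([s, v], dim=-1))
        return self.classifier(torch.cat([s, v, fused], dim=-1))

# Dummy forward pass over the four classes ('anger', 'happy', 'hate', 'sad').
logits = HybridFusionNet()(torch.randn(8, 128), torch.randn(8, 512))  # shape (8, 4)
```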
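The interpretability technique is summarized only as a divide & conquer scheme for computing Shapley values over speech and image features. As a rough illustration of that idea, the sketch below attributes the change in a class score to feature groups, recursing into halves of any group whose joint effect is non-negligible; model_fn, baseline, and the pruning tolerance are assumed placeholders, not the paper's exact procedure.

```python
# Rough divide & conquer attribution: charge the score change to a feature group,
# recurse into halves only when the group's joint effect is non-negligible.
# This is a coarse Shapley-style approximation, not the paper's exact method.
import numpy as np

def dc_importance(model_fn, x, baseline, idx=None, tol=1e-6):
    if idx is None:
        idx = np.arange(len(x))
    masked = baseline.copy()
    masked[idx] = x[idx]                  # switch this group on, keep the rest at baseline
    gain = model_fn(masked) - model_fn(baseline)
    if len(idx) == 1 or abs(gain) < tol:
        return {int(i): gain / len(idx) for i in idx}
    left, right = np.array_split(idx, 2)  # divide ...
    return {**dc_importance(model_fn, x, baseline, left, tol),
            **dc_importance(model_fn, x, baseline, right, tol)}  # ... and conquer

# Toy example: linear scorer over concatenated [speech | image] features.
rng = np.random.default_rng(0)
w = rng.normal(size=16)
importance = dc_importance(lambda z: float(w @ z), rng.normal(size=16), np.zeros(16))
```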
Related papers
- Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive Learning for Multimodal Emotion Recognition [14.639340916340801]
We propose a novel Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive Learning for Multimodal Emotion Recognition (AR-IIGCN) method.
Firstly, we input video, audio, and text features into a multi-layer perceptron (MLP) to map them into separate feature spaces.
Secondly, we build a generator and a discriminator for the three modal features through adversarial representation.
Thirdly, we introduce contrastive graph representation learning to capture intra-modal and inter-modal complementary semantic information.
arXiv Detail & Related papers (2023-12-28T01:57:26Z) - Speech and Text-Based Emotion Recognizer [0.9168634432094885]
We build a balanced corpus from publicly available datasets for speech emotion recognition.
Our best system, a multimodal speech- and text-based model, achieves a combined UA (Unweighted Accuracy) + WA (Weighted Accuracy) score of 157.57, compared to 119.66 for the baseline algorithm.
arXiv Detail & Related papers (2023-12-10T05:17:39Z) - StyleEDL: Style-Guided High-order Attention Network for Image Emotion
Distribution Learning [69.06749934902464]
We propose a style-guided high-order attention network for image emotion distribution learning termed StyleEDL.
StyleEDL interactively learns stylistic-aware representations of images by exploring the hierarchical stylistic information of visual contents.
In addition, we introduce a stylistic graph convolutional network to dynamically generate the content-dependent emotion representations.
arXiv Detail & Related papers (2023-08-06T03:22:46Z) - Interpretable Multimodal Emotion Recognition using Facial Features and
Physiological Signals [16.549488750320336]
It introduces a multimodal framework for emotion understanding by fusing information from visual facial features and physiological signals extracted from the input videos.
An interpretability technique based on permutation importance analysis has also been implemented.
arXiv Detail & Related papers (2023-06-05T12:57:07Z) - Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for
Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z) - VISTANet: VIsual Spoken Textual Additive Net for Interpretable Multimodal Emotion Recognition [21.247650660908484]
This paper proposes a multimodal emotion recognition system, the VIsual Spoken Textual Additive Net (VISTANet).
The VISTANet fuses information from image, speech, and text modalities using a hybrid of early and late fusion.
The KAAP technique computes the contribution of each modality and corresponding features toward predicting a particular emotion class.
arXiv Detail & Related papers (2022-08-24T11:35:51Z) - Multimodal Emotion Recognition using Transfer Learning from Speaker
Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z) - FILIP: Fine-grained Interactive Language-Image Pre-Training [106.19474076935363]
Fine-grained Interactive Language-Image Pre-training achieves finer-level alignment through a cross-modal late interaction mechanism.
We construct a new large-scale image-text pair dataset called FILIP300M for pre-training.
Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-11-09T17:15:38Z) - An Attribute-Aligned Strategy for Learning Speech Representation [57.891727280493015]
We propose an attribute-aligned learning strategy to derive speech representations that can flexibly address these issues via an attribute-selection mechanism.
Specifically, we propose a layered-representation variational autoencoder (LR-VAE), which factorizes speech representation into attribute-sensitive nodes.
Our proposed method achieves competitive performance on identity-free speech emotion recognition (SER) and better performance on emotionless speaker verification (SV).
arXiv Detail & Related papers (2021-06-05T06:19:14Z) - Interpretable Image Emotion Recognition: A Domain Adaptation Approach Using Facial Expressions [11.808447247077902]
This paper proposes a feature-based domain adaptation technique for identifying emotions in generic images.
It addresses the challenge of limited availability of pre-trained models and well-annotated datasets for Image Emotion Recognition (IER).
The proposed IER system demonstrated emotion classification accuracies of 60.98% for the IAPSa dataset, 58.86% for the ArtPhoto dataset, 69.13% for the FI dataset, and 58.06% for the EMOTIC dataset.
arXiv Detail & Related papers (2020-11-17T02:55:16Z) - An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and
Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques; more recently, deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.