EMERSK -- Explainable Multimodal Emotion Recognition with Situational
Knowledge
- URL: http://arxiv.org/abs/2306.08657v1
- Date: Wed, 14 Jun 2023 17:52:37 GMT
- Title: EMERSK -- Explainable Multimodal Emotion Recognition with Situational
Knowledge
- Authors: Mijanur Palash, Bharat Bhargava
- Abstract summary: We present Explainable Multimodal Emotion Recognition with Situational Knowledge (EMERSK)
EMERSK is a general system for human emotion recognition and explanation using visual information.
Our system can handle multiple modalities, including facial expressions, posture, and gait in a flexible and modular manner.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Automatic emotion recognition has recently gained significant attention due
to the growing popularity of deep learning algorithms. One of the primary
challenges in emotion recognition is effectively utilizing the various cues
(modalities) available in the data. Another challenge is providing a proper
explanation of the learning outcome. To address these challenges, we
present Explainable Multimodal Emotion Recognition with Situational Knowledge
(EMERSK), a generalized and modular system for human emotion recognition and
explanation using visual information. Our system can handle multiple
modalities, including facial expressions, posture, and gait, in a flexible and
modular manner. The network consists of different modules that can be added or
removed depending on the available data. We utilize a two-stream network
architecture with convolutional neural networks (CNNs) and encoder-decoder
style attention mechanisms to extract deep features from face images.
Similarly, CNNs and recurrent neural networks (RNNs) with Long Short-term
Memory (LSTM) are employed to extract features from posture and gait data. We
also incorporate deep features from the background as contextual information
for the learning process. The deep features from each module are fused using an
early fusion network. Furthermore, we leverage situational knowledge derived
from the location type and adjective-noun pair (ANP) extracted from the scene,
as well as the spatio-temporal average distribution of emotions, to generate
explanations. Ablation studies demonstrate that each sub-network can
independently perform emotion recognition, and combining them in a multimodal
approach significantly improves overall recognition performance. Extensive
experiments conducted on various benchmark datasets, including GroupWalk,
validate the superior performance of our approach compared to other
state-of-the-art methods.
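To make the modular, early-fusion design described in the abstract concrete, below is a minimal PyTorch-style sketch. The backbones, feature dimensions, attention block, and the four-class output head are illustrative assumptions for exposition only; they are not the authors' implementation, and the situational-knowledge explanation module is not shown.

```python
# Illustrative sketch of an EMERSK-style modular early-fusion network (assumptions only).
import torch
import torch.nn as nn

class FaceStream(nn.Module):
    """CNN with a simple encoder-decoder style spatial attention over face images."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # encoder-decoder attention: downsample, upsample, then a per-pixel gate
        self.attn = nn.Sequential(
            nn.Conv2d(64, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 64, 4, stride=2, padding=1), nn.Sigmoid(),
        )
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim))

    def forward(self, x):                     # x: (B, 3, H, W)
        f = self.backbone(x)
        return self.head(f * self.attn(f))    # attention-weighted deep face features

class PoseGaitStream(nn.Module):
    """Per-frame encoder followed by an LSTM over the pose/gait sequence."""
    def __init__(self, in_dim=34, feat_dim=128):
        super().__init__()
        self.frame_enc = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())
        self.lstm = nn.LSTM(64, feat_dim, batch_first=True)

    def forward(self, seq):                   # seq: (B, T, in_dim) keypoints per frame
        out, _ = self.lstm(self.frame_enc(seq))
        return out[:, -1]                     # last hidden state as the sequence feature

class ContextStream(nn.Module):
    """CNN over the background image as situational context."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim),
        )

    def forward(self, x):
        return self.net(x)

class EmerskLikeFusion(nn.Module):
    """Early (feature-level) fusion: concatenate per-modality deep features."""
    def __init__(self, num_classes=4, feat_dim=128):   # 4 classes is an assumption
        super().__init__()
        self.face = FaceStream(feat_dim)
        self.pose = PoseGaitStream(feat_dim=feat_dim)
        self.context = ContextStream(feat_dim)
        self.classifier = nn.Sequential(nn.Linear(3 * feat_dim, 256), nn.ReLU(),
                                        nn.Linear(256, num_classes))

    def forward(self, face_img, pose_seq, context_img):
        fused = torch.cat([self.face(face_img), self.pose(pose_seq),
                           self.context(context_img)], dim=1)
        return self.classifier(fused)

model = EmerskLikeFusion()
logits = model(torch.randn(2, 3, 64, 64), torch.randn(2, 16, 34), torch.randn(2, 3, 64, 64))
print(logits.shape)  # torch.Size([2, 4])
```

Because each stream produces its own feature vector, a stream can be dropped (with the fusion classifier resized accordingly) when its modality is unavailable, which mirrors the add-or-remove modularity claimed in the abstract.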
Related papers
- Apprenticeship-Inspired Elegance: Synergistic Knowledge Distillation Empowers Spiking Neural Networks for Efficient Single-Eye Emotion Recognition [53.359383163184425]
We introduce a novel multimodality synergistic knowledge distillation scheme tailored for efficient single-eye emotion recognition tasks.
This method allows a lightweight, unimodal student spiking neural network (SNN) to extract rich knowledge from an event-frame multimodal teacher network.
arXiv Detail & Related papers (2024-06-20T07:24:47Z)
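The distillation scheme summarized in the entry above transfers knowledge from an event-frame multimodal teacher to a lightweight unimodal SNN student. As an illustration of the general mechanism only, the following is a standard response-based distillation objective; the paper's spiking-neuron student, surrogate-gradient training, and synergistic multi-teacher details are not reproduced here.

```python
# Generic response-based knowledge distillation loss (sketch, not the paper's method).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """alpha weights the soft-target KL term; T is the softmax temperature."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                        # rescale so gradients stay comparable across T
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# toy usage: teacher sees both modalities, student sees only frames (7 classes assumed)
teacher_logits = torch.randn(8, 7)
student_logits = torch.randn(8, 7, requires_grad=True)
labels = torch.randint(0, 7, (8,))
loss = distillation_loss(student_logits, teacher_logits.detach(), labels)
loss.backward()
```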
- Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive Learning for Multimodal Emotion Recognition [14.639340916340801]
We propose a novel Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive for Multimodal Emotion Recognition (AR-IIGCN) method.
Firstly, we input video, audio, and text features into a multi-layer perceptron (MLP) to map them into separate feature spaces.
Secondly, we build a generator and a discriminator for the three modal features through adversarial representation.
Thirdly, we introduce contrastive graph representation learning to capture intra-modal and inter-modal complementary semantic information.
arXiv Detail & Related papers (2023-12-28T01:57:26Z)
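Of the three AR-IIGCN steps listed above, the inter-modal contrastive component is the easiest to illustrate in isolation. The sketch below is an InfoNCE-style cross-modal loss under the assumption that row i of each feature matrix corresponds to the same utterance; the adversarial generator/discriminator and the graph construction are omitted.

```python
# Inter-modal contrastive loss sketch (one ingredient of the AR-IIGCN pipeline).
import torch
import torch.nn.functional as F

def inter_modal_contrastive(z_audio, z_text, temperature=0.1):
    """InfoNCE-style loss; row i of each tensor is assumed to be the same utterance."""
    za = F.normalize(z_audio, dim=1)
    zt = F.normalize(z_text, dim=1)
    logits = za @ zt.t() / temperature            # (N, N) cross-modal similarities
    targets = torch.arange(za.size(0), device=za.device)
    # symmetric loss: audio->text and text->audio directions
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = inter_modal_contrastive(torch.randn(16, 128), torch.randn(16, 128))
```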
- A Contextualized Real-Time Multimodal Emotion Recognition for Conversational Agents using Graph Convolutional Networks in Reinforcement Learning [0.800062359410795]
We present a novel paradigm for contextualized Emotion Recognition using Graph Convolutional Network with Reinforcement Learning (conER-GRL)
Conversations are partitioned into smaller groups of utterances for effective extraction of contextual information.
The system uses Gated Recurrent Units (GRU) to extract multimodal features from these groups of utterances.
arXiv Detail & Related papers (2023-10-24T14:31:17Z)
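For the conER-GRL entry above, the GRU-based step can be sketched as a small encoder that summarizes each group of utterances into one context vector. Feature dimensions and group size are assumptions; the GCN and reinforcement-learning components are not shown.

```python
# GRU-based utterance-group encoder sketch (conER-GRL's feature-extraction step only).
import torch
import torch.nn as nn

class UtteranceGroupEncoder(nn.Module):
    def __init__(self, in_dim=256, hidden=128):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True)

    def forward(self, group):              # group: (B, G, in_dim) fused utterance features
        _, h = self.gru(group)             # h: (1, B, hidden) final hidden state
        return h.squeeze(0)                # one context vector per utterance group

enc = UtteranceGroupEncoder()
ctx = enc(torch.randn(4, 5, 256))          # 4 groups of 5 utterances each
print(ctx.shape)                           # torch.Size([4, 128])
```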
- Versatile audio-visual learning for emotion recognition [28.26077129002198]
This study proposes a versatile audio-visual learning framework for handling unimodal and multimodal systems.
We achieve this effective representation learning with audio-visual shared layers, residual connections over shared layers, and a unimodal reconstruction task.
Notably, VAVL attains a new state-of-the-art performance in the emotional prediction task on the MSP-IMPROV corpus.
arXiv Detail & Related papers (2023-05-12T03:13:37Z)
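The VAVL entry above combines audio-visual shared layers, residual connections, and a unimodal reconstruction task so that training remains possible when only one modality is present. The following is a rough sketch of that combination; all dimensions and the reconstruction target are assumptions rather than the paper's specification.

```python
# Sketch of shared audio-visual layers with residuals and a unimodal reconstruction loss.
import torch
import torch.nn as nn

class SharedAVBlock(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.audio_enc = nn.Linear(80, dim)        # e.g. mel features -> embedding (assumed)
        self.visual_enc = nn.Linear(512, dim)      # e.g. face features -> embedding (assumed)
        self.shared = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.recon = nn.Linear(dim, dim)           # reconstruct the pre-shared embedding
        self.head = nn.Linear(dim, 4)              # emotion outputs (size assumed)

    def forward(self, audio=None, visual=None):
        outs, recon_loss = [], 0.0
        for x, enc in ((audio, self.audio_enc), (visual, self.visual_enc)):
            if x is None:
                continue                           # modality missing: skip its branch
            e = enc(x)
            s = e + self.shared(e)                 # residual connection over shared layers
            recon_loss = recon_loss + ((self.recon(s) - e.detach()) ** 2).mean()
            outs.append(s)
        fused = torch.stack(outs).mean(0)          # average whatever branches are available
        return self.head(fused), recon_loss

model = SharedAVBlock()
logits, aux = model(audio=torch.randn(8, 80))      # unimodal forward still works
```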
- GMSS: Graph-Based Multi-Task Self-Supervised Learning for EEG Emotion Recognition [48.02958969607864]
This paper proposes a graph-based multi-task self-supervised learning model (GMSS) for EEG emotion recognition.
By learning from multiple tasks simultaneously, GMSS can find a representation that captures all of the tasks.
Experiments on SEED, SEED-IV, and MPED datasets show that the proposed model has remarkable advantages in learning more discriminative and general features for EEG emotional signals.
arXiv Detail & Related papers (2022-04-12T03:37:21Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
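The entry above fuses the speech and text models only at the decision stage. A minimal late-fusion head, assuming each backbone already yields an utterance-level embedding, might look like the sketch below; the embedding sizes and learnable fusion weights are illustrative choices, not the paper's configuration.

```python
# Late-fusion sketch: modality-specific heads combined at the prediction level.
import torch
import torch.nn as nn

class LateFusionER(nn.Module):
    def __init__(self, speech_dim=192, text_dim=768, num_classes=4):
        super().__init__()
        self.speech_head = nn.Linear(speech_dim, num_classes)
        self.text_head = nn.Linear(text_dim, num_classes)
        # learnable weights for combining the two modality-specific predictions
        self.w = nn.Parameter(torch.ones(2))

    def forward(self, speech_emb, text_emb):
        preds = torch.stack([self.speech_head(speech_emb), self.text_head(text_emb)])
        weights = torch.softmax(self.w, dim=0).view(2, 1, 1)
        return (weights * preds).sum(0)             # fused class logits

model = LateFusionER()
logits = model(torch.randn(8, 192), torch.randn(8, 768))
```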
- Interpretability for Multimodal Emotion Recognition using Concept Activation Vectors [0.0]
We address the issue of interpretability for neural networks in the context of emotion recognition using Concept Activation Vectors (CAVs)
We define human-understandable concepts specific to Emotion AI and map them to the widely-used IEMOCAP multimodal database.
We then evaluate the influence of our proposed concepts at multiple layers of the Bi-directional Contextual LSTM (BC-LSTM) network.
arXiv Detail & Related papers (2022-02-02T15:02:42Z)
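The CAV-based interpretability described above can be summarized as: fit a linear classifier that separates a concept's activations from random activations at a chosen layer, take its normal vector as the CAV, and measure how the emotion logit changes along that direction. The sketch below follows the standard TCAV recipe; the concept sets and layer choice are assumptions, not the paper's setup.

```python
# Concept Activation Vector sketch: linear probe + directional sensitivity.
import numpy as np
from sklearn.linear_model import LogisticRegression

def compute_cav(concept_acts, random_acts):
    """concept_acts, random_acts: (N, D) activations from a chosen hidden layer."""
    X = np.concatenate([concept_acts, random_acts])
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    v = clf.coef_[0]
    return v / np.linalg.norm(v)                    # unit-norm CAV

def concept_sensitivity(grad_of_logit_wrt_acts, cav):
    """Positive values: moving along the concept direction raises the emotion logit."""
    return grad_of_logit_wrt_acts @ cav

cav = compute_cav(np.random.randn(50, 64) + 1.0, np.random.randn(50, 64))
print(concept_sensitivity(np.random.randn(10, 64), cav).shape)   # (10,)
```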
- Leveraging Semantic Scene Characteristics and Multi-Stream Convolutional Architectures in a Contextual Approach for Video-Based Visual Emotion Recognition in the Wild [31.40575057347465]
We tackle the task of video-based visual emotion recognition in the wild.
Standard methodologies that rely solely on the extraction of bodily and facial features often fall short of accurate emotion prediction.
We aspire to alleviate this problem by leveraging visual context in the form of scene characteristics and attributes.
arXiv Detail & Related papers (2021-05-16T17:31:59Z)
- Knowledge Distillation By Sparse Representation Matching [107.87219371697063]
We propose Sparse Representation Matching (SRM) to transfer intermediate knowledge from one Convolutional Network (CNN) to another by utilizing sparse representation.
We formulate SRM as a neural processing block, which can be efficiently optimized using gradient descent and integrated into any CNN in a plug-and-play manner.
Our experiments demonstrate that SRM is robust to architectural differences between the teacher and student networks, and outperforms other KD techniques across several datasets.
arXiv Detail & Related papers (2021-03-31T11:47:47Z)
- Continuous Emotion Recognition with Spatiotemporal Convolutional Neural Networks [82.54695985117783]
We investigate the suitability of state-of-the-art deep learning architectures for continuous emotion recognition using long video sequences captured in-the-wild.
We have developed and evaluated convolutional recurrent neural networks combining 2D-CNNs and long short-term memory units, and inflated 3D-CNN models, which are built by inflating the weights of a pre-trained 2D-CNN model during fine-tuning.
arXiv Detail & Related papers (2020-11-18T13:42:05Z)
- Semantics-aware Adaptive Knowledge Distillation for Sensor-to-Vision Action Recognition [131.6328804788164]
We propose a framework, named Semantics-aware Adaptive Knowledge Distillation Networks (SAKDN), to enhance action recognition in the vision-sensor modality (videos).
The SAKDN uses multiple wearable-sensors as teacher modalities and uses RGB videos as student modality.
arXiv Detail & Related papers (2020-09-01T03:38:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.