AMuSE: Adaptive Multimodal Analysis for Speaker Emotion Recognition in
Group Conversations
- URL: http://arxiv.org/abs/2401.15164v1
- Date: Fri, 26 Jan 2024 19:17:05 GMT
- Title: AMuSE: Adaptive Multimodal Analysis for Speaker Emotion Recognition in
Group Conversations
- Authors: Naresh Kumar Devulapally, Sidharth Anand, Sreyasee Das Bhattacharjee,
Junsong Yuan, Yu-Ping Chang
- Abstract summary: Multimodal Attention Network captures cross-modal interactions at various levels of spatial abstraction.
AMuSE model condenses both spatial and temporal features into two dense descriptors: speaker-level and utterance-level.
- Score: 39.79734528362605
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Analyzing individual emotions during group conversation is crucial in
developing intelligent agents capable of natural human-machine interaction.
While reliable emotion recognition techniques depend on different modalities
(text, audio, video), the inherent heterogeneity between these modalities and
the dynamic cross-modal interactions influenced by an individual's unique
behavioral patterns make the task of emotion recognition very challenging. This
difficulty is compounded in group settings, where the emotion and its temporal
evolution are not only influenced by the individual but also by external
contexts like audience reaction and context of the ongoing conversation. To
meet this challenge, we propose a Multimodal Attention Network that captures
cross-modal interactions at various levels of spatial abstraction by jointly
learning its interactive bunch of mode-specific Peripheral and Central
networks. The proposed MAN injects cross-modal attention via its Peripheral
key-value pairs within each layer of a mode-specific Central query network. The
resulting cross-attended mode-specific descriptors are then combined using an
Adaptive Fusion technique that enables the model to integrate the
discriminative and complementary mode-specific data patterns within an
instance-specific multimodal descriptor. Given a dialogue represented by a
sequence of utterances, the proposed AMuSE model condenses both spatial and
temporal features into two dense descriptors: speaker-level and
utterance-level. This helps not only in delivering better classification
performance (3-5% improvement in Weighted-F1 and 5-7% improvement in Accuracy)
in large-scale public datasets but also helps the users in understanding the
reasoning behind each emotion prediction made by the model via its Multimodal
Explainability Visualization module.
Related papers
- DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout.
DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder.
Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z) - AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features.
Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics.
arXiv Detail & Related papers (2024-04-12T11:31:18Z) - AM^2-EmoJE: Adaptive Missing-Modality Emotion Recognition in
Conversation via Joint Embedding Learning [42.69642087199678]
We propose AM2-EmoJE, a model for Adaptive Missing-Modality Emotion Recognition in Conversation via Joint Embedding Learning model.
By leveraging thetemporal details at the dialogue level, the proposed AM2-EmoJE not only demonstrates superior performance compared to the best-performing state-of-the-art multimodal methods.
arXiv Detail & Related papers (2024-01-26T19:57:26Z) - Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive Learning for Multimodal Emotion Recognition [14.639340916340801]
We propose a novel Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive for Multimodal Emotion Recognition (AR-IIGCN) method.
Firstly, we input video, audio, and text features into a multi-layer perceptron (MLP) to map them into separate feature spaces.
Secondly, we build a generator and a discriminator for the three modal features through adversarial representation.
Thirdly, we introduce contrastive graph representation learning to capture intra-modal and inter-modal complementary semantic information.
arXiv Detail & Related papers (2023-12-28T01:57:26Z) - Conversation Understanding using Relational Temporal Graph Neural
Networks with Auxiliary Cross-Modality Interaction [2.1261712640167856]
Emotion recognition is a crucial task for human conversation understanding.
We propose an input Temporal Graph Neural Network with Cross-Modality Interaction (CORECT)
CORECT effectively captures conversation-level cross-modality interactions and utterance-level temporal dependencies.
arXiv Detail & Related papers (2023-11-08T07:46:25Z) - A Transformer-Based Model With Self-Distillation for Multimodal Emotion
Recognition in Conversations [15.77747948751497]
We propose a transformer-based model with self-distillation (SDT) for the task.
The proposed model captures intra- and inter-modal interactions by utilizing intra- and inter-modal transformers.
We introduce self-distillation to transfer knowledge of hard and soft labels from the proposed model to each modality.
arXiv Detail & Related papers (2023-10-31T14:33:30Z) - Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z) - HCAM -- Hierarchical Cross Attention Model for Multi-modal Emotion
Recognition [41.837538440839815]
We propose a hierarchical cross-attention model (HCAM) approach to multi-modal emotion recognition.
The input to the model consists of two modalities, i) audio data, processed through a learnable wav2vec approach and, ii) text data represented using a bidirectional encoder representations from transformers (BERT) model.
In order to incorporate contextual knowledge and the information across the two modalities, the audio and text embeddings are combined using a co-attention layer.
arXiv Detail & Related papers (2023-04-14T03:25:00Z) - High-Modality Multimodal Transformer: Quantifying Modality & Interaction
Heterogeneity for High-Modality Representation Learning [112.51498431119616]
This paper studies efficient representation learning for high-modality scenarios involving a large set of diverse modalities.
A single model, HighMMT, scales up to 10 modalities (text, image, audio, video, sensors, proprioception, speech, time-series, sets, and tables) and 15 tasks from 5 research areas.
arXiv Detail & Related papers (2022-03-02T18:56:20Z) - Learning Modality Interaction for Temporal Sentence Localization and
Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method turns out to achieve state-of-the-art performances on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.