Conversation Understanding using Relational Temporal Graph Neural
Networks with Auxiliary Cross-Modality Interaction
- URL: http://arxiv.org/abs/2311.04507v3
- Date: Tue, 30 Jan 2024 08:01:42 GMT
- Title: Conversation Understanding using Relational Temporal Graph Neural
Networks with Auxiliary Cross-Modality Interaction
- Authors: Cam-Van Thi Nguyen, Anh-Tuan Mai, The-Son Le, Hai-Dang Kieu, Duc-Trong
Le
- Abstract summary: Emotion recognition is a crucial task for human conversation understanding.
We propose the Relational Temporal Graph Neural Network with Auxiliary Cross-Modality Interaction (CORECT).
CORECT effectively captures conversation-level cross-modality interactions and utterance-level temporal dependencies.
- Score: 2.1261712640167856
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Emotion recognition is a crucial task for human conversation understanding.
It becomes more challenging with the notion of multimodal data, e.g., language,
voice, and facial expressions. As a typical solution, global and local
context information is exploited to predict the emotional label for every
single sentence, i.e., utterance, in the dialogue. Specifically, the global
representation could be captured via modeling of cross-modal interactions at
the conversation level. The local one is often inferred using the temporal
information of speakers or emotional shifts, which neglects vital factors at
the utterance level. Additionally, most existing approaches take fused features
of multiple modalities as a unified input without leveraging modality-specific
representations. Motivated by these problems, we propose the Relational
Temporal Graph Neural Network with Auxiliary Cross-Modality Interaction
(CORECT), a novel neural network framework that effectively captures
conversation-level cross-modality interactions and utterance-level temporal
dependencies in a modality-specific manner for conversation understanding.
Extensive experiments demonstrate the effectiveness of CORECT via its
state-of-the-art results on the IEMOCAP and CMU-MOSEI datasets for the
multimodal Emotion Recognition in Conversations (ERC) task.
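
To make the described architecture more concrete, below is a minimal sketch (not the authors' released code) of how a relational temporal graph over modality-specific utterance nodes could be built and processed with a relational graph convolution layer. It assumes PyTorch and PyTorch Geometric's RGCNConv; the feature dimensions, temporal window, and relation types are illustrative choices, not CORECT's exact configuration.

```python
# Minimal sketch (not the authors' code): a relational temporal graph over
# modality-specific utterance nodes, processed with one relational GCN layer.
# Assumes PyTorch + PyTorch Geometric; sizes, window, and relations are illustrative.
import torch
from torch_geometric.nn import RGCNConv

num_utt, dim = 5, 64          # utterances in the dialogue, feature size
modalities = 3                # text, audio, visual
# One node per (modality, utterance); in practice, features come from modality encoders.
x = torch.randn(modalities * num_utt, dim)

def node(u, m):
    """Flat index of utterance u in modality m."""
    return m * num_utt + u

edges, rels = [], []
window = 2  # temporal context window (a hyperparameter in such models)
for m in range(modalities):
    for u in range(num_utt):
        # Relations 0/1: edges from past/future utterances within the same modality.
        for v in range(max(0, u - window), min(num_utt, u + window + 1)):
            if v != u:
                edges.append((node(v, m), node(u, m)))
                rels.append(0 if v < u else 1)
# Relation 2: cross-modality edges linking the same utterance across modalities.
for u in range(num_utt):
    for m1 in range(modalities):
        for m2 in range(modalities):
            if m1 != m2:
                edges.append((node(u, m1), node(u, m2)))
                rels.append(2)

edge_index = torch.tensor(edges, dtype=torch.long).t().contiguous()
edge_type = torch.tensor(rels, dtype=torch.long)

conv = RGCNConv(dim, dim, num_relations=3)
h = conv(x, edge_index, edge_type)   # relation-aware node representations
print(h.shape)                        # torch.Size([15, 64])
```

In a full model, these relation-aware representations would typically be combined with an auxiliary cross-modal attention component and passed to a per-utterance emotion classifier; the sketch only illustrates the graph-construction idea.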
Related papers
- Hierarchical Banzhaf Interaction for General Video-Language Representation Learning [60.44337740854767]
Multimodal representation learning plays an important role in the artificial intelligence domain.
We introduce a new approach that models video-text as game players using multivariate cooperative game theory.
We extend our original structure into a flexible encoder-decoder framework, enabling the model to adapt to various downstream tasks.
arXiv Detail & Related papers (2024-12-30T14:09:15Z) - Effective Context Modeling Framework for Emotion Recognition in Conversations [2.7175580940471913]
Emotion Recognition in Conversations (ERC) facilitates a deeper understanding of the emotions conveyed by speakers in each utterance within a conversation.
Recent Graph Neural Networks (GNNs) have demonstrated their strengths in capturing data relationships.
We propose ConxGNN, a novel GNN-based framework designed to capture contextual information in conversations.
arXiv Detail & Related papers (2024-12-21T02:22:06Z) - SDR-GNN: Spectral Domain Reconstruction Graph Neural Network for Incomplete Multimodal Learning in Conversational Emotion Recognition [14.645598552036908]
Multimodal Emotion Recognition in Conversations (MERC) aims to classify utterance emotions using textual, auditory, and visual modal features.
Most existing MERC methods assume each utterance has complete modalities, overlooking the common issue of incomplete modalities in real-world scenarios.
We propose a Spectral Domain Reconstruction Graph Neural Network (SDR-GNN) for incomplete multimodal learning in conversational emotion recognition.
arXiv Detail & Related papers (2024-11-29T16:31:50Z) - Efficient Long-distance Latent Relation-aware Graph Neural Network for Multi-modal Emotion Recognition in Conversations [8.107561045241445]
We propose an Efficient Long-distance Latent Relation-aware Graph Neural Network (ELR-GNN) for multi-modal emotion recognition in conversations.
ELR-GNN achieves state-of-the-art performance on the benchmark IEMOCAP and MELD, with running times reduced by 52% and 35%, respectively.
arXiv Detail & Related papers (2024-06-27T15:54:12Z) - AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features.
Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics.
arXiv Detail & Related papers (2024-04-12T11:31:18Z) - AMuSE: Adaptive Multimodal Analysis for Speaker Emotion Recognition in
Group Conversations [39.79734528362605]
Multimodal Attention Network captures cross-modal interactions at various levels of spatial abstraction.
AMuSE model condenses both spatial and temporal features into two dense descriptors: speaker-level and utterance-level.
arXiv Detail & Related papers (2024-01-26T19:17:05Z) - DER-GCN: Dialogue and Event Relation-Aware Graph Convolutional Neural Network for Multimodal Dialogue Emotion Recognition [14.639340916340801]
We propose a novel Dialogue and Event Relation-Aware Graph Convolutional Neural Network for Multimodal Emotion Recognition (DER-GCN) method.
It models dialogue relations between speakers and captures latent event relations information.
We conduct extensive experiments on the IEMOCAP and MELD benchmark datasets, which verify the effectiveness of the DER-GCN model.
arXiv Detail & Related papers (2023-12-17T01:49:40Z) - Re-mine, Learn and Reason: Exploring the Cross-modal Semantic
Correlations for Language-guided HOI detection [57.13665112065285]
Human-Object Interaction (HOI) detection is a challenging computer vision task.
We present a framework that enhances HOI detection by incorporating structured text knowledge.
arXiv Detail & Related papers (2023-07-25T14:20:52Z) - Multi-Modal Interaction Graph Convolutional Network for Temporal
Language Localization in Videos [55.52369116870822]
This paper focuses on tackling the problem of temporal language localization in videos.
It aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video.
arXiv Detail & Related papers (2021-10-12T14:59:25Z) - Topic-Aware Multi-turn Dialogue Modeling [91.52820664879432]
This paper presents a novel solution for multi-turn dialogue modeling, which segments and extracts topic-aware utterances in an unsupervised way.
Our topic-aware modeling is implemented by a newly proposed unsupervised topic-aware segmentation algorithm and Topic-Aware Dual-attention Matching (TADAM) Network.
arXiv Detail & Related papers (2020-09-26T08:43:06Z) - Learning Modality Interaction for Temporal Sentence Localization and
Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method achieves state-of-the-art performance on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)