Joyful: Joint Modality Fusion and Graph Contrastive Learning for
Multimodal Emotion Recognition
- URL: http://arxiv.org/abs/2311.11009v1
- Date: Sat, 18 Nov 2023 08:21:42 GMT
- Title: Joyful: Joint Modality Fusion and Graph Contrastive Learning for
Multimodal Emotion Recognition
- Authors: Dongyuan Li, Yusong Wang, Kotaro Funakoshi, and Manabu Okumura
- Abstract summary: Multimodal emotion recognition aims to recognize emotions for each utterance of multiple modalities.
Current graph-based methods fail to simultaneously depict global contextual features and local diverse uni-modal features in a dialogue.
We propose a method for joint modality fusion and graph contrastive learning for multimodal emotion recognition (Joyful).
- Score: 18.571931295274975
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal emotion recognition aims to recognize emotions for each utterance
of multiple modalities, which has received increasing attention for its
application in human-machine interaction. Current graph-based methods fail to
simultaneously depict global contextual features and local diverse uni-modal
features in a dialogue. Furthermore, with the number of graph layers
increasing, they easily fall into over-smoothing. In this paper, we propose a
method for joint modality fusion and graph contrastive learning for multimodal
emotion recognition (Joyful), where multimodality fusion, contrastive learning,
and emotion recognition are jointly optimized. Specifically, we first design a
new multimodal fusion mechanism that can provide deep interaction and fusion
between the global contextual and uni-modal specific features. Then, we
introduce a graph contrastive learning framework with inter-view and intra-view
contrastive losses to learn more distinguishable representations for samples
with different sentiments. Extensive experiments on three benchmark datasets
indicate that Joyful achieved state-of-the-art (SOTA) performance compared to
all baselines.
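The inter-view and intra-view contrastive losses mentioned in the abstract are commonly built on an InfoNCE-style objective: an anchor representation is pulled toward a positive view and pushed away from negatives. The sketch below is a generic illustration of that form in plain Python (toy vectors, invented temperature value), not Joyful's actual implementation.

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchor, positive, negatives, tau=0.5):
    """InfoNCE-style contrastive loss:
    -log( exp(sim(anchor, positive)/tau) / sum over all exp(sim/tau) ).
    Lower when the positive is close to the anchor and negatives are far."""
    pos = math.exp(cosine(anchor, positive) / tau)
    neg = sum(math.exp(cosine(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))
```

In a graph contrastive setting like the one described, the inter-view loss would treat the same utterance node in two augmented graph views as the (anchor, positive) pair, while the intra-view loss would pair utterances sharing an emotion label within one view.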
Related papers
- AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features.
Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics.
arXiv Detail & Related papers (2024-04-12T11:31:18Z)
- Joint Multimodal Transformer for Emotion Recognition in the Wild [49.735299182004404]
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems.
This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention.
arXiv Detail & Related papers (2024-03-15T17:23:38Z)
- A Multi-Task, Multi-Modal Approach for Predicting Categorical and Dimensional Emotions [0.0]
We propose a multi-task, multi-modal system that predicts categorical and dimensional emotions.
Results emphasise the importance of cross-regularisation between the two types of emotions.
arXiv Detail & Related papers (2023-12-31T16:48:03Z)
- From Text to Pixels: A Context-Aware Semantic Synergy Solution for Infrared and Visible Image Fusion [66.33467192279514]
We introduce a text-guided multi-modality image fusion method that leverages the high-level semantics from textual descriptions to integrate semantics from infrared and visible images.
Our method not only produces visually superior fusion results but also achieves a higher detection mAP over existing methods, achieving state-of-the-art results.
arXiv Detail & Related papers (2023-12-31T08:13:47Z)
- Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive Learning for Multimodal Emotion Recognition [15.4676247289299]
We propose a novel Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive for Multimodal Emotion Recognition (AR-IIGCN) method.
Firstly, we input video, audio, and text features into a multi-layer perceptron (MLP) to map them into separate feature spaces.
Secondly, we build a generator and a discriminator for the three modal features through adversarial representation.
Thirdly, we introduce contrastive graph representation learning to capture intra-modal and inter-modal complementary semantic information.
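The first step listed above, mapping each modality's features into its own space with an MLP, can be sketched as follows. All dimensions and initialization details here are invented for illustration; the actual AR-IIGCN architecture is not specified at this level in the summary.

```python
import random

random.seed(0)

def linear(x, w, b):
    # y = Wx + b for a single example; w has one row per output unit.
    return [sum(xi * wij for xi, wij in zip(x, row)) + bj
            for row, bj in zip(w, b)]

def relu(x):
    return [max(0.0, v) for v in x]

def make_mlp(d_in, d_out):
    # One hypothetical linear layer per modality, small random weights.
    w = [[random.gauss(0, 0.1) for _ in range(d_in)] for _ in range(d_out)]
    b = [0.0] * d_out
    return w, b

# Hypothetical input dimensions per modality, projected to a common size.
d_proj = 4
projections = {m: make_mlp(d, d_proj)
               for m, d in [("video", 6), ("audio", 5), ("text", 8)]}

def project(modality, features):
    """Map raw modality features into that modality's feature space."""
    w, b = projections[modality]
    return relu(linear(features, w, b))
```

The projected per-modality features would then feed the adversarial generator/discriminator pair (step two) and the intra-/inter-modal graph contrastive module (step three).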
arXiv Detail & Related papers (2023-12-28T01:57:26Z)
- MMOE: Mixture of Multimodal Interaction Experts [115.20477067767399]
MMOE stands for a mixture of multimodal interaction experts.
Our method automatically classifies data points from unlabeled multimodal datasets by their interaction type and employs specialized models for each specific interaction.
Based on our experiments, this approach improves performance on these challenging interactions by more than 10%, leading to an overall increase of 2% for tasks like sarcasm prediction.
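MMOE's core idea, classifying each data point by its interaction type and dispatching it to a specialized expert, can be illustrated with a toy router. The interaction classifier and experts below are stand-in heuristics (MMOE learns both from data); the type names follow the common redundancy/uniqueness/synergy taxonomy of multimodal interactions.

```python
def classify_interaction(text_score, image_score):
    # Toy heuristic standing in for MMOE's learned classifier:
    # redundancy when modalities agree, uniqueness when one modality
    # carries little signal, synergy otherwise (e.g. sarcasm, where the
    # combination means something neither modality does alone).
    if abs(text_score - image_score) < 0.1:
        return "redundancy"
    if min(text_score, image_score) < 0.2:
        return "uniqueness"
    return "synergy"

# One specialized (toy) expert per interaction type.
experts = {
    "redundancy": lambda t, i: (t + i) / 2,          # average agreeing cues
    "uniqueness": lambda t, i: max(t, i, key=abs),   # trust the dominant cue
    "synergy":    lambda t, i: t * i,                # model the joint effect
}

def mmoe_predict(text_score, image_score):
    """Route the example to the expert matching its interaction type."""
    kind = classify_interaction(text_score, image_score)
    return kind, experts[kind](text_score, image_score)
```

The reported gains come precisely from not forcing one model to handle all three interaction regimes at once.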
arXiv Detail & Related papers (2023-11-16T05:31:21Z)
- Re-mine, Learn and Reason: Exploring the Cross-modal Semantic Correlations for Language-guided HOI detection [57.13665112065285]
Human-Object Interaction (HOI) detection is a challenging computer vision task.
We present a framework that enhances HOI detection by incorporating structured text knowledge.
arXiv Detail & Related papers (2023-07-25T14:20:52Z)
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
The multimodal entity linking (MEL) task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)
- InterMulti: Multi-view Multimodal Interactions with Text-dominated Hierarchical High-order Fusion for Emotion Analysis [10.048903012988882]
We propose a multimodal emotion analysis framework, InterMulti, to capture complex multimodal interactions from different views.
Our proposed framework decomposes signals of different modalities into three kinds of multimodal interaction representations.
The THHF (text-dominated hierarchical high-order fusion) module integrates the above three kinds of representations into a comprehensive multimodal interaction representation.
arXiv Detail & Related papers (2022-12-20T07:02:32Z)
- EffMulti: Efficiently Modeling Complex Multimodal Interactions for Emotion Analysis [8.941102352671198]
We design three kinds of latent representations to refine the emotion analysis process.
A modality-semantic hierarchical fusion is proposed to reasonably incorporate these representations into a comprehensive interaction representation.
The experimental results demonstrate that our EffMulti outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2022-12-16T03:05:55Z)
- Multi-channel Attentive Graph Convolutional Network With Sentiment Fusion For Multimodal Sentiment Analysis [10.625579004828733]
This paper proposes a Multi-channel Attentive Graph Convolutional Network (MAGCN)
It consists of two main components: cross-modality interactive learning and sentimental feature fusion.
Experiments are conducted on three widely-used datasets.
arXiv Detail & Related papers (2022-01-25T12:38:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.