Joyful: Joint Modality Fusion and Graph Contrastive Learning for
Multimodal Emotion Recognition
- URL: http://arxiv.org/abs/2311.11009v1
- Date: Sat, 18 Nov 2023 08:21:42 GMT
- Title: Joyful: Joint Modality Fusion and Graph Contrastive Learning for
Multimodal Emotion Recognition
- Authors: Dongyuan Li, Yusong Wang, Kotaro Funakoshi, and Manabu Okumura
- Abstract summary: Multimodal emotion recognition aims to recognize the emotion of each utterance from multiple modalities.
Current graph-based methods fail to simultaneously depict global contextual features and local diverse uni-modal features in a dialogue.
We propose a method for joint modality fusion and graph contrastive learning for multimodal emotion recognition (Joyful).
- Score: 18.571931295274975
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal emotion recognition aims to recognize the emotion of each
utterance from multiple modalities, a task that has received increasing attention for its
application in human-machine interaction. Current graph-based methods fail to
simultaneously depict global contextual features and local diverse uni-modal
features in a dialogue. Furthermore, as the number of graph layers increases,
they easily suffer from over-smoothing. In this paper, we propose a
method for joint modality fusion and graph contrastive learning for multimodal
emotion recognition (Joyful), where multimodality fusion, contrastive learning,
and emotion recognition are jointly optimized. Specifically, we first design a
new multimodal fusion mechanism that can provide deep interaction and fusion
between the global contextual and uni-modal specific features. Then, we
introduce a graph contrastive learning framework with inter-view and intra-view
contrastive losses to learn more distinguishable representations for samples
with different sentiments. Extensive experiments on three benchmark datasets
indicate that Joyful achieves state-of-the-art (SOTA) performance, outperforming all baselines.
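As a concrete illustration, the sketch below shows, in PyTorch, what a jointly optimized objective combining emotion classification with inter-view and intra-view graph contrastive losses could look like. It is a minimal sketch, not the authors' implementation: it assumes an NT-Xent-style inter-view term between two augmented views of the dialogue graph and a label-aware (supervised-contrastive) intra-view term, and all names, shapes, and loss weights are illustrative.

```python
# Minimal sketch (not the authors' code) of the kind of joint objective
# described above: emotion classification plus inter-view and intra-view
# graph contrastive losses over utterance-node embeddings. The two "views"
# are assumed to come from two augmentations of the same dialogue graph.
import torch
import torch.nn.functional as F


def inter_view_loss(z1, z2, tau=0.5):
    """Inter-view term (NT-Xent style): utterance i in view 1 should be most
    similar to utterance i in view 2, and dissimilar to all other utterances."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                       # (N, N) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)


def intra_view_loss(z, labels, tau=0.5):
    """Intra-view term (supervised-contrastive style): within one view, pull
    utterances with the same emotion label together and push others apart."""
    z = F.normalize(z, dim=-1)
    sim = z @ z.t() / tau                            # (N, N) similarities
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask
    logits = sim.masked_fill(self_mask, -1e9)        # exclude self-pairs
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    denom = pos_mask.sum(1).clamp(min=1)             # avoid division by zero
    return -(log_prob * pos_mask).sum(1).div(denom).mean()


def joint_objective(view1, view2, class_logits, labels, alpha=0.1, beta=0.1):
    """Joint optimization: cross-entropy for emotion recognition plus the two
    contrastive terms, weighted by illustrative coefficients alpha and beta."""
    ce = F.cross_entropy(class_logits, labels)
    inter = inter_view_loss(view1, view2)
    intra = 0.5 * (intra_view_loss(view1, labels) + intra_view_loss(view2, labels))
    return ce + alpha * inter + beta * intra
```

For example, with view1 and view2 of shape (num_utterances, dim), class_logits of shape (num_utterances, num_emotions), and integer emotion labels, joint_objective returns a single scalar that can be backpropagated through both the fusion module and the graph encoder.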
Related papers
- Contrastive Learning-based Multi Modal Architecture for Emoticon Prediction by Employing Image-Text Pairs [13.922091192207718]
This research aims to analyze the relationship among sentences, visuals, and emoticons.
We propose a novel contrastive learning-based multimodal architecture.
The proposed model attains an accuracy of 91% and an MCC score of 90% when assessing emoticons.
arXiv Detail & Related papers (2024-08-05T15:45:59Z)
- AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features.
Experiments conducted with our AIMDiT framework on the public benchmark dataset MELD reveal improvements of 2.34% and 2.87% in the Acc-7 and w-F1 metrics, respectively.
arXiv Detail & Related papers (2024-04-12T11:31:18Z)
- Joint Multimodal Transformer for Emotion Recognition in the Wild [49.735299182004404]
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems.
This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention.
arXiv Detail & Related papers (2024-03-15T17:23:38Z)
- From Text to Pixels: A Context-Aware Semantic Synergy Solution for Infrared and Visible Image Fusion [66.33467192279514]
We introduce a text-guided multi-modality image fusion method that leverages the high-level semantics from textual descriptions to integrate semantics from infrared and visible images.
Our method not only produces visually superior fusion results but also achieves a higher detection mAP than existing methods, reaching state-of-the-art performance.
arXiv Detail & Related papers (2023-12-31T08:13:47Z)
- Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive Learning for Multimodal Emotion Recognition [14.639340916340801]
We propose a novel Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive Learning for Multimodal Emotion Recognition (AR-IIGCN) method.
Firstly, we input video, audio, and text features into a multi-layer perceptron (MLP) to map them into separate feature spaces.
Secondly, we build a generator and a discriminator for the three modal features through adversarial representation.
Thirdly, we introduce contrastive graph representation learning to capture intra-modal and inter-modal complementary semantic information.
arXiv Detail & Related papers (2023-12-28T01:57:26Z)
- Re-mine, Learn and Reason: Exploring the Cross-modal Semantic Correlations for Language-guided HOI detection [57.13665112065285]
Human-Object Interaction (HOI) detection is a challenging computer vision task.
We present a framework that enhances HOI detection by incorporating structured text knowledge.
arXiv Detail & Related papers (2023-07-25T14:20:52Z)
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
The multimodal entity linking (MEL) task aims to resolve ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)
- InterMulti: Multi-view Multimodal Interactions with Text-dominated Hierarchical High-order Fusion for Emotion Analysis [10.048903012988882]
We propose a multimodal emotion analysis framework, InterMulti, to capture complex multimodal interactions from different views.
Our proposed framework decomposes signals of different modalities into three kinds of multimodal interaction representations.
A text-dominated hierarchical high-order fusion (THHF) module integrates the above three kinds of representations into a comprehensive multimodal interaction representation.
arXiv Detail & Related papers (2022-12-20T07:02:32Z)
- EffMulti: Efficiently Modeling Complex Multimodal Interactions for Emotion Analysis [8.941102352671198]
We design three kinds of latent representations to refine the emotion analysis process.
A modality-semantic hierarchical fusion is proposed to incorporate these representations into a comprehensive interaction representation.
The experimental results demonstrate that our EffMulti outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2022-12-16T03:05:55Z)
- Vision+X: A Survey on Multimodal Learning in the Light of Data [64.03266872103835]
Multimodal machine learning that incorporates data from various sources has become an increasingly popular research area.
We analyze the commonness and uniqueness of each data format, mainly spanning vision, audio, text, and motion.
We investigate the existing literature on multimodal learning from both the representation learning and downstream application levels.
arXiv Detail & Related papers (2022-10-05T13:14:57Z)
- Multi-channel Attentive Graph Convolutional Network With Sentiment Fusion For Multimodal Sentiment Analysis [10.625579004828733]
This paper proposes a Multi-channel Attentive Graph Convolutional Network (MAGCN).
It consists of two main components: cross-modality interactive learning and sentimental feature fusion.
Experiments are conducted on three widely-used datasets.
arXiv Detail & Related papers (2022-01-25T12:38:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.