InterMulti: Multi-view Multimodal Interactions with Text-dominated Hierarchical High-order Fusion for Emotion Analysis
- URL: http://arxiv.org/abs/2212.10030v1
- Date: Tue, 20 Dec 2022 07:02:32 GMT
- Title: InterMulti: Multi-view Multimodal Interactions with Text-dominated Hierarchical High-order Fusion for Emotion Analysis
- Authors: Feng Qiu, Wanzeng Kong, Yu Ding
- Abstract summary: We propose a multimodal emotion analysis framework, InterMulti, to capture complex multimodal interactions from different views.
Our proposed framework decomposes signals of different modalities into three kinds of multimodal interaction representations.
The THHF module integrates the above three kinds of representations into a comprehensive multimodal interaction representation.
- Score: 10.048903012988882
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans are sophisticated at reading interlocutors' emotions from multimodal
signals, such as speech contents, voice tones and facial expressions. However,
machines might struggle to understand various emotions due to the difficulty of
effectively decoding emotions from the complex interactions between multimodal
signals. In this paper, we propose a multimodal emotion analysis framework,
InterMulti, to capture complex multimodal interactions from different views and
identify emotions from multimodal signals. Our proposed framework decomposes
signals of different modalities into three kinds of multimodal interaction
representations, including a modality-full interaction representation, a
modality-shared interaction representation, and three modality-specific
interaction representations. Additionally, to balance the contribution of
different modalities and learn a more informative latent interaction
representation, we develop a novel Text-dominated Hierarchical High-order
Fusion (THHF) module. The THHF module integrates the above three kinds of
representations into a comprehensive multimodal interaction representation.
Extensive experimental results on widely used datasets, i.e., MOSEI, MOSI and
IEMOCAP, demonstrate that our method outperforms the state-of-the-art.
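The abstract describes the architecture only at a high level, so the following is a minimal PyTorch sketch of the stated decomposition: one modality-full, one modality-shared, and three modality-specific interaction representations, fused with text as the dominant modality. The class name `InterMultiSketch`, the BERT/COVAREP/Facet-style input dimensions, and every layer choice are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of the decomposition described above.
import torch
import torch.nn as nn


class InterMultiSketch(nn.Module):
    def __init__(self, d_text=768, d_audio=74, d_vision=35, d_hidden=128, n_classes=7):
        super().__init__()
        # Project each modality into a common hidden space.
        self.proj = nn.ModuleDict({
            "t": nn.Linear(d_text, d_hidden),
            "a": nn.Linear(d_audio, d_hidden),
            "v": nn.Linear(d_vision, d_hidden),
        })
        # Modality-full interaction: all three modalities jointly.
        self.full = nn.Linear(3 * d_hidden, d_hidden)
        # Modality-shared interaction: one encoder applied to every modality.
        self.shared = nn.Linear(d_hidden, d_hidden)
        # Modality-specific interactions: one private encoder per modality.
        self.specific = nn.ModuleDict({m: nn.Linear(d_hidden, d_hidden) for m in "tav"})
        # Text-dominated fusion: text attends over the other cues, then an MLP merges everything.
        self.text_attn = nn.MultiheadAttention(d_hidden, num_heads=4, batch_first=True)
        self.fuse = nn.Sequential(
            nn.Linear(6 * d_hidden, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, n_classes),
        )

    def forward(self, text, audio, vision):
        h = {m: torch.relu(p(x)) for (m, p), x in
             zip(self.proj.items(), (text, audio, vision))}
        full = torch.relu(self.full(torch.cat([h["t"], h["a"], h["v"]], dim=-1)))
        shared = torch.stack([self.shared(h[m]) for m in "tav"]).mean(dim=0)
        spec = {m: torch.relu(self.specific[m](h[m])) for m in "tav"}
        # Text queries the other representations (text-dominated view of the interaction).
        others = torch.stack([full, shared, spec["a"], spec["v"]], dim=1)
        dom, _ = self.text_attn(h["t"].unsqueeze(1), others, others)
        z = torch.cat([full, shared, spec["t"], spec["a"], spec["v"], dom.squeeze(1)], dim=-1)
        return self.fuse(z)


if __name__ == "__main__":
    model = InterMultiSketch()
    logits = model(torch.randn(4, 768), torch.randn(4, 74), torch.randn(4, 35))
    print(logits.shape)  # torch.Size([4, 7])
```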
Related papers
- DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout.
DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder.
Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z)
- AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features.
Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics.
arXiv Detail & Related papers (2024-04-12T11:31:18Z)
- AMuSE: Adaptive Multimodal Analysis for Speaker Emotion Recognition in Group Conversations [39.79734528362605]
Multimodal Attention Network captures cross-modal interactions at various levels of spatial abstraction.
AMuSE model condenses both spatial and temporal features into two dense descriptors: speaker-level and utterance-level.
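As a rough illustration of the two descriptors mentioned above, the sketch below pools per-frame features first within an utterance and then across a speaker's utterances. The pooling scheme, layer choices, and dimensions are assumptions, not the AMuSE architecture.

```python
# Hedged sketch (not the AMuSE release) of condensing features into an
# utterance-level and a speaker-level descriptor.
import torch
import torch.nn as nn


class TwoDescriptorPooling(nn.Module):
    def __init__(self, d_in=128, d_out=64):
        super().__init__()
        self.utt_pool = nn.Sequential(nn.Linear(d_in, d_out), nn.Tanh())
        self.spk_pool = nn.Sequential(nn.Linear(d_out, d_out), nn.Tanh())

    def forward(self, frames):
        # frames: (num_utterances, num_frames, d_in) for a single speaker.
        utterance_desc = self.utt_pool(frames.mean(dim=1))        # (U, d_out)
        speaker_desc = self.spk_pool(utterance_desc.mean(dim=0))  # (d_out,)
        return utterance_desc, speaker_desc


utt, spk = TwoDescriptorPooling()(torch.randn(5, 20, 128))
print(utt.shape, spk.shape)  # torch.Size([5, 64]) torch.Size([64])
```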
arXiv Detail & Related papers (2024-01-26T19:17:05Z)
- Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive Learning for Multimodal Emotion Recognition [14.639340916340801]
We propose a novel Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive Learning for Multimodal Emotion Recognition (AR-IIGCN) method.
Firstly, we input video, audio, and text features into a multi-layer perceptron (MLP) to map them into separate feature spaces.
Secondly, we build a generator and a discriminator for the three modal features through adversarial representation.
Thirdly, we introduce contrastive graph representation learning to capture intra-modal and inter-modal complementary semantic information.
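The three steps above can be pictured with the hedged sketch below: per-modality MLPs, a modality discriminator that the encoders are trained to fool, and a simple InfoNCE-style term standing in for the graph contrastive stage. Every name, loss, and dimension here is an assumption rather than the AR-IIGCN code.

```python
# Rough sketch of the three listed steps, not the AR-IIGCN implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_v, d_a, d_t, d_h = 512, 100, 768, 128
mlps = nn.ModuleList([nn.Sequential(nn.Linear(d, d_h), nn.ReLU(), nn.Linear(d_h, d_h))
                      for d in (d_v, d_a, d_t)])
discriminator = nn.Sequential(nn.Linear(d_h, d_h), nn.ReLU(), nn.Linear(d_h, 3))

video, audio, text = torch.randn(8, d_v), torch.randn(8, d_a), torch.randn(8, d_t)
feats = [m(x) for m, x in zip(mlps, (video, audio, text))]           # step 1: per-modality MLPs

labels = torch.arange(3).repeat_interleave(8)                        # modality ids
d_loss = F.cross_entropy(discriminator(torch.cat(feats)), labels)    # step 2: discriminator loss
g_loss = -d_loss                                                     # encoders try to fool it

z1, z2 = F.normalize(feats[0], dim=-1), F.normalize(feats[2], dim=-1)
logits = z1 @ z2.t() / 0.1                                           # step 3: cross-modal
c_loss = F.cross_entropy(logits, torch.arange(8))                    # contrastive term
print(d_loss.item(), c_loss.item())
```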
arXiv Detail & Related papers (2023-12-28T01:57:26Z)
- Joyful: Joint Modality Fusion and Graph Contrastive Learning for Multimodal Emotion Recognition [18.571931295274975]
Multimodal emotion recognition aims to recognize emotions for each utterance of multiple modalities.
Current graph-based methods fail to simultaneously depict global contextual features and local diverse uni-modal features in a dialogue.
We propose a method for joint modality fusion and graph contrastive learning for multimodal emotion recognition (Joyful).
arXiv Detail & Related papers (2023-11-18T08:21:42Z)
- MMoE: Enhancing Multimodal Models with Mixtures of Multimodal Interaction Experts [92.76662894585809]
We introduce an approach to enhance multimodal models, which we call Multimodal Mixtures of Experts (MMoE).
MMoE can be applied to various types of models to improve their performance.
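The summary gives no architectural detail, so the sketch below only illustrates the general mixture-of-experts pattern the name suggests: a learned gate routes each input across a few expert heads. Treat it as a generic, assumption-laden example, not the MMoE design.

```python
# Generic mixture-of-experts gating, for illustration only.
import torch
import torch.nn as nn


class SimpleMoE(nn.Module):
    def __init__(self, d_in=256, n_experts=3, n_classes=7):
        super().__init__()
        self.gate = nn.Linear(d_in, n_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(), nn.Linear(128, n_classes))
             for _ in range(n_experts)])

    def forward(self, x):
        w = torch.softmax(self.gate(x), dim=-1)                  # (B, E) routing weights
        outs = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, C)
        return (w.unsqueeze(-1) * outs).sum(dim=1)               # weighted expert mix


print(SimpleMoE()(torch.randn(4, 256)).shape)  # torch.Size([4, 7])
```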
arXiv Detail & Related papers (2023-11-16T05:31:21Z)
- Multimodal Prompt Transformer with Hybrid Contrastive Learning for Emotion Recognition in Conversation [9.817888267356716]
Multimodal Emotion Recognition in Conversation (ERC) faces two problems.
Deep emotion cue extraction is performed on modalities with strong representation ability.
Feature filters are designed as multimodal prompt information for modalities with weak representation ability.
MPT embeds multimodal fusion information into each attention layer of the Transformer.
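A hedged sketch of that idea follows: a fused multimodal vector is projected into prompt tokens that are prepended at every Transformer encoder layer and dropped afterwards. The layer counts, projections, and prompt handling are assumptions, not the MPT implementation.

```python
# Generic "prompt tokens injected at each attention layer" sketch, not MPT itself.
import torch
import torch.nn as nn


class PromptedEncoder(nn.Module):
    def __init__(self, d_model=128, n_layers=4, n_prompts=2):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
             for _ in range(n_layers)])
        # One projection per layer turns the fused multimodal vector into prompt tokens.
        self.to_prompt = nn.ModuleList(
            [nn.Linear(d_model, n_prompts * d_model) for _ in range(n_layers)])
        self.n_prompts, self.d_model = n_prompts, d_model

    def forward(self, text_tokens, fused):
        # text_tokens: (B, T, d_model); fused: (B, d_model) multimodal fusion vector.
        x = text_tokens
        for layer, proj in zip(self.layers, self.to_prompt):
            prompts = proj(fused).view(-1, self.n_prompts, self.d_model)
            x = layer(torch.cat([prompts, x], dim=1))[:, self.n_prompts:]  # drop prompts
        return x


out = PromptedEncoder()(torch.randn(2, 10, 128), torch.randn(2, 128))
print(out.shape)  # torch.Size([2, 10, 128])
```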
arXiv Detail & Related papers (2023-10-04T13:54:46Z)
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
The multimodal entity linking (MEL) task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)
- EffMulti: Efficiently Modeling Complex Multimodal Interactions for Emotion Analysis [8.941102352671198]
We design three kinds of latent representations to refine the emotion analysis process.
A modality-semantic hierarchical fusion is proposed to incorporate these representations into a comprehensive interaction representation.
The experimental results demonstrate that our EffMulti outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2022-12-16T03:05:55Z)
- Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos [58.93586436289648]
We propose a multi-scale cooperative multimodal transformer (MCMulT) architecture for multimodal sentiment analysis.
Our model outperforms existing approaches on unaligned multimodal sequences and has strong performance on aligned multimodal sequences.
arXiv Detail & Related papers (2022-06-16T07:47:57Z)
- M3ED: Multi-modal Multi-scene Multi-label Emotional Dialogue Database [139.08528216461502]
We propose a Multi-modal Multi-scene Multi-label Emotional Dialogue dataset, M3ED.
M3ED contains 990 dyadic emotional dialogues from 56 different TV series, a total of 9,082 turns and 24,449 utterances.
To the best of our knowledge, M3ED is the first multimodal emotional dialogue dataset in Chinese.
arXiv Detail & Related papers (2022-05-09T06:52:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.