Revisiting Disentanglement and Fusion on Modality and Context in
Conversational Multimodal Emotion Recognition
- URL: http://arxiv.org/abs/2308.04502v2
- Date: Sat, 12 Aug 2023 06:05:26 GMT
- Title: Revisiting Disentanglement and Fusion on Modality and Context in
Conversational Multimodal Emotion Recognition
- Authors: Bobo Li, Hao Fei, Lizi Liao, Yu Zhao, Chong Teng, Tat-Seng Chua,
Donghong Ji, Fei Li
- Abstract summary: We argue that both the feature multimodality and conversational contextualization should be properly modeled simultaneously during the feature disentanglement and fusion steps.
We propose a Contribution-aware Fusion Mechanism (CFM) and a Context Refusion Mechanism (CRM) for multimodal and context integration.
Our system achieves new state-of-the-art performance consistently.
- Score: 81.2011058113579
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: It has been a hot research topic to enable machines to understand human
emotions in multimodal contexts under dialogue scenarios, a task known as
multimodal emotion recognition in conversation (MM-ERC). MM-ERC has received
consistent attention in recent years, and a diverse range of methods has been
proposed to secure better task performance. Most existing works treat MM-ERC
as a standard multimodal classification problem and perform multimodal feature
disentanglement and fusion for maximizing feature utility. Yet after revisiting
the characteristic of MM-ERC, we argue that both the feature multimodality and
conversational contextualization should be properly modeled simultaneously
during the feature disentanglement and fusion steps. In this work, we aim to
further push task performance by taking the above insights into full
consideration. On the one hand, during feature disentanglement, based on the
contrastive learning technique, we devise a Dual-level Disentanglement
Mechanism (DDM) to decouple the features into both the modality space and
utterance space. On the other hand, during the feature fusion stage, we propose
a Contribution-aware Fusion Mechanism (CFM) and a Context Refusion Mechanism
(CRM) for multimodal and context integration, respectively. They together
schedule the proper integrations of multimodal and context features.
Specifically, CFM explicitly manages the multimodal feature contributions
dynamically, while CRM flexibly coordinates the introduction of dialogue
contexts. On two public MM-ERC datasets, our system achieves new
state-of-the-art performance consistently. Further analyses demonstrate that
all our proposed mechanisms greatly facilitate the MM-ERC task by making full
use of the multimodal and context features adaptively. Note that our proposed
methods have great potential to facilitate a broader range of other
conversational multimodal tasks.
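To make the two-stage design described in the abstract concrete, here is a minimal PyTorch sketch of the general pattern: a supervised contrastive loss that can be applied at two levels (with modality labels and with utterance labels) as a stand-in for DDM, and a gated fusion module standing in for CFM/CRM. All module names, tensor shapes, and the gating scheme are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the DDM / CFM+CRM pattern described above.
# Names, shapes, and gating choices are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

def supervised_contrastive_loss(feats, labels, tau=0.07):
    """Pulls together features that share a label and pushes apart the rest.
    For DDM-style dual-level disentanglement this would be applied twice:
    once with modality labels and once with utterance labels (an assumption
    based on the abstract, which does not spell out the exact losses)."""
    z = F.normalize(feats, dim=-1)                               # (N, d)
    sim = z @ z.t() / tau                                        # (N, N)
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float("-inf"))              # drop self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0).sum(1)
    return -(pos_log_prob / pos_mask.sum(1).clamp(min=1)).mean()

class ContributionAwareFusion(nn.Module):
    """Weights each modality by a learned, input-dependent score (a stand-in
    for CFM), then gates in a dialogue-context vector (a stand-in for CRM)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.ctx_gate = nn.Linear(2 * dim, dim)

    def forward(self, modality_feats, context):
        # modality_feats: (B, M, d) for M modalities; context: (B, d)
        w = torch.softmax(self.score(modality_feats), dim=1)     # per-modality weight
        fused = (w * modality_feats).sum(dim=1)                  # contribution-aware sum
        g = torch.sigmoid(self.ctx_gate(torch.cat([fused, context], dim=-1)))
        return fused + g * context                               # context added adaptively
```

In training, both contrastive terms would typically be added to the emotion-classification objective with weighting coefficients; the exact DDM/CFM/CRM formulations should be taken from the paper itself.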
Related papers
- Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation [12.455034591553506]
Multimodal Emotion Recognition in Conversation (MERC) can be applied to public opinion monitoring, intelligent dialogue robots, and other fields.
Previous work ignored the inter-modal alignment process and the intra-modal noise information before multimodal fusion.
We have developed a novel approach called Masked Graph Learning with Recurrent Alignment (MGLRA) to tackle this problem.
arXiv Detail & Related papers (2024-07-23T02:23:51Z)
- Revisiting Multi-modal Emotion Learning with Broad State Space Models and Probability-guidance Fusion [14.14051929942914]
We argue that long-range contextual semantic information should be extracted in the feature disentanglement stage and the inter-modal semantic information consistency should be maximized in the feature fusion stage.
Inspired by recent State Space Models (SSMs), we propose a Broad Mamba, which does not rely on a self-attention mechanism for sequence modeling.
We show that the proposed method can overcome the computational and memory limitations of Transformer when modeling long-distance contexts.
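As a rough illustration of why a state-space recurrence sidesteps the quadratic cost of self-attention, here is a toy diagonal SSM layer that scans a dialogue in linear time. This is a generic SSM sketch under assumed shapes, not the Broad Mamba architecture, whose selective-scan design is not described in the blurb.

```python
import torch
import torch.nn as nn

class DiagonalSSM(nn.Module):
    """Toy diagonal state-space layer: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t.
    Runs in O(T) time with O(1) state per step, vs. O(T^2) for self-attention.
    Generic SSM sketch only, not the Broad Mamba selective-scan design."""
    def __init__(self, dim):
        super().__init__()
        self.log_a = nn.Parameter(torch.full((dim,), -0.5))   # per-channel decay
        self.b = nn.Parameter(torch.ones(dim))
        self.c = nn.Parameter(torch.ones(dim))

    def forward(self, x):                  # x: (B, T, dim), one dialogue per row
        a = torch.exp(self.log_a).clamp(max=0.999)             # keep recurrence stable
        h = torch.zeros(x.size(0), x.size(2), device=x.device)
        ys = []
        for t in range(x.size(1)):         # sequential scan over utterances
            h = a * h + self.b * x[:, t]
            ys.append(self.c * h)
        return torch.stack(ys, dim=1)      # (B, T, dim)
```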
arXiv Detail & Related papers (2024-04-27T10:22:03Z)
- AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features.
Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics.
arXiv Detail & Related papers (2024-04-12T11:31:18Z)
- M&M: Multimodal-Multitask Model Integrating Audiovisual Cues in Cognitive Load Assessment [0.0]
This paper introduces the M&M model, a novel multimodal-multitask learning framework, applied to the AVCAffe dataset for cognitive load assessment.
M&M uniquely integrates audiovisual cues through a dual-pathway architecture, featuring specialized streams for audio and video inputs.
A key innovation lies in its cross-modality multihead attention mechanism, fusing the different modalities for synchronized multitasking.
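A cross-modality multihead attention fusion of the kind mentioned here can be sketched, under assumed shapes and without the M&M specifics, by letting each stream attend to the other with standard multihead attention:

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Each stream queries the other modality with multihead attention, then
    the two attended streams are pooled and concatenated for the multitask
    heads. Shapes and the pooling choice are illustrative assumptions."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, video):       # (B, Ta, dim), (B, Tv, dim)
        a_ctx, _ = self.a2v(query=audio, key=video, value=video)
        v_ctx, _ = self.v2a(query=video, key=audio, value=audio)
        # mean-pool over time and concatenate for downstream task heads
        return torch.cat([a_ctx.mean(dim=1), v_ctx.mean(dim=1)], dim=-1)
```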
arXiv Detail & Related papers (2024-03-14T14:49:40Z)
- MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples [63.78384552789171]
This paper introduces Multi-Modal In-Context Tuning (MMICT), a novel multi-modal fine-tuning paradigm.
We propose the Multi-Modal Hub (M-Hub), a unified module that captures various multi-modal features according to different inputs and objectives.
Based on M-Hub, MMICT enables MM-LLMs to learn from in-context visual-guided textual features and subsequently generate outputs conditioned on the textual-guided visual features.
arXiv Detail & Related papers (2023-12-11T13:11:04Z)
- MMoE: Enhancing Multimodal Models with Mixtures of Multimodal Interaction Experts [92.76662894585809]
We introduce an approach to enhance multimodal models, which we call Multimodal Mixtures of Experts (MMoE).
MMoE can be applied to various types of models to gain improvements.
arXiv Detail & Related papers (2023-11-16T05:31:21Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
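In spirit, an implicit query that "adaptively aggregates global contextual cues within each modality" resembles a learnable query vector attending over that modality's tokens; the sketch below is a generic stand-in under assumed shapes, not the paper's IMQ definition.

```python
import torch
import torch.nn as nn

class LearnableQueryPooling(nn.Module):
    """A learnable query attends over one modality's token sequence to pool
    global context into a single vector -- a generic stand-in for the IMQ idea."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens):             # tokens: (B, T, dim) for one modality
        q = self.query.expand(tokens.size(0), -1, -1)   # (B, 1, dim)
        pooled, _ = self.attn(q, tokens, tokens)
        return pooled.squeeze(1)           # (B, dim) global cue for this modality
```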
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- MCM: Multi-condition Motion Synthesis Framework for Multi-scenario [28.33039094451924]
We introduce MCM, a novel paradigm for motion synthesis that spans multiple scenarios under diverse conditions.
The MCM framework is able to integrate with any DDPM-like diffusion model to accommodate multi-conditional information input.
Our approach achieves SoTA results in text-to-motion and competitive results in music-to-dance, comparable to task-specific methods.
arXiv Detail & Related papers (2023-09-06T14:17:49Z)
- Deep Equilibrium Multimodal Fusion [88.04713412107947]
Multimodal fusion integrates the complementary information present in multiple modalities and has gained much attention recently.
We propose a novel deep equilibrium (DEQ) method towards multimodal fusion via seeking a fixed point of the dynamic multimodal fusion process.
Experiments on BRCA, MM-IMDB, CMU-MOSI, SUN RGB-D, and VQA-v2 demonstrate the superiority of our DEQ fusion.
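The fixed-point idea can be illustrated with a forward-only iteration that repeatedly applies a fusion map until the fused state stops changing; the actual DEQ fusion additionally relies on implicit differentiation for training and a specific fusion cell, neither of which is reproduced in this sketch.

```python
import torch
import torch.nn as nn

class FixedPointFusion(nn.Module):
    """Iterates z <- f(z, x) until approximate convergence, treating the fused
    representation as a fixed point of the fusion map f. Forward-only sketch;
    real DEQ training backpropagates through the fixed point implicitly."""
    def __init__(self, dim, n_modalities):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(dim * (n_modalities + 1), dim), nn.Tanh())

    def forward(self, modality_feats, max_iter=30, tol=1e-4):
        # modality_feats: list of (B, dim) tensors, one per modality
        x = torch.cat(modality_feats, dim=-1)            # fixed input injection
        z = torch.zeros_like(modality_feats[0])          # initial fused state
        for _ in range(max_iter):
            z_next = self.f(torch.cat([z, x], dim=-1))
            if (z_next - z).norm() < tol * (z.norm() + 1e-8):
                z = z_next
                break
            z = z_next
        return z                                         # approximate fixed point
```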
arXiv Detail & Related papers (2023-06-29T03:02:20Z)
- MM-DFN: Multimodal Dynamic Fusion Network for Emotion Recognition in Conversations [5.5997926295092295]
Multimodal Emotion Recognition in Conversations (ERC) has considerable prospects for developing empathetic machines.
Recent graph-based fusion methods aggregate multimodal information by exploring unimodal and cross-modal interactions in a graph.
We propose a novel Multimodal Dynamic Fusion Network (MM-DFN) to recognize emotions by fully understanding multimodal conversational context.
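Graph-based fusion of this kind typically places each (utterance, modality) pair on a node and aggregates over unimodal and cross-modal edges. The sketch below shows one generic message-passing step under that assumption; the adjacency construction and dynamic fusion layers of MM-DFN are not modeled here.

```python
import torch
import torch.nn as nn

class GraphFusionLayer(nn.Module):
    """One message-passing step over (utterance, modality) nodes. adj is
    assumed to encode unimodal edges (same modality, neighboring utterances)
    and cross-modal edges (same utterance, different modality)."""
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.update = nn.GRUCell(dim, dim)

    def forward(self, node_feats, adj):
        # node_feats: (N, dim); adj: (N, N), rows normalized to sum to 1
        messages = adj @ self.msg(node_feats)       # aggregate neighbor messages
        return self.update(messages, node_feats)    # GRU-style node update
```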
arXiv Detail & Related papers (2022-03-04T15:42:53Z)