MCIHN: A Hybrid Network Model Based on Multi-path Cross-modal Interaction for Multimodal Emotion Recognition
- URL: http://arxiv.org/abs/2510.24827v1
- Date: Tue, 28 Oct 2025 16:04:03 GMT
- Title: MCIHN: A Hybrid Network Model Based on Multi-path Cross-modal Interaction for Multimodal Emotion Recognition
- Authors: Haoyang Zhang, Zhou Yang, Ke Sun, Yucai Pang, Guoliang Xu
- Abstract summary: We propose a hybrid network model based on multi-path cross-modal interaction (MCIHN). Adversarial autoencoders (AAE) are constructed separately for each modality. The latent codes are fed into a predefined Cross-modal Gate Mechanism model (CGMM).
- Score: 7.944119407791842
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal emotion recognition is crucial for future human-computer interaction. However, accurate emotion recognition still faces significant challenges due to differences between modalities and the difficulty of characterizing unimodal emotional information. To address these problems, a hybrid network model based on multi-path cross-modal interaction (MCIHN) is proposed. First, adversarial autoencoders (AAE) are constructed separately for each modality. Each AAE learns discriminative emotion features and reconstructs them through a decoder to obtain more discriminative information about the emotion classes. Then, the latent codes from the AAEs of the different modalities are fed into a predefined Cross-modal Gate Mechanism model (CGMM) to reduce the discrepancy between modalities, establish the emotional relationship between interacting modalities, and generate interaction features between the modalities. Multimodal fusion is then performed by the Feature Fusion module (FFM) for better emotion recognition. Experiments on the publicly available SIMS and MOSI datasets demonstrate that MCIHN achieves superior performance.
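The abstract describes a three-stage flow: per-modality adversarial autoencoders produce discriminative latent codes, the CGMM exchanges information between those codes through a gating mechanism, and the FFM fuses the resulting interaction features for classification. The paper's implementation details are not given here, so the following PyTorch code is only a minimal sketch of that flow; the layer sizes, the sigmoid-gate formulation, the simple concatenation standing in for the FFM, and the class names (SimpleAAE, CrossModalGate, MCIHNSketch) are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SimpleAAE(nn.Module):
    """Per-modality autoencoder (the adversarial discriminator that shapes
    the latent space in a full AAE is omitted for brevity)."""
    def __init__(self, in_dim: int, latent_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        z = self.encoder(x)      # discriminative latent code
        recon = self.decoder(z)  # reconstruction for an auxiliary loss
        return z, recon

class CrossModalGate(nn.Module):
    """Hypothetical stand-in for the CGMM: a sigmoid gate controls how much
    of modality b's latent code flows into modality a's representation."""
    def __init__(self, latent_dim: int = 64):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * latent_dim, latent_dim), nn.Sigmoid())

    def forward(self, z_a, z_b):
        g = self.gate(torch.cat([z_a, z_b], dim=-1))
        return z_a + g * z_b     # interaction feature for the (a, b) pair

class MCIHNSketch(nn.Module):
    """End-to-end flow: AAEs -> pairwise gated interactions -> fused classifier."""
    def __init__(self, dims, latent_dim: int = 64, num_classes: int = 3):
        super().__init__()
        self.aaes = nn.ModuleList([SimpleAAE(d, latent_dim) for d in dims])
        self.gate = CrossModalGate(latent_dim)
        n = len(dims)
        self.classifier = nn.Linear(n * (n - 1) * latent_dim, num_classes)

    def forward(self, inputs):
        latents = [self.aaes[i](x)[0] for i, x in enumerate(inputs)]
        interactions = [self.gate(latents[i], latents[j])
                        for i in range(len(latents))
                        for j in range(len(latents)) if i != j]
        fused = torch.cat(interactions, dim=-1)  # concatenation as the FFM stand-in
        return self.classifier(fused)

# Toy usage with text / audio / visual feature vectors of arbitrary sizes;
# num_classes is illustrative and not taken from the paper.
model = MCIHNSketch(dims=[768, 74, 35])
batch = [torch.randn(4, 768), torch.randn(4, 74), torch.randn(4, 35)]
logits = model(batch)  # shape: (4, num_classes)
```

The gate is what distinguishes this flow from plain concatenation: each modality decides, feature by feature, how much of the other modalities' latent information to absorb before fusion.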
Related papers
- Memory-guided Prototypical Co-occurrence Learning for Mixed Emotion Recognition [56.00118641432005]
We propose a Memory-guided Prototypical Co-occurrence Learning framework that explicitly models emotion co-occurrence patterns.
Inspired by human cognitive memory systems, we introduce a memory retrieval strategy to extract semantic-level co-occurrence associations.
Our model learns affectively informative representations for accurate emotion distribution prediction.
arXiv Detail & Related papers (2026-02-24T04:11:25Z) - MCN-CL: Multimodal Cross-Attention Network and Contrastive Learning for Multimodal Emotion Recognition [8.732416479560605]
This paper proposes Multimodal Cross-Attention Network and Contrastive Learning (MCN-CL) for multimodal emotion recognition.
It uses a triple query mechanism and a hard negative mining strategy to remove feature redundancy while preserving important emotional cues.
Experimental results on the IEMOCAP and MELD datasets show that our proposed method outperforms state-of-the-art approaches.
arXiv Detail & Related papers (2025-11-14T02:13:31Z) - GIA-MIC: Multimodal Emotion Recognition with Gated Interactive Attention and Modality-Invariant Learning Constraints [24.242098942377574]
Multimodal emotion recognition (MER) extracts emotions from multimodal data, including visual, speech, and text inputs, playing a key role in human-computer interaction.
We propose a gated interactive attention mechanism to adaptively extract modality-specific features while enhancing emotional information through pairwise interactions.
Experiments on IEMOCAP demonstrate that our method outperforms state-of-the-art MER approaches, achieving WA 80.7% and UA 81.3%.
arXiv Detail & Related papers (2025-06-01T07:07:02Z) - Enhancing Emotion Recognition in Conversation through Emotional Cross-Modal Fusion and Inter-class Contrastive Learning [40.101313334772016]
The purpose of emotion recognition in conversation (ERC) is to identify the emotion category of an utterance based on contextual information.
Previous ERC methods relied on simple connections for cross-modal fusion.
We propose a cross-modal fusion emotion prediction network based on vector connections.
arXiv Detail & Related papers (2024-05-28T07:22:30Z) - Modality Prompts for Arbitrary Modality Salient Object Detection [57.610000247519196]
This paper delves into the task of arbitrary modality salient object detection (AM SOD).
It aims to detect salient objects from arbitrary modalities, e.g., RGB images, RGB-D images, and RGB-D-T images.
A novel modality-adaptive Transformer (MAT) is proposed to investigate two fundamental challenges of AM SOD.
arXiv Detail & Related papers (2024-05-06T11:02:02Z) - AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features.
Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics.
arXiv Detail & Related papers (2024-04-12T11:31:18Z) - Joint Multimodal Transformer for Emotion Recognition in the Wild [49.735299182004404]
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems.
This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention.
arXiv Detail & Related papers (2024-03-15T17:23:38Z) - Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive Learning for Multimodal Emotion Recognition [14.639340916340801]
We propose a novel Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive Learning for Multimodal Emotion Recognition (AR-IIGCN) method.
Firstly, we input video, audio, and text features into a multi-layer perceptron (MLP) to map them into separate feature spaces.
Secondly, we build a generator and a discriminator for the three modal features through adversarial representation.
Thirdly, we introduce contrastive graph representation learning to capture intra-modal and inter-modal complementary semantic information.
arXiv Detail & Related papers (2023-12-28T01:57:26Z) - CFN-ESA: A Cross-Modal Fusion Network with Emotion-Shift Awareness for Dialogue Emotion Recognition [34.24557248359872]
We propose a Cross-modal Fusion Network with Emotion-Shift Awareness (CFN-ESA) for emotion recognition in conversation.
CFN-ESA consists of a unimodal encoder (RUME), a cross-modal encoder (ACME), and an emotion-shift module (LESM).
arXiv Detail & Related papers (2023-07-28T09:29:42Z) - Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z) - Group Gated Fusion on Attention-based Bidirectional Alignment for Multimodal Emotion Recognition [63.07844685982738]
This paper presents a new model named the Gated Bidirectional Alignment Network (GBAN), which consists of an attention-based bidirectional alignment network over LSTM hidden states.
We empirically show that the attention-aligned representations significantly outperform the last hidden states of the LSTM (a generic sketch of this gated cross-modal attention pattern follows this list).
The proposed GBAN model outperforms existing state-of-the-art multimodal approaches on the IEMOCAP dataset.
arXiv Detail & Related papers (2022-01-17T09:46:59Z)
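Several entries above describe variants of the same mechanism: a cross-modal attention step whose output is modulated by a learned gate (the gated interactive attention of GIA-MIC, the key-based cross-attention of the joint multimodal transformer, the group gated fusion of GBAN). None of those papers' code is reproduced here; the snippet below is only a generic PyTorch sketch of the shared idea, and the class name GatedCrossAttention, the sigmoid gate, and the toy dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Generic gated cross-modal attention: queries come from one modality,
    keys/values from another, and a learned gate controls how much of the
    attended information is mixed back into the query stream."""
    def __init__(self, dim: int = 64, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_seq, context_seq):
        attended, _ = self.attn(query_seq, context_seq, context_seq)
        g = self.gate(torch.cat([query_seq, attended], dim=-1))
        return self.norm(query_seq + g * attended)

# Toy usage: text tokens attend over audio frames.
text = torch.randn(2, 20, 64)   # (batch, text_len, dim)
audio = torch.randn(2, 50, 64)  # (batch, audio_len, dim)
fused_text = GatedCrossAttention()(text, audio)  # (2, 20, 64)
```

Queries from one modality attend over another modality's sequence, and the gate limits how much of the attended signal is mixed back in, which is also the role the CGMM plays in MCIHN at the latent-code level.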