ECMF: Enhanced Cross-Modal Fusion for Multimodal Emotion Recognition in MER-SEMI Challenge
- URL: http://arxiv.org/abs/2508.05991v1
- Date: Fri, 08 Aug 2025 03:55:25 GMT
- Title: ECMF: Enhanced Cross-Modal Fusion for Multimodal Emotion Recognition in MER-SEMI Challenge
- Authors: Juewen Hu, Yexin Li, Jiulin Li, Shuo Chen, Pring Wong
- Abstract summary: We tackle the MER-SEMI challenge of the MER2025 competition by proposing a novel multimodal emotion recognition framework. We leverage large-scale pre-trained models to extract informative features from visual, audio, and textual modalities. Our approach achieves a substantial performance improvement over the official baseline on the MER2025-SEMI dataset.
- Score: 5.217410271468519
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Emotion recognition plays a vital role in enhancing human-computer interaction. In this study, we tackle the MER-SEMI challenge of the MER2025 competition by proposing a novel multimodal emotion recognition framework. To address the issue of data scarcity, we leverage large-scale pre-trained models to extract informative features from visual, audio, and textual modalities. Specifically, for the visual modality, we design a dual-branch visual encoder that captures both global frame-level features and localized facial representations. For the textual modality, we introduce a context-enriched method that employs large language models to enrich emotional cues within the input text. To effectively integrate these multimodal features, we propose a fusion strategy comprising two key components, namely self-attention mechanisms for dynamic modality weighting and residual connections to preserve the original representations. Beyond architectural design, we further refine noisy labels in the training set using a multi-source labeling strategy. Our approach achieves a substantial performance improvement over the official baseline on the MER2025-SEMI dataset, attaining a weighted F-score of 87.49% compared to 78.63%, thereby validating the effectiveness of the proposed framework.
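The abstract describes the fusion strategy only at a high level: modality features are re-weighted by self-attention, and residual connections keep the original representations intact. Below is a minimal sketch of that idea in PyTorch; the class name, feature dimensions, and pooling choice are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Hypothetical sketch of the described fusion strategy: one token per
    modality, self-attention for dynamic modality weighting, and a residual
    connection that preserves the original per-modality representations."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor,
                text: torch.Tensor) -> torch.Tensor:
        # Stack one feature vector per modality: (batch, 3, dim)
        tokens = torch.stack([visual, audio, text], dim=1)
        # Self-attention lets each modality be re-weighted against the others
        attended, _ = self.attn(tokens, tokens, tokens)
        # Residual connection keeps the original representations
        fused = self.norm(tokens + attended)
        # Pool across modalities before an emotion classifier head
        return fused.mean(dim=1)


if __name__ == "__main__":
    # Usage sketch: 512-d features per modality are an assumed dimension.
    b, d = 4, 512
    v, a, t = torch.randn(b, d), torch.randn(b, d), torch.randn(b, d)
    print(CrossModalFusion(dim=d)(v, a, t).shape)  # torch.Size([4, 512])
```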
Related papers
- A Unified Framework for Emotion Recognition and Sentiment Analysis via Expert-Guided Multimodal Fusion with Large Language Models [16.195689085967004]
We present EGMF, a unified framework combining expert-guided multimodal fusion with large language models. Our approach features three specialized expert networks: a fine-grained local expert for subtle emotional nuances, a semantic correlation expert for cross-modal relationships, and a global context expert for long-range dependencies.
arXiv Detail & Related papers (2026-01-12T14:21:32Z) - Sync-TVA: A Graph-Attention Framework for Multimodal Emotion Recognition with Cross-Modal Fusion [7.977094562068075]
We propose Sync-TVA, an end-to-end graph-attention framework featuring modality-specific dynamic enhancement and structured cross-modal fusion. Our design incorporates a dynamic enhancement module for each modality and constructs heterogeneous cross-modal graphs to model semantic relations across text, audio, and visual features. Experiments on MELD and IEMOCAP demonstrate consistent improvements over state-of-the-art models in both accuracy and weighted F1 score, especially under class-imbalanced conditions.
arXiv Detail & Related papers (2025-07-29T00:03:28Z) - Qieemo: Speech Is All You Need in the Emotion Recognition in Conversations [1.0690007351232649]
Multimodal approaches benefit from the fusion of diverse modalities, thereby improving recognition accuracy. The proposed Qieemo framework effectively utilizes a pretrained automatic speech recognition (ASR) model, which contains naturally frame-aligned textual and emotional features. The experimental results on the IEMOCAP dataset demonstrate that Qieemo outperforms the benchmark unimodal, multimodal, and self-supervised models with absolute improvements of 3.0%, 1.2%, and 1.9%, respectively.
arXiv Detail & Related papers (2025-03-05T07:02:30Z) - RAMer: Reconstruction-based Adversarial Model for Multi-party Multi-modal Multi-label Emotion Recognition [20.12929002385256]
We propose RAMer (Reconstruction-based Adversarial Model for Emotion Recognition), which refines multi-modal representations by exploring modality commonality and specificity. RAMer achieves state-of-the-art performance in dyadic and multi-party MMER scenarios.
arXiv Detail & Related papers (2025-02-09T07:46:35Z) - EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings. EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
arXiv Detail & Related papers (2024-10-02T23:00:31Z) - DualKanbaFormer: An Efficient Selective Sparse Framework for Multimodal Aspect-based Sentiment Analysis [0.6187939267100836]
We introduce DualKanbaFormer, a novel framework that leverages parallel Textual and Visual KanbaFormer modules for robust multimodal analysis. Our approach incorporates Aspect-Driven Sparse Attention (ADSA) to balance coarse-grained aggregation and fine-grained selection for aspect-focused precision. We replace traditional feed-forward networks and normalization with Kolmogorov-Arnold Networks (KANs) and Dynamic Tanh (DyT) to enhance non-linear expressivity and inference stability.
arXiv Detail & Related papers (2024-08-27T19:33:15Z) - Leveraging Entity Information for Cross-Modality Correlation Learning: The Entity-Guided Multimodal Summarization [49.08348604716746]
Multimodal Summarization with Multimodal Output (MSMO) aims to produce a multimodal summary that integrates both text and relevant images.
In this paper, we propose an Entity-Guided Multimodal Summarization model (EGMS).
Our model, building on BART, utilizes dual multimodal encoders with shared weights to process text-image and entity-image information concurrently.
arXiv Detail & Related papers (2024-08-06T12:45:56Z) - AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to address the fusion of deep multimodal features. Experiments conducted with our AIMDiT framework on the public benchmark dataset MELD reveal improvements of 2.34% and 2.87% in the Acc-7 and w-F1 metrics, respectively.
arXiv Detail & Related papers (2024-04-12T11:31:18Z) - Joint Multimodal Transformer for Emotion Recognition in the Wild [49.735299182004404]
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems.
This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention.
arXiv Detail & Related papers (2024-03-15T17:23:38Z) - Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training [87.69394953339238]
Masked image modeling (MIM) has recently been introduced for fine-grained cross-modal alignment.
We propose a semantics-enhanced cross-modal MIM framework (SemMIM) for vision-language representation learning.
arXiv Detail & Related papers (2024-03-01T03:25:58Z) - Hierarchical Audio-Visual Information Fusion with Multi-label Joint Decoding for MER 2023 [51.95161901441527]
In this paper, we propose a novel framework for recognizing both discrete and dimensional emotions.
Deep features extracted from foundation models are used as robust acoustic and visual representations of raw video.
Our final system achieves state-of-the-art performance and ranks third on the leaderboard of the MER-MULTI sub-challenge.
arXiv Detail & Related papers (2023-09-11T03:19:10Z) - Cross-Language Speech Emotion Recognition Using Multimodal Dual Attention Transformers [5.538923337818467]
State-of-the-art systems are unable to achieve improved performance in cross-language settings.
We propose a Multimodal Dual Attention Transformer model to improve cross-language SER.
arXiv Detail & Related papers (2023-06-23T22:38:32Z) - A cross-modal fusion network based on self-attention and residual structure for multimodal emotion recognition [7.80238628278552]
We propose a novel cross-modal fusion network based on self-attention and residual structure (CFN-SR) for multimodal emotion recognition.
To verify the effectiveness of the proposed method, we conduct experiments on the RAVDESS dataset.
The experimental results show that the proposed CFN-SR achieves state-of-the-art performance, obtaining 75.76% accuracy with 26.30M parameters.
arXiv Detail & Related papers (2021-11-03T12:24:03Z)