Benchmarking and Bridging Emotion Conflicts for Multimodal Emotion Reasoning
- URL: http://arxiv.org/abs/2508.01181v2
- Date: Sat, 11 Oct 2025 07:15:43 GMT
- Title: Benchmarking and Bridging Emotion Conflicts for Multimodal Emotion Reasoning
- Authors: Zhiyuan Han, Beier Zhu, Yanlong Xu, Peipei Song, Xun Yang, et al.
- Abstract summary: We introduce CA-MER, a new benchmark designed to examine MLLMs under realistic emotion conflicts. Evaluations reveal that current state-of-the-art emotion MLLMs systematically over-rely on the audio signal during emotion conflicts. We propose MoSEAR, a parameter-efficient framework that promotes balanced modality integration.
- Score: 21.344503400857107
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite their strong performance in multimodal emotion reasoning, existing Multimodal Large Language Models (MLLMs) often overlook scenarios involving emotion conflicts, where emotional cues from different modalities are inconsistent. To fill this gap, we first introduce CA-MER, a new benchmark designed to examine MLLMs under realistic emotion conflicts. It consists of three subsets: video-aligned, audio-aligned, and consistent, in which only one modality (or all modalities) reflects the true emotion. Evaluations on CA-MER reveal that current state-of-the-art emotion MLLMs systematically over-rely on the audio signal during emotion conflicts, neglecting critical cues from the visual modality. To mitigate this bias, we propose MoSEAR, a parameter-efficient framework that promotes balanced modality integration. MoSEAR consists of two modules: (1) MoSE, modality-specific experts with a regularized gating mechanism that reduces modality bias in the fine-tuning heads; and (2) AR, an attention reallocation mechanism that rebalances modality contributions in frozen backbones during inference. The framework offers two key advantages: it mitigates emotion conflicts and improves performance on consistent samples, without incurring a trade-off between the audio and visual modalities. Experiments on multiple benchmarks, including MER2023, EMER, DFEW, and our CA-MER, demonstrate that MoSEAR achieves state-of-the-art performance, particularly under modality-conflict conditions.
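The abstract names MoSEAR's two modules only at a high level. As a concrete illustration, here is a minimal PyTorch sketch of one plausible reading: modality-specific experts mixed by a gate that is regularized toward balanced usage (MoSE), and an inference-time attention reallocation that pulls each modality's attention mass toward a uniform budget (AR). Every name here (`MoSEHead`, `reallocate_attention`), the KL-based gate regularizer, and the `alpha` coefficient are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of MoSE-style gated modality experts and AR-style attention
# reallocation. A hypothetical reading of the abstract, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoSEHead(nn.Module):
    """Modality-specific experts mixed by a gate; the gate is regularized
    toward uniform usage so no single modality dominates the head."""
    def __init__(self, dim: int, num_modalities: int = 3):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_modalities)])
        self.gate = nn.Linear(dim, num_modalities)

    def forward(self, feats):  # feats: list of (batch, dim) tensors, one per modality
        pooled = torch.stack(feats, dim=1).mean(dim=1)           # (batch, dim)
        weights = F.softmax(self.gate(pooled), dim=-1)           # (batch, M)
        expert_out = torch.stack(
            [expert(f) for expert, f in zip(self.experts, feats)], dim=1
        )                                                        # (batch, M, dim)
        fused = (weights.unsqueeze(-1) * expert_out).sum(dim=1)  # (batch, dim)
        # Regularizer: penalize gates that collapse onto one modality.
        uniform = torch.full_like(weights, 1.0 / weights.size(-1))
        gate_reg = F.kl_div(weights.log(), uniform, reduction="batchmean")
        return fused, gate_reg

def reallocate_attention(attn, modality_mask, alpha=0.5):
    """AR-style rebalancing at inference: blend each modality group's raw
    attention mass with a uniform per-modality budget, then renormalize.
    attn: (..., tokens) attention weights; modality_mask: (tokens,) modality
    ids; alpha: assumed blending coefficient."""
    out = torch.zeros_like(attn)
    num_mods = len(modality_mask.unique())
    for m in modality_mask.unique():
        sel = modality_mask == m
        mass = attn[..., sel].sum(dim=-1, keepdim=True)
        target = alpha * mass + (1 - alpha) / num_mods
        out[..., sel] = attn[..., sel] * target / mass.clamp_min(1e-8)
    return out / out.sum(dim=-1, keepdim=True)
```

In training, `gate_reg` would be added to the task loss with a small coefficient; `reallocate_attention` would be applied only to the frozen backbone's attention maps at inference, which is consistent with the abstract's parameter-efficiency claim.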
Related papers
- Integrating Fine-Grained Audio-Visual Evidence for Robust Multimodal Emotion Reasoning [9.470507126417292]
We introduce SABER-LLM, a framework designed for robust multimodal emotion reasoning. First, we construct SABER, a large-scale emotion reasoning dataset comprising 600K video clips. Second, we propose a structured evidence decomposition paradigm, which enforces a "perceive-then-reason" separation between evidence extraction and reasoning (see the sketch after this entry).
arXiv Detail & Related papers (2026-01-26T10:03:26Z)
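The "perceive-then-reason" separation suggests a two-stage prompting pipeline. Below is a minimal sketch under that assumption; `query_mllm` is a hypothetical stand-in for any chat-style MLLM call, not SABER-LLM's actual API.

```python
# Hypothetical two-stage "perceive-then-reason" pipeline (illustrative only).
def query_mllm(prompt: str, video_path: str | None = None) -> str:
    """Stand-in for a chat-style MLLM call; assumed, not SABER-LLM's API."""
    raise NotImplementedError

def perceive_then_reason(video_path: str) -> str:
    # Stage 1 (perceive): extract modality-grounded evidence only, no verdict.
    evidence = query_mllm(
        "List the facial, vocal, and contextual emotional cues you observe. "
        "Do not state an emotion label yet.",
        video_path=video_path,
    )
    # Stage 2 (reason): infer the emotion from the extracted evidence alone,
    # forcing the final judgment to cite perceptual evidence.
    return query_mllm(
        f"Given this evidence:\n{evidence}\n"
        "Reason step by step and output the most likely emotion."
    )
```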
- Emotion-LLaMAv2 and MMEVerse: A New Framework and Benchmark for Multimodal Emotion Understanding [45.13650362585136]
We present Emotion-LLaMAv2 and the MMEVerse benchmark, establishing an end-to-end pipeline together with a standardized evaluation setting for emotion recognition and reasoning. An end-to-end multiview encoder eliminates external face detection and captures nuanced emotional cues via richer spatial and temporal multiview tokens. A perception-to-cognition curriculum instruction tuning scheme within the LLaMA2 backbone unifies emotion recognition and free-form emotion reasoning.
arXiv Detail & Related papers (2026-01-23T05:02:43Z)
- TiCAL: Typicality-Based Consistency-Aware Learning for Multimodal Emotion Recognition [31.4260327895046]
Multimodal Emotion Recognition aims to accurately identify human emotional states by integrating heterogeneous modalities such as visual, auditory, and textual data. Existing approaches predominantly rely on unified emotion labels to supervise model training, often overlooking a critical challenge: inter-modal emotion conflicts. We propose Typicality-based Consistency-aware Multimodal Emotion Recognition (TiCAL), inspired by the stage-wise nature of human emotion perception (see the sketch after this entry).
arXiv Detail & Related papers (2025-11-19T03:49:22Z)
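The TiCAL summary does not spell out how typicality is used; one plausible reading, sketched below, estimates how typical each modality's feature is for its class and down-weights atypical (likely conflicting) modalities in the loss. The centroid-distance measure and softmax temperature are assumptions, not TiCAL's published formulation.

```python
# Hypothetical typicality-weighted supervision (illustrative, not TiCAL itself).
import torch
import torch.nn.functional as F

def typicality_weights(feats, centroids, labels, temp=1.0):
    """feats: (batch, M, dim) per-modality features; centroids: (classes, M, dim)
    running class centroids; labels: (batch,). A modality far from its class
    centroid is treated as atypical and down-weighted."""
    class_centroids = centroids[labels]             # (batch, M, dim)
    dist = (feats - class_centroids).norm(dim=-1)   # (batch, M)
    return F.softmax(-dist / temp, dim=-1)          # (batch, M)

def typicality_weighted_loss(logits_per_mod, weights, labels):
    """logits_per_mod: (batch, M, classes). Each modality's cross-entropy is
    scaled by its typicality weight, so conflicting modalities contribute less."""
    b, m, c = logits_per_mod.shape
    ce = F.cross_entropy(
        logits_per_mod.reshape(b * m, c),
        labels.repeat_interleave(m),
        reduction="none",
    ).reshape(b, m)
    return (weights * ce).sum(dim=-1).mean()
```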
- Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations [94.62792643569567]
This work systematically investigates the role of speaker emotion in the safety behavior of large audio-language models (LALMs). We construct a dataset of malicious speech instructions expressed across multiple emotions and intensities, and evaluate several state-of-the-art LALMs. Our results reveal substantial safety inconsistencies: different emotions elicit varying levels of unsafe responses, and the effect of intensity is non-monotonic, with medium expressions often posing the greatest risk.
arXiv Detail & Related papers (2025-10-19T15:41:25Z)
- MME-Emotion: A Holistic Evaluation Benchmark for Emotional Intelligence in Multimodal Large Language Models [108.61337743051483]
We present MME-Emotion, a systematic benchmark that assesses both the emotional understanding and reasoning capabilities of MLLMs. MME-Emotion contains over 6,000 curated video clips with task-specific question-answering (QA) pairs, spanning broad scenarios to formulate eight emotional tasks. It incorporates a holistic evaluation suite with hybrid metrics for emotion recognition and reasoning, analyzed through a multi-agent system framework.
arXiv Detail & Related papers (2025-08-11T03:14:55Z)
- Robust Multimodal Large Language Models Against Modality Conflict [94.12341487880465]
Multimodal large language models (MLLMs) are prone to hallucinations in real-world scenarios. We study the inherent conflicts in inputs from different modalities that place MLLMs in a dilemma and directly lead to hallucinations. Three methods are proposed to alleviate the hallucinations caused by modality conflict.
arXiv Detail & Related papers (2025-07-09T11:18:38Z)
- GatedxLSTM: A Multimodal Affective Computing Approach for Emotion Recognition in Conversations [35.63053777817013]
GatedxLSTM is a novel multimodal Emotion Recognition in Conversation (ERC) model. It considers voice and transcripts of both the speaker and their conversational partner to identify the most influential sentences driving emotional shifts. It achieves state-of-the-art (SOTA) performance among open-source methods in four-class emotion classification.
arXiv Detail & Related papers (2025-03-26T18:46:18Z)
- RAMer: Reconstruction-based Adversarial Model for Multi-party Multi-modal Multi-label Emotion Recognition [10.994464649878926]
We propose RAMer (Reconstruction-based Adversarial Model for Emotion Recognition) to refine multi-modal representations. We show that RAMer achieves state-of-the-art performance in dyadic and multi-party MMER scenarios.
arXiv Detail & Related papers (2025-02-09T07:46:35Z)
- Tracing Intricate Cues in Dialogue: Joint Graph Structure and Sentiment Dynamics for Multimodal Emotion Recognition [37.12407597998884]
A novel approach named GraphSmile is proposed for tracking intricate emotional cues in multimodal dialogues. GraphSmile comprises two key components, i.e., the GSF and SDP modules. Empirical results on multiple benchmarks demonstrate that GraphSmile can handle complex emotional and sentimental patterns.
arXiv Detail & Related papers (2024-07-31T11:47:36Z)
- Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning [55.127202990679976]
We introduce the MERR dataset, containing 28,618 coarse-grained and 4,487 fine-grained annotated samples across diverse emotional categories.
This dataset enables models to learn from varied scenarios and generalize to real-world applications.
We propose Emotion-LLaMA, a model that seamlessly integrates audio, visual, and textual inputs through emotion-specific encoders.
arXiv Detail & Related papers (2024-06-17T03:01:22Z)
- UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts [64.02363948840333]
UMETTS is a novel framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech. EP-Align employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information (see the sketch after this entry). EMI-TTS integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions.
arXiv Detail & Related papers (2024-04-29T03:19:39Z)
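EP-Align's cross-modal contrastive alignment can be illustrated with a standard symmetric InfoNCE loss over paired embeddings; the formulation and temperature below are generic assumptions about "contrastive learning to align emotional features", not UMETTS's published objective.

```python
# Generic symmetric InfoNCE alignment between two modalities (illustrative).
import torch
import torch.nn.functional as F

def info_nce_align(z_a, z_b, temp=0.07):
    """z_a, z_b: (batch, dim) embeddings of the same utterances from two
    modalities (e.g., text and audio). Matched pairs are pulled together;
    mismatched pairs within the batch are pushed apart."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temp              # (batch, batch) similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Symmetric loss: align a -> b and b -> a.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

With three modalities, the same loss would be applied pairwise (text-audio, text-visual, audio-visual) and summed.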
- UniMEEC: Towards Unified Multimodal Emotion Recognition and Emotion Cause [18.99103120856208]
We propose a Unified Multimodal Emotion recognition and Emotion-Cause analysis framework (UniMEEC) to explore the causality between emotion and emotion cause.
UniMEEC reformulates the MERC and MECPE tasks as mask prediction problems and unifies them with a causal prompt template.
Experimental results on four public benchmark datasets verify the model's performance on the MERC and MECPE tasks (see the prompt sketch after this entry).
arXiv Detail & Related papers (2024-03-30T15:59:17Z)
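Casting both tasks as mask prediction with a shared causal prompt template might look like the sketch below; the wording and the `<mask>` token are assumptions for illustration, not the template from the paper.

```python
# Hypothetical unified mask-prediction prompts (illustrative wording only).
MERC_TEMPLATE = (
    "Context: {dialogue}\n"
    "The emotion of utterance {u} is <mask>."   # emotion recognition
)
MECPE_TEMPLATE = (
    "Context: {dialogue}\n"
    "The emotion of utterance {u} is {emotion} because of utterance <mask>."
)

def build_prompt(task: str, **fields) -> str:
    template = MERC_TEMPLATE if task == "merc" else MECPE_TEMPLATE
    return template.format(**fields)

# Both tasks reduce to filling the <mask> slot with a single model.
print(build_prompt("merc", dialogue="A: I lost my keys. B: Oh no!", u="B"))
```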
- MSAC: Multiple Speech Attribute Control Method for Reliable Speech Emotion Recognition [7.81011775615268]
We introduce MSAC-SERNet, a novel unified SER framework capable of simultaneously handling both single-corpus and cross-corpus SER.
Considering information overlap between various speech attributes, we propose a novel learning paradigm based on correlations of different speech attributes.
Experiments on both single-corpus and cross-corpus SER scenarios indicate that MSAC-SERNet achieves superior performance compared to state-of-the-art SER approaches.
arXiv Detail & Related papers (2023-08-08T03:43:24Z)
- Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical Fusion for Multimodal Affect Recognition [69.32305810128994]
Incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition.
We propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model.
HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention (see the gating sketch after this entry).
arXiv Detail & Related papers (2023-05-23T01:24:15Z)
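Dynamic modality gating, as named in the HCT-DMG summary, can be sketched as a learned per-sample gate that decides how much a crossmodal attention output should override the unimodal stream; the sigmoid gate and residual mixing below are generic assumptions rather than HCT-DMG's exact design.

```python
# Generic dynamic modality gate over a crossmodal residual (illustrative).
import torch
import torch.nn as nn

class DynamicModalityGate(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())

    def forward(self, unimodal, crossmodal):
        # unimodal, crossmodal: (batch, dim). A gate near 0 keeps the unimodal
        # stream when the other modality looks incongruent; near 1 trusts the
        # crossmodal evidence.
        g = self.gate(torch.cat([unimodal, crossmodal], dim=-1))
        return g * crossmodal + (1 - g) * unimodal
```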
- TSAM: A Two-Stream Attention Model for Causal Emotion Entailment [50.07800752967995]
Causal Emotion Entailment (CEE) aims to discover the potential causes behind an emotion in a conversational utterance.
We classify multiple utterances synchronously to capture the correlations between utterances in a global view.
We propose a Two-Stream Attention Model (TSAM) to effectively model the speaker's emotional influences in the conversational history (see the sketch after this entry).
arXiv Detail & Related papers (2022-03-02T02:11:41Z)
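A two-stream speaker-aware attention, as described for TSAM, can be sketched by splitting the conversational history into same-speaker and other-speaker streams and attending over each separately before merging; the mask construction and the averaging merge below are assumptions, not the paper's architecture.

```python
# Hypothetical two-stream (intra-/inter-speaker) attention (illustrative).
import torch
import torch.nn as nn

class TwoStreamAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.intra = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, utts, speakers):
        # utts: (batch, turns, dim); speakers: (batch, turns) integer speaker ids.
        # Assumes each dialogue has at least two speakers, so neither stream's
        # attention rows are fully masked.
        same = speakers.unsqueeze(2) == speakers.unsqueeze(1)  # (b, t, t)
        # In nn.MultiheadAttention, attn_mask entries that are True are blocked.
        mask_intra = (~same).repeat_interleave(self.intra.num_heads, dim=0)
        mask_inter = same.repeat_interleave(self.inter.num_heads, dim=0)
        h_intra, _ = self.intra(utts, utts, utts, attn_mask=mask_intra)
        h_inter, _ = self.inter(utts, utts, utts, attn_mask=mask_inter)
        return 0.5 * (h_intra + h_inter)  # simple merge of the two streams
```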
arXiv Detail & Related papers (2022-03-02T02:11:41Z) - Modality-Transferable Emotion Embeddings for Low-Resource Multimodal
Emotion Recognition [55.44502358463217]
We propose a modality-transferable model with emotion embeddings to tackle the aforementioned issues.
Our model achieves state-of-the-art performance on most of the emotion categories.
Our model also outperforms existing baselines in the zero-shot and few-shot scenarios for unseen emotions.
arXiv Detail & Related papers (2020-09-21T06:10:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.