Cross-Space Synergy: A Unified Framework for Multimodal Emotion Recognition in Conversation
- URL: http://arxiv.org/abs/2512.03521v3
- Date: Wed, 10 Dec 2025 08:01:19 GMT
- Title: Cross-Space Synergy: A Unified Framework for Multimodal Emotion Recognition in Conversation
- Authors: Xiaosen Lyu, Jiayu Xiong, Yuren Chen, Wanlong Wang, Xiaoqing Dai, Jing Wang,
- Abstract summary: Multimodal Emotion Recognition in Conversation aims to predict speakers' emotions by integrating textual, acoustic, and visual cues.<n>Existing approaches either struggle to capture complex cross-modal interactions or experience gradient conflicts and unstable training.<n>We propose Cross-Space Synergy (CSS), which couples a representation component with an optimization component.
- Score: 2.237205141501017
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Emotion Recognition in Conversation (MERC) aims to predict speakers' emotions by integrating textual, acoustic, and visual cues. Existing approaches either struggle to capture complex cross-modal interactions or experience gradient conflicts and unstable training when using deeper architectures. To address these issues, we propose Cross-Space Synergy (CSS), which couples a representation component with an optimization component. Synergistic Polynomial Fusion (SPF) serves the representation role, leveraging low-rank tensor factorization to efficiently capture high-order cross-modal interactions. Pareto Gradient Modulator (PGM) serves the optimization role, steering updates along Pareto-optimal directions across competing objectives to alleviate gradient conflicts and improve stability. Experiments show that CSS outperforms existing representative methods on IEMOCAP and MELD in both accuracy and training stability, demonstrating its effectiveness in complex multimodal scenarios.
Related papers
- Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models [67.45032003041399]
We propose a novel Multi-Paradigm Collaborative Attack (MPCAttack) framework to boost the transferability of adversarial examples against MLLMs.<n>MPCO adaptively balances the importance of different paradigm representations and guides the global optimisation.<n>Our solution consistently outperforms state-of-the-art methods in both targeted and untargeted attacks on open-source and closed-source MLLMs.
arXiv Detail & Related papers (2026-03-05T06:01:26Z) - CtrlFuse: Mask-Prompt Guided Controllable Infrared and Visible Image Fusion [51.060328159429154]
Infrared and visible image fusion generates all-weather perception-capable images by combining complementary modalities.<n>We propose CtrlFuse, a controllable image fusion framework that enables interactive dynamic fusion guided by mask prompts.<n> Experiments demonstrate state-of-the-art results in both fusion controllability and segmentation accuracy, with the adapted task branch even outperforming the original segmentation model.
arXiv Detail & Related papers (2026-01-12T13:36:48Z) - Multi-label Classification with Panoptic Context Aggregation Networks [61.82285737410154]
This paper introduces the Deep Panoptic Context Aggregation Network (PanCAN), a novel approach that hierarchically integrates multi-order geometric contexts.<n>PanCAN learns multi-order neighborhood relationships at each scale by combining random walks with an attention mechanism.<n>Experiments on NUS-WIDE, PASCAL VOC,2007, and MS-COCO benchmarks demonstrate that PanCAN consistently achieves competitive results.
arXiv Detail & Related papers (2025-12-29T14:16:21Z) - MMMamba: A Versatile Cross-Modal In Context Fusion Framework for Pan-Sharpening and Zero-Shot Image Enhancement [29.94979992704961]
Pan-sharpening aims to generate high-resolution multispectral (HRMS) images by integrating a high-resolution panchromatic (PAN) image with its corresponding low-resolution multispectral (MS) image.<n>Traditional CNN-based methods rely on channel-wise concatenation with fixed convolutional operators.<n>We propose MMMamba, a cross-modal in-context fusion framework for pan-sharpening.
arXiv Detail & Related papers (2025-12-17T10:07:09Z) - Sync-TVA: A Graph-Attention Framework for Multimodal Emotion Recognition with Cross-Modal Fusion [7.977094562068075]
We propose Sync-TVA, an end-to-end graph-attention framework featuring modality-specific dynamic enhancement and structured cross-modal fusion.<n>Our design incorporates a dynamic enhancement module for each modality and constructs heterogeneous cross-modal graphs to model semantic relations across text, audio, and visual features.<n> Experiments on MELD and IEMOCAP demonstrate consistent improvements over state-of-the-art models in both accuracy and weighted F1 score, especially under class-imbalanced conditions.
arXiv Detail & Related papers (2025-07-29T00:03:28Z) - CLAMP: Contrastive Learning with Adaptive Multi-loss and Progressive Fusion for Multimodal Aspect-Based Sentiment Analysis [0.6961946145048322]
This paper introduces an end to end Contrastive Learning framework with Adaptive Multi-loss and Progressive Attention Fusion.<n>The framework is composed of three novel modules: Progressive Attention Fusion network, Multi-task Contrastive Learning, and Adaptive Multi-loss Aggregation.<n> evaluation on standard public benchmarks demonstrates that CLAMP consistently outperforms the vast majority of existing state of the art methods.
arXiv Detail & Related papers (2025-07-21T11:49:57Z) - Optimizing Speech Multi-View Feature Fusion through Conditional Computation [51.23624575321469]
Self-supervised learning (SSL) features provide lightweight and versatile multi-view speech representations.<n> SSL features conflict with traditional spectral features like FBanks in terms of update directions.<n>We propose a novel generalized feature fusion framework grounded in conditional computation.
arXiv Detail & Related papers (2025-01-14T12:12:06Z) - Hierarchical Banzhaf Interaction for General Video-Language Representation Learning [60.44337740854767]
Multimodal representation learning plays an important role in the artificial intelligence domain.<n>We introduce a new approach that models video-text as game players using multivariate cooperative game theory.<n>We extend our original structure into a flexible encoder-decoder framework, enabling the model to adapt to various downstream tasks.
arXiv Detail & Related papers (2024-12-30T14:09:15Z) - Enhancing Multimodal Emotion Recognition through Multi-Granularity Cross-Modal Alignment [10.278127492434297]
This paper introduces a Multi-Granularity Cross-Modal Alignment (MGCMA) framework, distinguished by its comprehensive approach encompassing distribution-based, instance-based, and token-based alignment modules.<n>Our experiments on IEMOCAP demonstrate that our proposed method outperforms current state-of-the-art techniques.
arXiv Detail & Related papers (2024-12-30T09:30:41Z) - Effective Context Modeling Framework for Emotion Recognition in Conversations [2.7175580940471913]
Emotion Recognition in Conversations (ERC) facilitates a deeper understanding of the emotions conveyed by speakers in each utterance within a conversation.<n>Recent Graph Neural Networks (GNNs) have demonstrated their strengths in capturing data relationships.<n>We propose ConxGNN, a novel GNN-based framework designed to capture contextual information in conversations.
arXiv Detail & Related papers (2024-12-21T02:22:06Z) - AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features.
Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics.
arXiv Detail & Related papers (2024-04-12T11:31:18Z) - EffMulti: Efficiently Modeling Complex Multimodal Interactions for
Emotion Analysis [8.941102352671198]
We design three kinds of latent representations to refine the emotion analysis process.
A modality-semantic hierarchical fusion is proposed to reasonably incorporate these representations into a comprehensive interaction representation.
The experimental results demonstrate that our EffMulti outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2022-12-16T03:05:55Z) - MMLatch: Bottom-up Top-down Fusion for Multimodal Sentiment Analysis [84.7287684402508]
Current deep learning approaches for multimodal fusion rely on bottom-up fusion of high and mid-level latent modality representations.
Models of human perception highlight the importance of top-down fusion, where high-level representations affect the way sensory inputs are perceived.
We propose a neural architecture that captures top-down cross-modal interactions, using a feedback mechanism in the forward pass during network training.
arXiv Detail & Related papers (2022-01-24T17:48:04Z) - Group Gated Fusion on Attention-based Bidirectional Alignment for
Multimodal Emotion Recognition [63.07844685982738]
This paper presents a new model named as Gated Bidirectional Alignment Network (GBAN), which consists of an attention-based bidirectional alignment network over LSTM hidden states.
We empirically show that the attention-aligned representations outperform the last-hidden-states of LSTM significantly.
The proposed GBAN model outperforms existing state-of-the-art multimodal approaches on the IEMOCAP dataset.
arXiv Detail & Related papers (2022-01-17T09:46:59Z) - VIRT: Improving Representation-based Models for Text Matching through
Virtual Interaction [50.986371459817256]
We propose a novel textitVirtual InteRacTion mechanism, termed as VIRT, to enable full and deep interaction modeling in representation-based models.
VIRT asks representation-based encoders to conduct virtual interactions to mimic the behaviors as interaction-based models do.
arXiv Detail & Related papers (2021-12-08T09:49:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.