DRKF: Decoupled Representations with Knowledge Fusion for Multimodal Emotion Recognition
- URL: http://arxiv.org/abs/2508.01644v1
- Date: Sun, 03 Aug 2025 08:05:57 GMT
- Title: DRKF: Decoupled Representations with Knowledge Fusion for Multimodal Emotion Recognition
- Authors: Peiyuan Jiang, Yao Liu, Qiao Liu, Zongshun Zhang, Jiaye Yang, Lu Liu, Daibing Yao,
- Abstract summary: We propose a Decoupled Representations with Knowledge Fusion (DRKF) method for multimodal emotion recognition.<n>DRKF consists of two main modules: an Optimized Representation Learning (ORL) Module and a Knowledge Fusion (KF) Module.<n>Experiments show that DRKF achieves state-of-the-art (SOTA) performance on IEMOCAP, MELD, and M3ED.
- Score: 5.765485747592163
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal emotion recognition (MER) aims to identify emotional states by integrating and analyzing information from multiple modalities. However, inherent modality heterogeneity and inconsistencies in emotional cues remain key challenges that hinder performance. To address these issues, we propose a Decoupled Representations with Knowledge Fusion (DRKF) method for MER. DRKF consists of two main modules: an Optimized Representation Learning (ORL) Module and a Knowledge Fusion (KF) Module. ORL employs a contrastive mutual information estimation method with progressive modality augmentation to decouple task-relevant shared representations and modality-specific features while mitigating modality heterogeneity. KF includes a lightweight self-attention-based Fusion Encoder (FE) that identifies the dominant modality and integrates emotional information from other modalities to enhance the fused representation. To handle potential errors from incorrect dominant modality selection under emotionally inconsistent conditions, we introduce an Emotion Discrimination Submodule (ED), which enforces the fused representation to retain discriminative cues of emotional inconsistency. This ensures that even if the FE selects an inappropriate dominant modality, the Emotion Classification Submodule (EC) can still make accurate predictions by leveraging preserved inconsistency information. Experiments show that DRKF achieves state-of-the-art (SOTA) performance on IEMOCAP, MELD, and M3ED. The source code is publicly available at https://github.com/PANPANKK/DRKF.
Related papers
- Divide and Refine: Enhancing Multimodal Representation and Explainability for Emotion Recognition in Conversation [2.5884126726585777]
Multimodal emotion recognition in conversation requires representations that integrate signals from multiple modalities.<n>Recent advances in contrastive learning and augmentation-based methods have made progress, but they often overlook the role of data preparation in preserving these components.<n>We propose a two-phase framework emphtextbfDivide and textbfRefine (textbfDnR)<n>These results highlight the effectiveness of explicitly dividing, refining, and recombining multimodal representations as a principled strategy for advancing emotion recognition.
arXiv Detail & Related papers (2026-01-10T07:30:20Z) - CKDA: Cross-modality Knowledge Disentanglement and Alignment for Visible-Infrared Lifelong Person Re-identification [77.07028925223383]
Lifelong person Re-IDentification aims to match the same person employing continuously collected individual data from different scenarios.<n>To achieve continuous all-day person matching across day and night, Visible-Infrared Lifelong person Re-IDentification (VI-LReID) focuses on sequential training on data from visible and infrared modalities.<n>Existing methods typically exploit cross-modal knowledge distillation to alleviate the catastrophic forgetting of old knowledge.
arXiv Detail & Related papers (2025-11-19T01:30:29Z) - Federated Dialogue-Semantic Diffusion for Emotion Recognition under Incomplete Modalities [13.098852116759929]
We propose the Federated Dialogue-guided and Semantic-Consistent Diffusion (FedDISC) framework for missing-modality recovery.<n>FedDISC overcomes single-client reliance on modality completeness by federated aggregation of modality-specific diffusion models trained on clients.<n>Experiments on the IEMOCAP, CMUMOSI, and CMUMOSEI datasets demonstrate that FedDISC achieves superior emotion classification performance across diverse missing modality patterns.
arXiv Detail & Related papers (2025-11-01T01:00:06Z) - Latent Distribution Decoupling: A Probabilistic Framework for Uncertainty-Aware Multimodal Emotion Recognition [7.25361375272096]
Multimodal multi-label emotion recognition aims to identify the concurrent presence of multiple emotions in multimodal data.<n>Existing studies overlook the impact of textbfaleatoric uncertainty, which is the inherent noise in the multimodal data.<n>This paper proposes Latent emotional Distribution Decomposition with Uncertainty perception framework.
arXiv Detail & Related papers (2025-02-19T18:53:23Z) - Progressively Modality Freezing for Multi-Modal Entity Alignment [27.77877721548588]
We propose a novel strategy of progressive modality freezing, called PMF, that focuses on alignmentrelevant features.
Notably, our approach introduces a pioneering cross-modal association loss to foster modal consistency.
Empirical evaluations across nine datasets confirm PMF's superiority.
arXiv Detail & Related papers (2024-07-23T04:22:30Z) - Modality Prompts for Arbitrary Modality Salient Object Detection [57.610000247519196]
This paper delves into the task of arbitrary modality salient object detection (AM SOD)
It aims to detect salient objects from arbitrary modalities, eg RGB images, RGB-D images, and RGB-D-T images.
A novel modality-adaptive Transformer (MAT) will be proposed to investigate two fundamental challenges of AM SOD.
arXiv Detail & Related papers (2024-05-06T11:02:02Z) - Joint Multimodal Transformer for Emotion Recognition in the Wild [49.735299182004404]
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems.
This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention.
arXiv Detail & Related papers (2024-03-15T17:23:38Z) - Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive Learning for Multimodal Emotion Recognition [14.639340916340801]
We propose a novel Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive for Multimodal Emotion Recognition (AR-IIGCN) method.
Firstly, we input video, audio, and text features into a multi-layer perceptron (MLP) to map them into separate feature spaces.
Secondly, we build a generator and a discriminator for the three modal features through adversarial representation.
Thirdly, we introduce contrastive graph representation learning to capture intra-modal and inter-modal complementary semantic information.
arXiv Detail & Related papers (2023-12-28T01:57:26Z) - Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical
Fusion for Multimodal Affect Recognition [69.32305810128994]
Incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition.
We propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model.
HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention.
arXiv Detail & Related papers (2023-05-23T01:24:15Z) - Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z) - Exploiting modality-invariant feature for robust multimodal emotion
recognition with missing modalities [76.08541852988536]
We propose to use invariant features for a missing modality imagination network (IF-MMIN)
We show that the proposed model outperforms all baselines and invariantly improves the overall emotion recognition performance under uncertain missing-modality conditions.
arXiv Detail & Related papers (2022-10-27T12:16:25Z) - SFusion: Self-attention based N-to-One Multimodal Fusion Block [6.059397373352718]
We propose a self-attention based fusion block called SFusion.
It learns to fuse available modalities without synthesizing or zero-padding missing ones.
In this work, we apply SFusion to different backbone networks for human activity recognition and brain tumor segmentation tasks.
arXiv Detail & Related papers (2022-08-26T16:42:14Z) - A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition [46.443866373546726]
We focus on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos.
We propose a joint cross-attention model that relies on the complementary relationships to extract the salient features.
Our proposed A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-28T14:09:43Z) - Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal
Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
Model takes two bimodal pairs as input due to known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.