HeLo: Heterogeneous Multi-Modal Fusion with Label Correlation for Emotion Distribution Learning
- URL: http://arxiv.org/abs/2507.06821v3
- Date: Sat, 26 Jul 2025 16:52:02 GMT
- Title: HeLo: Heterogeneous Multi-Modal Fusion with Label Correlation for Emotion Distribution Learning
- Authors: Chuhang Zheng, Chunwei Tian, Jie Wen, Daoqiang Zhang, Qi Zhu
- Abstract summary: We propose a multi-modal emotion distribution learning framework, named HeLo, to explore the heterogeneity and complementary information in multi-modal emotional data. Experimental results on two publicly available datasets demonstrate the superiority of our proposed method in emotion distribution learning.
- Score: 25.95933218051548
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-modal emotion recognition has garnered increasing attention in recent years, as it plays a significant role in human-computer interaction (HCI). Since multiple discrete emotions can be present at the same time, emotion distribution learning (EDL), which identifies a mixture of basic emotions, has gradually emerged as a trend over single-class emotion recognition. However, existing EDL methods face challenges in mining the heterogeneity among multiple modalities. Besides, the rich semantic correlations across arbitrary basic emotions are not fully exploited. In this paper, we propose a multi-modal emotion distribution learning framework, named HeLo, aimed at fully exploring the heterogeneity and complementary information in multi-modal emotional data as well as the label correlation within mixed basic emotions. Specifically, we first adopt cross-attention to effectively fuse the physiological data. Then, an optimal transport (OT)-based heterogeneity mining module is devised to mine the interaction and heterogeneity between the physiological and behavioral representations. To facilitate label correlation learning, we introduce a learnable label embedding optimized by correlation matrix alignment. Finally, the learnable label embeddings and label correlation matrices are integrated with the multi-modal representations through a novel label correlation-driven cross-attention mechanism for accurate emotion distribution learning. Experimental results on two publicly available datasets demonstrate the superiority of the proposed method in emotion distribution learning.
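To make the four-stage pipeline in the abstract concrete, below is a minimal PyTorch sketch of how such a model could be wired together: cross-attention fusion of physiological signals, an optimal-transport alignment between physiological and behavioral representations, learnable label embeddings, and label-correlation-driven cross-attention. Everything here is an assumption for illustration: the names (HeLoSketch, sinkhorn, phys_fuse), the dimensions, the cosine cost, and the choice of an entropy-regularized Sinkhorn solver are not taken from the paper, which only states that the heterogeneity mining module is OT-based.

```python
# A minimal sketch, not the authors' implementation. All module names,
# dimensions, and the Sinkhorn solver below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def sinkhorn(cost, n_iters=50, eps=0.1):
    """Entropy-regularized OT plan between uniform marginals (one common
    choice; the paper only says its mining module is OT-based)."""
    n, m = cost.shape
    K = torch.exp(-cost / eps)                        # Gibbs kernel
    v = torch.full((m,), 1.0 / m, device=cost.device)
    for _ in range(n_iters):
        u = (1.0 / n) / (K @ v)                       # scale rows
        v = (1.0 / m) / (K.t() @ u)                   # scale columns
    return u.unsqueeze(1) * K * v.unsqueeze(0)        # diag(u) K diag(v)


class HeLoSketch(nn.Module):
    def __init__(self, d=128, n_emotions=8, n_heads=4):
        super().__init__()
        # Stage 1: cross-attention fusion of physiological streams.
        self.phys_fuse = nn.MultiheadAttention(d, n_heads, batch_first=True)
        # Stage 3: one learnable embedding per basic emotion.
        self.label_emb = nn.Parameter(torch.randn(n_emotions, d))
        # Stage 4: label-correlation-driven cross-attention.
        self.label_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.head = nn.Linear(d, 1)

    def forward(self, eeg, ecg, behav):
        # eeg, ecg, behav: (B, T, d) token sequences from modality encoders.
        phys, _ = self.phys_fuse(eeg, ecg, ecg)       # fuse physiological data

        # Stage 2: OT-based interaction between physiological and behavioral
        # representations, using a cosine cost over batch-averaged tokens.
        p, b = phys.mean(0), behav.mean(0)            # (T, d) prototypes
        cost = 1 - F.normalize(p, dim=-1) @ F.normalize(b, dim=-1).t()
        plan = sinkhorn(cost)
        ot_loss = (plan * cost).sum()                 # alignment penalty

        # Stages 3-4: label embeddings query the fused multi-modal tokens.
        tokens = torch.cat([phys, behav], dim=1)      # (B, T_p + T_b, d)
        q = self.label_emb.unsqueeze(0).expand(tokens.size(0), -1, -1)
        label_feats, _ = self.label_attn(q, tokens, tokens)

        # Emotion distribution over the basic emotions.
        dist = F.softmax(self.head(label_feats).squeeze(-1), dim=-1)

        # Gram matrix of the label embeddings; per the abstract, this is
        # optimized by alignment with a label correlation matrix.
        emb = F.normalize(self.label_emb, dim=-1)
        return dist, ot_loss, emb @ emb.t()
```

During training, one would presumably supervise `dist` against the ground-truth emotion distribution (e.g., with a KL-divergence loss), add `ot_loss` as an alignment term, and push the returned Gram matrix toward a prior label-correlation matrix; the paper's exact losses and weights are not reproduced here.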
Related papers
- UniEmo: Unifying Emotional Understanding and Generation with Learnable Expert Queries [61.5273479616832]
We propose a unified framework that seamlessly integrates emotional understanding and generation.
We show that UniEmo significantly outperforms state-of-the-art methods in both emotional understanding and generation tasks.
arXiv Detail & Related papers (2025-07-31T09:39:27Z)
- Disentangle Identity, Cooperate Emotion: Correlation-Aware Emotional Talking Portrait Generation [63.94836524433559]
DICE-Talk is a framework that disentangles identity from emotion and cooperates emotions with similar characteristics.
First, we develop a disentangled emotion embedder that jointly models audio-visual emotional cues through cross-modal attention.
Second, we introduce a correlation-enhanced emotion conditioning module with learnable Emotion Banks.
Third, we design an emotion discrimination objective that enforces affective consistency during the diffusion process.
arXiv Detail & Related papers (2025-04-25T05:28:21Z)
- Hierarchical Hypercomplex Network for Multimodal Emotion Recognition [9.54382727022316]
We introduce a fully hypercomplex network with a hierarchical learning structure to fully capture correlations.
The proposed architecture surpasses state-of-the-art models on the MAHNOB-HCI dataset for emotion recognition.
arXiv Detail & Related papers (2024-09-13T21:07:49Z)
- Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive Learning for Multimodal Emotion Recognition [14.639340916340801]
We propose a novel Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive Learning for Multimodal Emotion Recognition (AR-IIGCN) method.
Firstly, we input video, audio, and text features into a multi-layer perceptron (MLP) to map them into separate feature spaces.
Secondly, we build a generator and a discriminator for the three modal features through adversarial representation.
Thirdly, we introduce contrastive graph representation learning to capture intra-modal and inter-modal complementary semantic information.
arXiv Detail & Related papers (2023-12-28T01:57:26Z)
- A Hierarchical Regression Chain Framework for Affective Vocal Burst Recognition [72.36055502078193]
We propose a hierarchical framework, based on chain regression models, for affective recognition from vocal bursts.
To address the challenge of data sparsity, we also use self-supervised learning (SSL) representations with layer-wise and temporal aggregation modules.
The proposed systems participated in the ACII Affective Vocal Burst (A-VB) Challenge 2022 and ranked first in the "TWO" and "CULTURE" tasks.
arXiv Detail & Related papers (2023-03-14T16:08:45Z)
- Seeking Subjectivity in Visual Emotion Distribution Learning [93.96205258496697]
Visual Emotion Analysis (VEA) aims to predict people's emotions towards different visual stimuli.
Existing methods often predict visual emotion distribution in a unified network, neglecting the inherent subjectivity in its crowd voting process.
We propose a novel Subjectivity Appraise-and-Match Network (SAMNet) to investigate the subjectivity in visual emotion distribution.
arXiv Detail & Related papers (2022-07-25T02:20:03Z)
- Attentive Cross-modal Connections for Deep Multimodal Wearable-based Emotion Recognition [7.559720049837459]
We present a novel attentive cross-modal connection to share information between convolutional neural networks.
Specifically, these connections improve emotion classification by sharing intermediate representations between EDA and ECG.
Our experiments show that the proposed approach is capable of learning strong multimodal representations and outperforms a number of baseline methods.
arXiv Detail & Related papers (2021-08-04T18:40:32Z)
- Label Distribution Amendment with Emotional Semantic Correlations for Facial Expression Recognition [69.18918567657757]
We propose a new method that amends the label distribution of each facial image by leveraging correlations among expressions in the semantic space.
By comparing semantic and task class-relation graphs of each image, the confidence of its label distribution is evaluated.
Experimental results demonstrate that the proposed method is more effective than the compared state-of-the-art methods.
arXiv Detail & Related papers (2021-07-23T07:46:14Z)
- A Circular-Structured Representation for Visual Emotion Distribution Learning [82.89776298753661]
We propose a well-grounded circular-structured representation to utilize the prior knowledge for visual emotion distribution learning.
To be specific, we first construct an Emotion Circle to unify any emotional state within it.
On the proposed Emotion Circle, each emotion distribution is represented with an emotion vector, which is defined with three attributes.
arXiv Detail & Related papers (2021-06-23T14:53:27Z)
- SpanEmo: Casting Multi-label Emotion Classification as Span-prediction [15.41237087996244]
We propose a new model, "SpanEmo", that casts multi-label emotion classification as span-prediction.
We introduce a loss function focused on modelling multiple co-existing emotions in the input sentence (a sketch of one such pairwise loss appears after this list).
Experiments performed on the SemEval2018 multi-label emotion data over three language sets demonstrate our method's effectiveness.
arXiv Detail & Related papers (2021-01-25T12:11:04Z)
- EmoGraph: Capturing Emotion Correlations using Graph Networks [71.53159402053392]
We propose EmoGraph that captures the dependencies among different emotions through graph networks.
EmoGraph outperforms strong baselines, especially for macro-F1.
An experiment illustrates that the captured emotion correlations can also benefit a single-label classification task.
arXiv Detail & Related papers (2020-08-21T08:59:29Z)
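As a concrete illustration of the kind of loss the SpanEmo entry above alludes to, here is a hedged sketch of a pairwise label-correlation-aware loss for multi-label emotion classification: every absent label's score is pushed below every present label's score. SpanEmo's exact formulation may differ; the function name and the zero-loss fallback are assumptions made for this sketch.

```python
# A hedged sketch of a pairwise label-correlation-aware loss for modelling
# co-existing emotions; illustrative only, not SpanEmo's exact formulation.
import torch

def label_correlation_aware_loss(probs, targets):
    # probs:   (B, C) predicted probabilities, one per emotion class.
    # targets: (B, C) binary ground-truth indicators.
    losses = []
    for p, y in zip(probs, targets):
        pos, neg = p[y.bool()], p[~y.bool()]
        if len(pos) == 0 or len(neg) == 0:
            continue                    # nothing to rank for this sample
        # exp(absent - present) over all (absent, present) pairs, averaged:
        # large when an absent emotion outscores a present one.
        pairwise = torch.exp(neg.unsqueeze(1) - pos.unsqueeze(0))
        losses.append(pairwise.mean())
    # Fall back to a zero that stays attached to the graph.
    return torch.stack(losses).mean() if losses else probs.sum() * 0.0
```

In practice a term like this is typically combined with a standard binary cross-entropy loss, with the mixing weight left as a hyperparameter.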