Enhancing Emotion Recognition in Conversation through Emotional Cross-Modal Fusion and Inter-class Contrastive Learning
- URL: http://arxiv.org/abs/2405.17900v1
- Date: Tue, 28 May 2024 07:22:30 GMT
- Title: Enhancing Emotion Recognition in Conversation through Emotional Cross-Modal Fusion and Inter-class Contrastive Learning
- Authors: Haoxiang Shi, Xulong Zhang, Ning Cheng, Yong Zhang, Jun Yu, Jing Xiao, Jianzong Wang
- Abstract summary: The purpose of emotion recognition in conversation (ERC) is to identify the emotion category of an utterance based on contextual information.
Previous ERC methods relied on simple connections for cross-modal fusion.
We propose a cross-modal fusion emotion prediction network based on vector connections.
- Score: 40.101313334772016
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The purpose of emotion recognition in conversation (ERC) is to identify the emotion category of an utterance based on contextual information. Previous ERC methods relied on simple connections for cross-modal fusion and ignored the information differences between modalities, so the model could not focus on modality-specific emotional information; at the same time, the information shared between modalities was not processed for emotion prediction, resulting in an information redundancy problem. To overcome these limitations, we propose a cross-modal fusion emotion prediction network based on vector connections. The network comprises two stages: a multi-modal feature fusion stage based on connection vectors and an emotion classification stage based on the fused features. Furthermore, we design a supervised inter-class contrastive learning module based on emotion labels. Experimental results confirm the effectiveness of the proposed method, which achieves excellent performance on the IEMOCAP and MELD datasets.
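The abstract describes the two stages only at a high level. Below is a minimal PyTorch sketch of one way the pieces could fit together, assuming text and audio features as the two modalities; every module name, dimension, and the loss weight are illustrative assumptions rather than the authors' implementation, and the contrastive term follows a generic supervised (label-aware) formulation rather than the paper's exact inter-class loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConnectionFusionERC(nn.Module):
    """Toy two-stage model: (1) fuse modality features via a connection
    (concatenation) vector, (2) classify emotion from the fused feature.
    Layer choices and sizes are assumptions, not the paper's architecture."""

    def __init__(self, d_text=768, d_audio=128, d_model=256, n_emotions=6):
        super().__init__()
        # Stage 1: project each modality, then connect (concatenate) the vectors.
        self.proj_text = nn.Linear(d_text, d_model)
        self.proj_audio = nn.Linear(d_audio, d_model)
        self.fuse = nn.Linear(2 * d_model, d_model)  # fusion of the connection vector
        # Stage 2: emotion classification on the fused representation.
        self.classifier = nn.Linear(d_model, n_emotions)

    def forward(self, x_text, x_audio):
        h_t = torch.tanh(self.proj_text(x_text))
        h_a = torch.tanh(self.proj_audio(x_audio))
        fused = torch.tanh(self.fuse(torch.cat([h_t, h_a], dim=-1)))
        return fused, self.classifier(fused)


def supervised_contrastive_loss(features, labels, temperature=0.1):
    """Label-aware (SupCon-style) contrastive loss: utterances with the same
    emotion label are pulled together, different labels are pushed apart.
    The paper's exact inter-class formulation may differ."""
    z = F.normalize(features, dim=-1)
    sim = z @ z.t() / temperature                       # pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(self_mask, float("-inf")), dim=1, keepdim=True
    )
    pos_count = pos.sum(1).clamp(min=1)                 # avoid division by zero
    return -(log_prob * pos).sum(1).div(pos_count).mean()


# Joint objective: cross-entropy plus a weighted contrastive term (weight assumed).
model = ConnectionFusionERC()
x_text, x_audio = torch.randn(8, 768), torch.randn(8, 128)
labels = torch.randint(0, 6, (8,))
fused, logits = model(x_text, x_audio)
loss = F.cross_entropy(logits, labels) + 0.1 * supervised_contrastive_loss(fused, labels)
loss.backward()
```

The contrastive term uses the emotion labels to pull same-class fused representations together and push different classes apart, which is the role the abstract assigns to the supervised inter-class contrastive learning module.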
Related papers
- Enhancing Modal Fusion by Alignment and Label Matching for Multimodal Emotion Recognition [16.97833694961584]
Foal-Net is designed to enhance the effectiveness of modality fusion.
It includes two auxiliary tasks: audio-video emotion alignment and cross-modal emotion label matching.
Experiments show that Foal-Net outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2024-08-18T11:05:21Z)
- Two in One Go: Single-stage Emotion Recognition with Decoupled Subject-context Transformer [78.35816158511523]
We present a single-stage emotion recognition approach, employing a Decoupled Subject-Context Transformer (DSCT) for simultaneous subject localization and emotion classification.
We evaluate our single-stage framework on two widely used context-aware emotion recognition datasets, CAER-S and EMOTIC.
arXiv Detail & Related papers (2024-04-26T07:30:32Z)
- AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features.
Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics.
arXiv Detail & Related papers (2024-04-12T11:31:18Z)
- Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive Learning for Multimodal Emotion Recognition [14.639340916340801]
We propose a novel Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive for Multimodal Emotion Recognition (AR-IIGCN) method.
Firstly, we input video, audio, and text features into a multi-layer perceptron (MLP) to map them into separate feature spaces.
Secondly, we build a generator and a discriminator for the three modal features through adversarial representation.
Thirdly, we introduce contrastive graph representation learning to capture intra-modal and inter-modal complementary semantic information.
arXiv Detail & Related papers (2023-12-28T01:57:26Z)
- CFN-ESA: A Cross-Modal Fusion Network with Emotion-Shift Awareness for Dialogue Emotion Recognition [34.24557248359872]
We propose a Cross-modal Fusion Network with Emotion-Shift Awareness (CFN-ESA) for emotion recognition in conversation.
CFN-ESA consists of a unimodal encoder (RUME), a cross-modal encoder (ACME), and an emotion-shift module (LESM).
arXiv Detail & Related papers (2023-07-28T09:29:42Z)
- EmotionIC: Emotional Inertia and Contagion-driven Dependency Modeling for Emotion Recognition in Conversation [34.24557248359872]
We propose an emotional inertia and contagion-driven dependency modeling approach (EmotionIC) for ERC task.
Our EmotionIC consists of three main components, i.e., Identity Masked Multi-Head Attention (IMMHA), Dialogue-based Gated Recurrent Unit (DiaGRU), and Skip-chain Conditional Random Field (SkipCRF).
Experimental results show that our method can significantly outperform the state-of-the-art models on four benchmark datasets.
arXiv Detail & Related papers (2023-03-20T13:58:35Z)
- Multimodal Emotion Recognition with Modality-Pairwise Unsupervised Contrastive Loss [80.79641247882012]
We focus on unsupervised feature learning for Multimodal Emotion Recognition (MER).
We consider discrete emotions and use text, audio, and vision as modalities.
Our method, based on a contrastive loss between pairwise modalities, is the first such attempt in the MER literature.
arXiv Detail & Related papers (2022-07-23T10:11:24Z)
- M2FNet: Multi-modal Fusion Network for Emotion Recognition in Conversation [1.3864478040954673]
We propose a Multi-modal Fusion Network (M2FNet) that extracts emotion-relevant features from the visual, audio, and text modalities.
It employs a multi-head attention-based fusion mechanism to combine emotion-rich latent representations of the input data.
The proposed feature extractor is trained with a novel adaptive margin-based triplet loss function to learn emotion-relevant features from the audio and visual data.
arXiv Detail & Related papers (2022-06-05T14:18:58Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture (IEMOCAP) dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
- MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition [118.73025093045652]
We propose a pre-training model, MEmoBERT, for multimodal emotion recognition.
Unlike the conventional "pre-train, finetune" paradigm, we propose a prompt-based method that reformulates the downstream emotion classification task as a masked text prediction.
Our proposed MEmoBERT significantly enhances emotion recognition performance.
arXiv Detail & Related papers (2021-10-27T09:57:00Z)
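As a rough illustration of the prompt-based reformulation mentioned in the MEmoBERT entry above, the text-only sketch below scores emotion label words at a [MASK] position instead of using a classification head. The template, label word set, and bert-base-uncased checkpoint are assumptions made for illustration; MEmoBERT itself is a multimodal pre-trained model, and its actual prompts and label-word mapping are not reproduced here.

```python
# Illustrative, text-only sketch of "emotion classification as masked text
# prediction"; not MEmoBERT's implementation. Template and label words assumed.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# Verbalizer: map each emotion label to a single token in the vocabulary.
label_words = ["happy", "sad", "angry", "neutral"]           # assumed label set
label_ids = tokenizer.convert_tokens_to_ids(label_words)

utterance = "I can't believe we finally won the game!"
prompt = f"{utterance} I feel [MASK]."                        # assumed template
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits                           # (1, seq_len, vocab)

# Find the [MASK] position and compare only the label-word logits.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
scores = logits[0, mask_pos, :].squeeze(0)[label_ids]
print(label_words[int(scores.argmax())])
```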