A Low-rank Matching Attention based Cross-modal Feature Fusion Method
for Conversational Emotion Recognition
- URL: http://arxiv.org/abs/2306.17799v1
- Date: Fri, 16 Jun 2023 16:02:44 GMT
- Title: A Low-rank Matching Attention based Cross-modal Feature Fusion Method
for Conversational Emotion Recognition
- Authors: Yuntao Shou, Xiangyong Cao, Deyu Meng, Bo Dong, Qinghua Zheng
- Abstract summary: This paper develops a novel cross-modal feature fusion method, the low-rank matching attention method (LMAM), for the conversational emotion recognition (CER) task.
By setting a matching weight and computing attention scores between modal features row by row, LMAM requires fewer parameters than standard self-attention.
We show that LMAM can be embedded into any existing state-of-the-art DL-based CER method and boost its performance in a plug-and-play manner.
- Score: 56.20144064187554
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conversational emotion recognition (CER) is an important research topic in
human-computer interaction. Although deep learning (DL) based CER approaches
have achieved excellent performance, existing cross-modal feature fusion
methods used in these DL-based approaches either ignore the intra-modal and
inter-modal emotional interaction or have high computational complexity. To
address these issues, this paper develops a novel cross-modal feature fusion
method for the CER task, i.e., the low-rank matching attention method (LMAM).
By setting a matching weight and calculating attention scores between modal
features row by row, LMAM contains fewer parameters than the self-attention
method. We further apply low-rank decomposition to the matching weight, reducing
the number of parameters in LMAM to less than one-third of that of self-attention.
Therefore, LMAM can potentially alleviate the over-fitting issue caused by a
large number of parameters. Additionally, by computing and fusing the
similarity of intra-modal and inter-modal features, LMAM can also fully exploit
the intra-modal contextual information within each modality and the
complementary semantic information across modalities (i.e., text, video and
audio) simultaneously. Experimental results on benchmark datasets show
that LMAM can be embedded into any existing state-of-the-art DL-based CER
method and boost its performance in a plug-and-play manner. Also,
experimental results verify the superiority of LMAM compared with other popular
cross-modal fusion methods. Moreover, LMAM is a general cross-modal fusion
method and can thus be applied to other multi-modal recognition tasks, e.g.,
session recommendation and humour detection.
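The abstract describes LMAM only at a high level, so the following is a minimal sketch of what a low-rank matching attention layer could look like, assuming PyTorch, a shared feature dimension d across modalities, and a rank r much smaller than d. The class name, the factorization into U and V, and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LowRankMatchingAttention(nn.Module):
    """Illustrative low-rank matching attention (not the authors' code).

    A full d x d matching weight W is replaced by the factorization
    U @ V.T with U, V of shape (d, r), so the layer stores 2*d*r
    parameters instead of d*d. Attention scores between two modality
    feature sequences are then computed row by row through this weight.
    """

    def __init__(self, d: int, rank: int):
        super().__init__()
        self.U = nn.Parameter(torch.randn(d, rank) * d ** -0.5)
        self.V = nn.Parameter(torch.randn(d, rank) * d ** -0.5)

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        # x_a: (batch, n, d) features of one modality (e.g., text)
        # x_b: (batch, m, d) features of another modality (e.g., audio);
        #      passing x_a twice recovers the intra-modal case.
        # scores[b, i, j] = x_a[b, i] @ U @ V.T @ x_b[b, j]
        scores = (x_a @ self.U) @ (x_b @ self.V).transpose(1, 2)  # (batch, n, m)
        attn = scores.softmax(dim=-1)   # row-wise attention over x_b
        return attn @ x_b               # (batch, n, d) fused features
```

For a rough parameter count under these assumptions: a self-attention layer with d x d query, key, and value projections stores 3d^2 weights, while the factorized matching weight stores 2dr, which stays below one-third of 3d^2 whenever r < d/2. With d = 512 and r = 64, that is 65,536 versus 786,432 parameters.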
Related papers
- Completed Feature Disentanglement Learning for Multimodal MRIs Analysis [36.32164729310868]
Feature disentanglement (FD)-based methods have achieved significant success in multimodal learning (MML).
We propose a novel Complete Feature Disentanglement (CFD) strategy that recovers the lost information during feature decoupling.
Specifically, the CFD strategy not only identifies modality-shared and modality-specific features, but also decouples shared features among subsets of multimodal inputs.
arXiv Detail & Related papers (2024-07-06T01:49:38Z)
- How Intermodal Interaction Affects the Performance of Deep Multimodal Fusion for Mixed-Type Time Series [3.6958071416494414]
Mixed-type time series (MTTS) is a bimodal data type common in many domains, such as healthcare, finance, environmental monitoring, and social media.
The integration of both modalities through multimodal fusion is a promising approach for processing MTTS.
We present a comprehensive evaluation of several deep multimodal fusion approaches for MTTS forecasting.
arXiv Detail & Related papers (2024-06-21T12:26:48Z)
- Modality Prompts for Arbitrary Modality Salient Object Detection [57.610000247519196]
This paper delves into the task of arbitrary modality salient object detection (AM SOD).
It aims to detect salient objects from arbitrary modalities, e.g., RGB images, RGB-D images, and RGB-D-T images.
A novel modality-adaptive Transformer (MAT) is proposed to investigate two fundamental challenges of AM SOD.
arXiv Detail & Related papers (2024-05-06T11:02:02Z)
- Joint Multimodal Transformer for Emotion Recognition in the Wild [49.735299182004404]
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems.
This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention.
arXiv Detail & Related papers (2024-03-15T17:23:38Z)
- Deep Equilibrium Multimodal Fusion [88.04713412107947]
Multimodal fusion integrates the complementary information present in multiple modalities and has gained much attention recently.
We propose a novel deep equilibrium (DEQ) method towards multimodal fusion via seeking a fixed point of the dynamic multimodal fusion process (a toy sketch of this fixed-point formulation appears after this list).
Experiments on BRCA, MM-IMDB, CMU-MOSI, SUN RGB-D, and VQA-v2 demonstrate the superiority of our DEQ fusion.
arXiv Detail & Related papers (2023-06-29T03:02:20Z)
- Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical Fusion for Multimodal Affect Recognition [69.32305810128994]
Incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition.
We propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model.
HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention.
arXiv Detail & Related papers (2023-05-23T01:24:15Z)
- Multimodal Hyperspectral Image Classification via Interconnected Fusion [12.41850641917384]
An Interconnected Fusion (IF) framework is proposed to explore the relationships across HSI and LiDAR modalities comprehensively.
Experiments have been conducted on three widely used datasets: Trento, MUUFL, and Houston.
arXiv Detail & Related papers (2023-04-02T09:46:13Z)
- Multi-Modal Mutual Information Maximization: A Novel Approach for Unsupervised Deep Cross-Modal Hashing [73.29587731448345]
We propose a novel method, dubbed Cross-Modal Info-Max Hashing (CMIMH).
We learn informative representations that can preserve both intra- and inter-modal similarities.
The proposed method consistently outperforms other state-of-the-art cross-modal retrieval methods.
arXiv Detail & Related papers (2021-12-13T08:58:03Z)
- Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis [16.32509144501822]
We propose a framework named MultiModal InfoMax (MMIM), which hierarchically maximizes the Mutual Information (MI) in unimodal input pairs.
The framework is jointly trained with the main task (MSA) to improve the performance of the downstream MSA task.
arXiv Detail & Related papers (2021-09-01T14:45:16Z)
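The Deep Equilibrium Multimodal Fusion entry above defines the fused representation implicitly, as a fixed point of the fusion process. As a toy sketch of that formulation only (not the paper's model), a fused vector z satisfying z = f(z, x) can be approximated by naive forward iteration over concatenated modality features x:

```python
import torch
import torch.nn as nn

class FixedPointFusion(nn.Module):
    """Toy equilibrium-style fusion (illustrative, not the DEQ paper's model).

    The fused representation z is defined implicitly by z = f(z, x),
    where x concatenates the per-modality features. Here f is a single
    linear layer with tanh, and the fixed point is approximated by
    iterating f until the update falls below a tolerance.
    """

    def __init__(self, d_fused: int, d_inputs: int):
        super().__init__()
        self.f = nn.Linear(d_fused + d_inputs, d_fused)

    def forward(self, x: torch.Tensor, n_iters: int = 50, tol: float = 1e-4) -> torch.Tensor:
        # x: (batch, d_inputs) concatenated modality features
        z = x.new_zeros(x.size(0), self.f.out_features)
        for _ in range(n_iters):
            z_next = torch.tanh(self.f(torch.cat([z, x], dim=-1)))
            if (z_next - z).norm() < tol:  # stop once (approximately) at the fixed point
                return z_next
            z = z_next
        return z
```

A real DEQ would typically use a root solver for the fixed point and differentiate through it via the implicit function theorem rather than backpropagating through the loop; the sketch only illustrates the fixed-point formulation.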
This list is automatically generated from the titles and abstracts of the papers in this site.