A Low-rank Matching Attention based Cross-modal Feature Fusion Method
for Conversational Emotion Recognition
- URL: http://arxiv.org/abs/2306.17799v1
- Date: Fri, 16 Jun 2023 16:02:44 GMT
- Title: A Low-rank Matching Attention based Cross-modal Feature Fusion Method
for Conversational Emotion Recognition
- Authors: Yuntao Shou, Xiangyong Cao, Deyu Meng, Bo Dong, Qinghua Zheng
- Abstract summary: This paper develops a novel cross-modal feature fusion method for the Conversational emotion recognition (CER) task.
By setting a matching weight and calculating attention scores between modal features row by row, LMAM contains fewer parameters than the self-attention method.
We show that LMAM can be embedded into any existing state-of-the-art DL-based CER methods and help boost their performance in a plug-and-play manner.
- Score: 56.20144064187554
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conversational emotion recognition (CER) is an important research topic in
human-computer interactions. Although deep learning (DL) based CER approaches
have achieved excellent performance, existing cross-modal feature fusion
methods used in these DL-based approaches either ignore the intra-modal and
inter-modal emotional interaction or have high computational complexity. To
address these issues, this paper develops a novel cross-modal feature fusion
method for the CER task, i.e., the low-rank matching attention method (LMAM).
By setting a matching weight and calculating attention scores between modal
features row by row, LMAM contains fewer parameters than the self-attention
method. We further utilize the low-rank decomposition method on the weight to
make the parameter number of LMAM less than one-third of the self-attention.
Therefore, LMAM can potentially alleviate the over-fitting issue caused by a
large number of parameters. Additionally, by computing and fusing the
similarity of intra-modal and inter-modal features, LMAM can also fully exploit
the intra-modal contextual information within each modality and the
complementary semantic information across modalities (i.e., text, video and
audio) simultaneously. Experimental results on some benchmark datasets show
that LMAM can be embedded into any existing state-of-the-art DL-based CER
methods and help boost their performance in a plug-and-play manner. Also,
experimental results verify the superiority of LMAM compared with other popular
cross-modal fusion methods. Moreover, LMAM is a general cross-modal fusion
method and can thus be applied to other multi-modal recognition tasks, e.g.,
session recommendation and humour detection.
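To make the described mechanism concrete, the following is a minimal PyTorch sketch of a low-rank matching attention layer of the kind the abstract outlines: a matching weight factorised into two low-rank matrices, with row-wise matching scores computed between the feature sequences of two modalities. The class name, rank, feature dimensions, and scaling factor are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LowRankMatchingAttention(nn.Module):
    """Sketch of a low-rank matching attention fusion layer (assumed design).

    The matching weight W (d x d) is factorised as U @ V with rank r << d,
    so the layer stores 2*d*r parameters instead of the roughly 3*d*d used
    by the Q/K/V projections of standard self-attention.
    """

    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.U = nn.Parameter(torch.randn(dim, rank) * 0.02)  # d x r factor
        self.V = nn.Parameter(torch.randn(rank, dim) * 0.02)  # r x d factor

    def forward(self, query_mod: torch.Tensor, key_mod: torch.Tensor) -> torch.Tensor:
        # query_mod, key_mod: (batch, seq_len, dim) features of two modalities
        # (or the same modality, for intra-modal matching).
        W = self.U @ self.V                               # low-rank matching weight, d x d
        # Row-by-row matching scores between the two feature sequences.
        scores = query_mod @ W @ key_mod.transpose(1, 2)  # (batch, seq_q, seq_k)
        attn = F.softmax(scores / query_mod.size(-1) ** 0.5, dim=-1)
        return attn @ key_mod                             # query features fused with key modality


# Toy usage with illustrative shapes: fuse text features with audio features.
text = torch.randn(2, 10, 128)
audio = torch.randn(2, 10, 128)
fusion = LowRankMatchingAttention(dim=128, rank=8)
text_aware_of_audio = fusion(text, audio)  # (2, 10, 128)
```

Under these assumed sizes (d = 128, r = 8) the layer holds 2 x 128 x 8 = 2,048 parameters, versus roughly 3 x 128 x 128 = 49,152 for the Q/K/V projections of a standard self-attention block, which is consistent with the abstract's "less than one-third" claim.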
Related papers
- GIA-MIC: Multimodal Emotion Recognition with Gated Interactive Attention and Modality-Invariant Learning Constraints [24.242098942377574]
Multimodal emotion recognition (MER) extracts emotions from multimodal data, including visual, speech, and text inputs, playing a key role in human-computer interaction. We propose a gated interactive attention mechanism to adaptively extract modality-specific features while enhancing emotional information through pairwise interactions. Experiments on IEMOCAP demonstrate that our method outperforms state-of-the-art MER approaches, achieving WA 80.7% and UA 81.3%.
arXiv Detail & Related papers (2025-06-01T07:07:02Z) - A Novel Approach to for Multimodal Emotion Recognition : Multimodal semantic information fusion [3.1409950035735914]
This paper proposes a novel multimodal emotion recognition approach, DeepMSI-MER, based on the integration of contrastive learning and visual sequence compression.
Experimental results on two public datasets, IEMOCAP and MELD, demonstrate that DeepMSI-MER significantly improves the accuracy and robustness of emotion recognition.
arXiv Detail & Related papers (2025-02-12T17:07:43Z) - Asynchronous Multimodal Video Sequence Fusion via Learning Modality-Exclusive and -Agnostic Representations [19.731611716111566]
We propose a Multimodal fusion approach for learning modality-Exclusive and modality-Agnostic representations.
We introduce a predictive self-attention module to capture reliable context dynamics within modalities.
A hierarchical cross-modal attention module is designed to explore valuable element correlations among modalities.
A double-discriminator strategy is presented to ensure the production of distinct representations in an adversarial manner.
arXiv Detail & Related papers (2024-07-06T04:36:48Z) - Modality Prompts for Arbitrary Modality Salient Object Detection [57.610000247519196]
This paper delves into the task of arbitrary modality salient object detection (AM SOD).
It aims to detect salient objects from arbitrary modalities, e.g., RGB images, RGB-D images, and RGB-D-T images.
A novel modality-adaptive Transformer (MAT) will be proposed to investigate two fundamental challenges of AM SOD.
arXiv Detail & Related papers (2024-05-06T11:02:02Z) - AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features.
Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics.
arXiv Detail & Related papers (2024-04-12T11:31:18Z) - Joint Multimodal Transformer for Emotion Recognition in the Wild [49.735299182004404]
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems.
This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention.
arXiv Detail & Related papers (2024-03-15T17:23:38Z) - Multimodal Prompt Transformer with Hybrid Contrastive Learning for
Emotion Recognition in Conversation [9.817888267356716]
Multimodal Emotion Recognition in Conversation (ERC) faces two problems.
Deep emotion cues extraction was performed on modalities with strong representation ability.
Feature filters were designed as multimodal prompt information for modalities with weak representation ability.
MPT embeds multimodal fusion information into each attention layer of the Transformer.
arXiv Detail & Related papers (2023-10-04T13:54:46Z) - Deep Equilibrium Multimodal Fusion [88.04713412107947]
Multimodal fusion integrates the complementary information present in multiple modalities and has gained much attention recently.
We propose a novel deep equilibrium (DEQ) method towards multimodal fusion via seeking a fixed point of the dynamic multimodal fusion process.
Experiments on BRCA, MM-IMDB, CMU-MOSI, SUN RGB-D, and VQA-v2 demonstrate the superiority of our DEQ fusion.
arXiv Detail & Related papers (2023-06-29T03:02:20Z) - Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical
Fusion for Multimodal Affect Recognition [69.32305810128994]
Incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition.
We propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model.
HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention.
arXiv Detail & Related papers (2023-05-23T01:24:15Z) - Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space
Using Joint Cross-Attention [15.643176705932396]
We introduce a joint cross-attentional model for A-V fusion that extracts the salient features across A-V modalities.
It computes the cross-attention weights based on correlation between the joint feature representation and that of the individual modalities.
Results indicate that our joint cross-attentional A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches.
arXiv Detail & Related papers (2022-09-19T15:01:55Z) - Cross Attentional Audio-Visual Fusion for Dimensional Emotion Recognition [13.994609732846344]
Most effective techniques for emotion recognition efficiently leverage diverse and complementary sources of information.
We introduce a cross-attentional fusion approach to extract the salient features across audio-visual (A-V) modalities.
Results indicate that our cross-attentional A-V fusion model is a cost-effective approach that outperforms state-of-the-art fusion approaches.
arXiv Detail & Related papers (2021-11-09T16:01:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the listed information and is not responsible for any consequences of its use.