A Low-rank Matching Attention based Cross-modal Feature Fusion Method
for Conversational Emotion Recognition
- URL: http://arxiv.org/abs/2306.17799v1
- Date: Fri, 16 Jun 2023 16:02:44 GMT
- Title: A Low-rank Matching Attention based Cross-modal Feature Fusion Method
for Conversational Emotion Recognition
- Authors: Yuntao Shou, Xiangyong Cao, Deyu Meng, Bo Dong, Qinghua Zheng
- Abstract summary: This paper develops a novel cross-modal feature fusion method, the low-rank matching attention method (LMAM), for the conversational emotion recognition (CER) task.
By setting a matching weight and computing attention scores between modal features row by row, LMAM requires fewer parameters than standard self-attention.
We show that LMAM can be embedded into any existing state-of-the-art DL-based CER method and boost its performance in a plug-and-play manner.
- Score: 56.20144064187554
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conversational emotion recognition (CER) is an important research topic in
human-computer interaction. Although deep learning (DL) based CER approaches
have achieved excellent performance, existing cross-modal feature fusion
methods used in these DL-based approaches either ignore the intra-modal and
inter-modal emotional interaction or have high computational complexity. To
address these issues, this paper develops a novel cross-modal feature fusion
method for the CER task, i.e., the low-rank matching attention method (LMAM).
By setting a matching weight and calculating attention scores between modal
features row by row, LMAM contains fewer parameters than the self-attention
method. We further apply low-rank decomposition to the matching weight, reducing
the number of parameters in LMAM to less than one-third of that of self-attention.
Therefore, LMAM can potentially alleviate the over-fitting issue caused by a
large number of parameters. Additionally, by computing and fusing the
similarity of intra-modal and inter-modal features, LMAM can also fully exploit
the intra-modal contextual information within each modality and the
complementary semantic information across modalities (i.e., text, video and
audio) simultaneously. Experimental results on benchmark datasets show
that LMAM can be embedded into any existing state-of-the-art DL-based CER
method and boost its performance in a plug-and-play manner. Also,
experimental results verify the superiority of LMAM compared with other popular
cross-modal fusion methods. Moreover, LMAM is a general cross-modal fusion
method and can thus be applied to other multi-modal recognition tasks, e.g.,
session recommendation and humour detection.
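The abstract describes LMAM only at a high level, so the following is a minimal sketch of what a low-rank matching attention layer could look like, assuming PyTorch, a shared feature dimension d across modalities, and a rank r much smaller than d. The class name, the factorization into U and V, and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LowRankMatchingAttention(nn.Module):
    """Illustrative low-rank matching attention (not the authors' code).

    A full d x d matching weight W is replaced by the factorization
    U @ V.T with U, V of shape (d, r), so the layer stores 2*d*r
    parameters instead of d*d. Attention scores between two modality
    feature sequences are then computed row by row through this weight.
    """

    def __init__(self, d: int, rank: int):
        super().__init__()
        self.U = nn.Parameter(torch.randn(d, rank) * d ** -0.5)
        self.V = nn.Parameter(torch.randn(d, rank) * d ** -0.5)

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        # x_a: (batch, n, d) features of one modality (e.g., text)
        # x_b: (batch, m, d) features of another modality (e.g., audio);
        #      passing x_a twice recovers the intra-modal case.
        # scores[b, i, j] = x_a[b, i] @ U @ V.T @ x_b[b, j]
        scores = (x_a @ self.U) @ (x_b @ self.V).transpose(1, 2)  # (batch, n, m)
        attn = scores.softmax(dim=-1)   # row-wise attention over x_b
        return attn @ x_b               # (batch, n, d) fused features
```

For a rough parameter count under these assumptions: a self-attention layer with d x d query, key, and value projections stores 3d^2 weights, while the factorized matching weight stores 2dr, which stays below one-third of 3d^2 whenever r < d/2. With d = 512 and r = 64, that is 65,536 versus 786,432 parameters.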
Related papers
- Completed Feature Disentanglement Learning for Multimodal MRIs Analysis [36.32164729310868]
Feature disentanglement (FD)-based methods have achieved significant success in multimodal learning (MML).
We propose a novel Complete Feature Disentanglement (CFD) strategy that recovers the lost information during feature decoupling.
Specifically, the CFD strategy not only identifies modality-shared and modality-specific features, but also decouples shared features among subsets of multimodal inputs.
arXiv Detail & Related papers (2024-07-06T01:49:38Z)
- How Intermodal Interaction Affects the Performance of Deep Multimodal Fusion for Mixed-Type Time Series [3.6958071416494414]
Mixed-type time series (MTTS) is a bimodal data type common in many domains, such as healthcare, finance, environmental monitoring, and social media.
The integration of both modalities through multimodal fusion is a promising approach for processing MTTS.
We present a comprehensive evaluation of several deep multimodal fusion approaches for MTTS forecasting.
arXiv Detail & Related papers (2024-06-21T12:26:48Z)
- Modality Prompts for Arbitrary Modality Salient Object Detection [57.610000247519196]
This paper delves into the task of arbitrary modality salient object detection (AM SOD).
It aims to detect salient objects from arbitrary modalities, e.g., RGB images, RGB-D images, and RGB-D-T images.
A novel modality-adaptive Transformer (MAT) is proposed to investigate two fundamental challenges of AM SOD.
arXiv Detail & Related papers (2024-05-06T11:02:02Z)
- Joint Multimodal Transformer for Emotion Recognition in the Wild [49.735299182004404]
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems.
This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention.
arXiv Detail & Related papers (2024-03-15T17:23:38Z)
- Deep Equilibrium Multimodal Fusion [88.04713412107947]
Multimodal fusion integrates the complementary information present in multiple modalities and has gained much attention recently.
We propose a novel deep equilibrium (DEQ) method towards multimodal fusion via seeking a fixed point of the dynamic multimodal fusion process (a toy sketch of this fixed-point formulation appears after this list).
Experiments on BRCA, MM-IMDB, CMU-MOSI, SUN RGB-D, and VQA-v2 demonstrate the superiority of our DEQ fusion.
arXiv Detail & Related papers (2023-06-29T03:02:20Z)
- Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical Fusion for Multimodal Affect Recognition [69.32305810128994]
Incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition.
We propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model.
HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention.
arXiv Detail & Related papers (2023-05-23T01:24:15Z)
- Multimodal Hyperspectral Image Classification via Interconnected Fusion [12.41850641917384]
An Interconnected Fusion (IF) framework is proposed to explore the relationships across HSI and LiDAR modalities comprehensively.
Experiments have been conducted on three widely used datasets: Trento, MUUFL, and Houston.
arXiv Detail & Related papers (2023-04-02T09:46:13Z)
- Multi-Modal Mutual Information Maximization: A Novel Approach for Unsupervised Deep Cross-Modal Hashing [73.29587731448345]
We propose a novel method, dubbed Cross-Modal Info-Max Hashing (CMIMH).
We learn informative representations that can preserve both intra- and inter-modal similarities.
The proposed method consistently outperforms other state-of-the-art cross-modal retrieval methods.
arXiv Detail & Related papers (2021-12-13T08:58:03Z)
- Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis [16.32509144501822]
We propose a framework named MultiModal InfoMax (MMIM), which hierarchically maximizes the Mutual Information (MI) in unimodal input pairs.
The framework is jointly trained with the main task (MSA) to improve the performance of the downstream MSA task.
arXiv Detail & Related papers (2021-09-01T14:45:16Z)
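The Deep Equilibrium Multimodal Fusion entry above defines the fused representation implicitly, as a fixed point of the fusion process. As a toy sketch of that formulation only (not the paper's model), a fused vector z satisfying z = f(z, x) can be approximated by naive forward iteration over concatenated modality features x:

```python
import torch
import torch.nn as nn

class FixedPointFusion(nn.Module):
    """Toy equilibrium-style fusion (illustrative, not the DEQ paper's model).

    The fused representation z is defined implicitly by z = f(z, x),
    where x concatenates the per-modality features. Here f is a single
    linear layer with tanh, and the fixed point is approximated by
    iterating f until the update falls below a tolerance.
    """

    def __init__(self, d_fused: int, d_inputs: int):
        super().__init__()
        self.f = nn.Linear(d_fused + d_inputs, d_fused)

    def forward(self, x: torch.Tensor, n_iters: int = 50, tol: float = 1e-4) -> torch.Tensor:
        # x: (batch, d_inputs) concatenated modality features
        z = x.new_zeros(x.size(0), self.f.out_features)
        for _ in range(n_iters):
            z_next = torch.tanh(self.f(torch.cat([z, x], dim=-1)))
            if (z_next - z).norm() < tol:  # stop once (approximately) at the fixed point
                return z_next
            z = z_next
        return z
```

A real DEQ would typically use a root solver for the fixed point and differentiate through it via the implicit function theorem rather than backpropagating through the loop; the sketch only illustrates the fixed-point formulation.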
This list is automatically generated from the titles and abstracts of the papers in this site.