Heterogeneous bimodal attention fusion for speech emotion recognition
- URL: http://arxiv.org/abs/2503.06405v2
- Date: Sun, 23 Mar 2025 08:21:43 GMT
- Title: Heterogeneous bimodal attention fusion for speech emotion recognition
- Authors: Jiachen Luo, Huy Phan, Lin Wang, Joshua Reiss
- Abstract summary: We propose a novel framework called Heterogeneous Bimodal Attention Fusion (HBAF) for multi-level multi-modal interaction in conversational emotion recognition. The uni-modal representation module incorporates contextual content into low-level audio representations to bridge the heterogeneous multi-modal gap. The multi-modal fusion module uses dynamic bimodal attention and a dynamic gating mechanism to filter incorrect cross-modal relationships.
- Score: 14.24203981584309
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-modal emotion recognition in conversations is a challenging problem due to the complex and complementary interactions between different modalities. Audio and textual cues are particularly important for understanding emotions from a human perspective. Most existing studies focus on exploring interactions between audio and text modalities at the same representation level. However, a critical issue is often overlooked: the heterogeneous modality gap between low-level audio representations and high-level text representations. To address this problem, we propose a novel framework called Heterogeneous Bimodal Attention Fusion (HBAF) for multi-level multi-modal interaction in conversational emotion recognition. The proposed method comprises three key modules: the uni-modal representation module, the multi-modal fusion module, and the inter-modal contrastive learning module. The uni-modal representation module incorporates contextual content into low-level audio representations to bridge the heterogeneous multi-modal gap, enabling more effective fusion. The multi-modal fusion module uses dynamic bimodal attention and a dynamic gating mechanism to filter incorrect cross-modal relationships and fully exploit both intra-modal and inter-modal interactions. Finally, the inter-modal contrastive learning module captures complex absolute and relative interactions between audio and text modalities. Experiments on the MELD and IEMOCAP datasets demonstrate that the proposed HBAF method outperforms existing state-of-the-art baselines.
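The fusion idea described in the abstract can be illustrated with a minimal sketch: audio frames attend over text tokens, and a sigmoid gate decides, per frame, how much of the attended cross-modal context to mix back into the audio stream. This is a NumPy toy, not the authors' implementation; all shapes, weight matrices, and the single-head attention form are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bimodal_attention_fusion(audio, text, W_q, W_k, W_v, W_g):
    """audio: (Ta, d), text: (Tt, d) -> gated audio-centric fusion, (Ta, d)."""
    Q = audio @ W_q                     # queries from the audio stream
    K, V = text @ W_k, text @ W_v       # keys/values from the text stream
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (Ta, Tt) cross-modal weights
    cross = attn @ V                    # text context aligned to each audio frame
    # dynamic gate in [0, 1], conditioned on both the frame and its cross context
    gate = 1 / (1 + np.exp(-(np.concatenate([audio, cross], axis=-1) @ W_g)))
    return gate * cross + (1 - gate) * audio  # gated residual fusion

d, Ta, Tt = 8, 5, 7
audio, text = rng.normal(size=(Ta, d)), rng.normal(size=(Tt, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
W_g = rng.normal(size=(2 * d, d))
fused = bimodal_attention_fusion(audio, text, W_q, W_k, W_v, W_g)
print(fused.shape)  # (5, 8)
```

In a trained system the projection and gate weights would be learned end-to-end, and the gate is what lets the model suppress unreliable cross-modal relationships rather than fusing them unconditionally.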
Related papers
- Bimodal Connection Attention Fusion for Speech Emotion Recognition [17.5756663655978]
We propose the Bimodal Connection Attention Fusion (BCAF) method to build effective bimodal speech emotion recognition systems. The BCAF method includes three main modules: the interactive connection network, the bimodal attention network, and the correlative attention network. Experiments on the MELD and IEMOCAP datasets demonstrate that the proposed BCAF method outperforms existing state-of-the-art baselines.
arXiv Detail & Related papers (2025-03-08T10:20:57Z)
- Cross-modal Context Fusion and Adaptive Graph Convolutional Network for Multimodal Conversational Emotion Recognition [0.0]
This article proposes a new multimodal emotion recognition method, including a cross-modal context fusion module, an adaptive graph convolutional encoding module, and an emotion classification module. Our model surpasses several state-of-the-art methods on publicly available benchmark datasets and achieves high recognition accuracy.
arXiv Detail & Related papers (2025-01-25T03:53:53Z)
- Hierarchical Banzhaf Interaction for General Video-Language Representation Learning [60.44337740854767]
Multimodal representation learning plays an important role in the artificial intelligence domain. We introduce a new approach that models video-text as game players using multivariate cooperative game theory. We extend our original structure into a flexible encoder-decoder framework, enabling the model to adapt to various downstream tasks.
arXiv Detail & Related papers (2024-12-30T14:09:15Z)
- DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout.
DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder.
Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z)
- AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features.
Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics.
arXiv Detail & Related papers (2024-04-12T11:31:18Z)
- Joint Multimodal Transformer for Emotion Recognition in the Wild [49.735299182004404]
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems.
This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention.
arXiv Detail & Related papers (2024-03-15T17:23:38Z)
- AMuSE: Adaptive Multimodal Analysis for Speaker Emotion Recognition in Group Conversations [39.79734528362605]
The Multimodal Attention Network captures cross-modal interactions at various levels of spatial abstraction.
The AMuSE model condenses both spatial and temporal features into two dense descriptors: speaker-level and utterance-level.
arXiv Detail & Related papers (2024-01-26T19:17:05Z)
- MMoE: Enhancing Multimodal Models with Mixtures of Multimodal Interaction Experts [92.76662894585809]
We introduce an approach to enhance multimodal models, which we call Multimodal Mixtures of Experts (MMoE).
MMoE can be applied to various types of models to gain improvements.
arXiv Detail & Related papers (2023-11-16T05:31:21Z)
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
The multimodal entity linking (MEL) task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)
- InterMulti: Multi-view Multimodal Interactions with Text-dominated Hierarchical High-order Fusion for Emotion Analysis [10.048903012988882]
We propose a multimodal emotion analysis framework, InterMulti, to capture complex multimodal interactions from different views.
Our proposed framework decomposes signals of different modalities into three kinds of multimodal interaction representations.
The Text-dominated Hierarchical High-order Fusion (THHF) module integrates the above three kinds of representations into a comprehensive multimodal interaction representation.
arXiv Detail & Related papers (2022-12-20T07:02:32Z)
- MM-DFN: Multimodal Dynamic Fusion Network for Emotion Recognition in Conversations [5.5997926295092295]
Multimodal Emotion Recognition in Conversations (ERC) has considerable prospects for developing empathetic machines.
Recent graph-based fusion methods aggregate multimodal information by exploring unimodal and cross-modal interactions in a graph.
We propose a novel Multimodal Dynamic Fusion Network (MM-DFN) to recognize emotions by fully understanding multimodal conversational context.
arXiv Detail & Related papers (2022-03-04T15:42:53Z)
- High-Modality Multimodal Transformer: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning [112.51498431119616]
This paper studies efficient representation learning for high-modality scenarios involving a large set of diverse modalities.
A single model, HighMMT, scales up to 10 modalities (text, image, audio, video, sensors, proprioception, speech, time-series, sets, and tables) and 15 tasks from 5 research areas.
arXiv Detail & Related papers (2022-03-02T18:56:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.