Group Gated Fusion on Attention-based Bidirectional Alignment for
Multimodal Emotion Recognition
- URL: http://arxiv.org/abs/2201.06309v1
- Date: Mon, 17 Jan 2022 09:46:59 GMT
- Title: Group Gated Fusion on Attention-based Bidirectional Alignment for
Multimodal Emotion Recognition
- Authors: Pengfei Liu, Kun Li and Helen Meng
- Abstract summary: This paper presents a new model named Gated Bidirectional Alignment Network (GBAN), which consists of an attention-based bidirectional alignment network over LSTM hidden states.
We empirically show that the attention-aligned representations outperform the last-hidden-states of LSTM significantly.
The proposed GBAN model outperforms existing state-of-the-art multimodal approaches on the IEMOCAP dataset.
- Score: 63.07844685982738
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Emotion recognition is a challenging and actively-studied research area that
plays a critical role in emotion-aware human-computer interaction systems. In a
multimodal setting, temporal alignment between different modalities has not
been well investigated yet. This paper presents a new model named Gated
Bidirectional Alignment Network (GBAN), which consists of an attention-based
bidirectional alignment network over LSTM hidden states to explicitly capture
the alignment relationship between speech and text, and a novel group gated
fusion (GGF) layer to integrate the representations of different modalities. We
empirically show that the attention-aligned representations outperform the
last-hidden-states of LSTM significantly, and the proposed GBAN model
outperforms existing state-of-the-art multimodal approaches on the IEMOCAP
dataset.
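The two ideas named in the abstract, attention-based alignment in both directions and gated fusion of the resulting representations, can be illustrated with a minimal NumPy sketch. This is not the authors' GBAN or GGF implementation: the dot-product attention scores, the mean pooling, the single sigmoid gate, and all function names, weight matrices, and dimensions below are assumptions chosen only for illustration, and random arrays stand in for LSTM hidden states.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bidirectional_align(speech, text):
    """Align two sequences of hidden states in both directions.

    speech: (Ts, d) array, text: (Tt, d) array.
    Dot-product scoring is an assumption, not taken from the paper.
    """
    scores = speech @ text.T                              # (Ts, Tt)
    text_for_speech = softmax(scores, axis=1) @ text      # (Ts, d)
    speech_for_text = softmax(scores.T, axis=1) @ speech  # (Tt, d)
    return text_for_speech, speech_for_text

def gated_fusion(a, b, w_a, w_b, w_z):
    """Generic gated fusion of two vectors (illustrative, not the paper's GGF)."""
    h_a = np.tanh(a @ w_a)
    h_b = np.tanh(b @ w_b)
    # The gate z in (0, 1) decides how much each modality contributes.
    z = 1.0 / (1.0 + np.exp(-(np.concatenate([a, b]) @ w_z)))
    return z * h_a + (1.0 - z) * h_b

rng = np.random.default_rng(0)
d = 8
speech = rng.standard_normal((20, d))  # stand-in for speech LSTM states
text = rng.standard_normal((6, d))     # stand-in for text LSTM states

text_for_speech, speech_for_text = bidirectional_align(speech, text)

# Mean pooling over time is an assumed way to get one vector per direction.
a = text_for_speech.mean(axis=0)
b = speech_for_text.mean(axis=0)

w_a = rng.standard_normal((d, d))
w_b = rng.standard_normal((d, d))
w_z = rng.standard_normal((2 * d, d))
fused = gated_fusion(a, b, w_a, w_b, w_z)
print(fused.shape)
```

Each attention direction yields a sequence with the length of the querying modality, so the speech side keeps 20 steps and the text side keeps 6 before pooling.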
Related papers
- Tracing Intricate Cues in Dialogue: Joint Graph Structure and Sentiment Dynamics for Multimodal Emotion Recognition [37.12407597998884]
A novel approach named GraphSmile is proposed for tracking intricate emotional cues in multimodal dialogues.
GraphSmile comprises two key components, i.e., GSF and SDP modules.
Empirical results on multiple benchmarks demonstrate that GraphSmile can handle complex emotional and sentimental patterns.
arXiv Detail & Related papers (2024-07-31T11:47:36Z)
- Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation [12.455034591553506]
Multimodal Emotion Recognition in Conversation (MERC) can be applied to public opinion monitoring, intelligent dialogue robots, and other fields.
Previous work ignored the inter-modal alignment process and the intra-modal noise information before multimodal fusion.
We have developed a novel approach called Masked Graph Learning with Recurrent Alignment (MGLRA) to tackle this problem.
arXiv Detail & Related papers (2024-07-23T02:23:51Z)
- AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features.
Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics.
arXiv Detail & Related papers (2024-04-12T11:31:18Z)
- Joint Multimodal Transformer for Emotion Recognition in the Wild [49.735299182004404]
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems.
This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention.
arXiv Detail & Related papers (2024-03-15T17:23:38Z)
- Improving Anomaly Segmentation with Multi-Granularity Cross-Domain Alignment [17.086123737443714]
Anomaly segmentation plays a pivotal role in identifying atypical objects in images, crucial for hazard detection in autonomous driving systems.
While existing methods demonstrate noteworthy results on synthetic data, they often fail to consider the disparity between synthetic and real-world data domains.
We introduce the Multi-Granularity Cross-Domain Alignment framework, tailored to harmonize features across domains at both the scene and individual sample levels.
arXiv Detail & Related papers (2023-08-16T22:54:49Z)
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
Multimodal entity linking task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)
- MIR-GAN: Refining Frame-Level Modality-Invariant Representations with Adversarial Network for Audio-Visual Speech Recognition [23.042478625584653]
We propose an adversarial network to refine frame-level modality-invariant representations (MIR-GAN).
arXiv Detail & Related papers (2023-06-18T14:02:20Z)
- Fusion with Hierarchical Graphs for Mulitmodal Emotion Recognition [7.147235324895931]
This paper proposes a novel hierarchical graph network (HFGCN) model that learns more informative multimodal representations.
Specifically, the proposed model fuses multimodality inputs using a two-stage graph construction approach and encodes the modality dependencies into the conversation representation.
Experiments showed the effectiveness of our proposed model for more accurate AER, which yielded state-of-the-art results on two public datasets.
arXiv Detail & Related papers (2021-09-15T08:21:01Z)
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
- Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-Identification [208.1227090864602]
Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem.
Existing VI-ReID methods tend to learn global representations, which have limited discriminability and weak robustness to noisy images.
We propose a novel dynamic dual-attentive aggregation (DDAG) learning method by mining both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID.
arXiv Detail & Related papers (2020-07-18T03:08:13Z)