Group Gated Fusion on Attention-based Bidirectional Alignment for
  Multimodal Emotion Recognition
- URL: http://arxiv.org/abs/2201.06309v1
- Date: Mon, 17 Jan 2022 09:46:59 GMT
- Title: Group Gated Fusion on Attention-based Bidirectional Alignment for
  Multimodal Emotion Recognition
- Authors: Pengfei Liu, Kun Li and Helen Meng
- Abstract summary: This paper presents a new model, the Gated Bidirectional Alignment Network (GBAN), which consists of an attention-based bidirectional alignment network over LSTM hidden states.
We empirically show that the attention-aligned representations significantly outperform the last hidden states of the LSTM.
The proposed GBAN model outperforms existing state-of-the-art multimodal approaches on the IEMOCAP dataset.
- Score: 63.07844685982738
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Emotion recognition is a challenging and actively studied research area that
plays a critical role in emotion-aware human-computer interaction systems. In a
multimodal setting, temporal alignment between different modalities has not yet
been well investigated. This paper presents a new model, the Gated
Bidirectional Alignment Network (GBAN), which consists of an attention-based
bidirectional alignment network over LSTM hidden states that explicitly captures
the alignment relationship between speech and text, and a novel group gated
fusion (GGF) layer that integrates the representations of different modalities. We
empirically show that the attention-aligned representations significantly outperform
the last hidden states of the LSTM, and that the proposed GBAN model
outperforms existing state-of-the-art multimodal approaches on the IEMOCAP
dataset.
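
The two ideas named in the abstract, attention-based alignment in both directions and a gated combination of modalities, can be illustrated with a minimal sketch. This is not the GBAN implementation: the function names, the dot-product scoring, the mean pooling, and the single scalar gate are all assumptions made for illustration, and the gate is a simplified stand-in for the paper's group gated fusion layer, whose exact form is not given in the abstract. The toy vectors stand in for LSTM hidden states.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def align(queries, keys):
    # For each query state, return an attention-weighted sum of the key states.
    # Dot-product scoring is an assumption, not taken from the paper.
    dim = len(keys[0])
    aligned = []
    for q in queries:
        weights = softmax([dot(q, k) for k in keys])
        aligned.append([sum(w * k[d] for w, k in zip(weights, keys))
                        for d in range(dim)])
    return aligned

def mean_pool(states):
    # Average a sequence of vectors into one vector (hypothetical pooling choice).
    n = len(states)
    return [sum(s[d] for s in states) / n for d in range(len(states[0]))]

def gated_fusion(a, b, gate_weights, gate_bias=0.0):
    # Scalar sigmoid gate computed from the concatenation [a; b];
    # returns g * a + (1 - g) * b. A simplified stand-in for GGF.
    z = dot(gate_weights, a + b) + gate_bias
    g = 1.0 / (1.0 + math.exp(-z))
    return [g * x + (1.0 - g) * y for x, y in zip(a, b)]

# Toy "hidden states": 3 speech steps and 2 text steps, each 2-dimensional.
speech_states = [[0.1, 0.3], [0.5, -0.2], [0.0, 0.4]]
text_states = [[0.2, 0.1], [-0.3, 0.6]]

# Bidirectional alignment: text summarised per speech step, and vice versa.
text_aligned_to_speech = align(speech_states, text_states)
speech_aligned_to_text = align(text_states, speech_states)

fused = gated_fusion(mean_pool(text_aligned_to_speech),
                     mean_pool(speech_aligned_to_text),
                     gate_weights=[0.5, -0.5, 0.25, 0.25])
print(fused)
```

In a trained model the gate weights would be learned and the inputs would come from the speech and text encoders; here they are fixed by hand only so that the sketch runs.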

Related papers
- GIA-MIC: Multimodal Emotion Recognition with Gated Interactive Attention and Modality-Invariant Learning Constraints [24.242098942377574]
 Multimodal emotion recognition (MER) extracts emotions from multimodal data, including visual, speech, and text inputs, playing a key role in human-computer interaction.
We propose a gated interactive attention mechanism to adaptively extract modality-specific features while enhancing emotional information through pairwise interactions.
Experiments on IEMOCAP demonstrate that our method outperforms state-of-the-art MER approaches, achieving WA 80.7% and UA 81.3%.
 arXiv  Detail & Related papers  (2025-06-01T07:07:02Z)
- Enhancing Multimodal Emotion Recognition through Multi-Granularity Cross-Modal Alignment [10.278127492434297]
 This paper introduces a Multi-Granularity Cross-Modal Alignment (MGCMA) framework, distinguished by its comprehensive approach encompassing distribution-based, instance-based, and token-based alignment modules.
Our experiments on IEMOCAP demonstrate that our proposed method outperforms current state-of-the-art techniques.
 arXiv  Detail & Related papers  (2024-12-30T09:30:41Z)
- WavFusion: Towards wav2vec 2.0 Multimodal Speech Emotion Recognition [2.3367170233149324]
 We propose WavFusion, a multimodal speech emotion recognition framework.
WavFusion addresses critical research problems in effective multimodal fusion, heterogeneity among modalities, and discriminative representation learning.
Our work highlights the importance of capturing nuanced cross-modal interactions and learning discriminative representations for accurate multimodal SER.
 arXiv  Detail & Related papers  (2024-12-07T06:43:39Z)
- Tracing Intricate Cues in Dialogue: Joint Graph Structure and Sentiment Dynamics for Multimodal Emotion Recognition [37.12407597998884]
 A novel approach named GraphSmile is proposed for tracking intricate emotional cues in multimodal dialogues.
GraphSmile comprises two key components, i.e., GSF and SDP modules.
 Empirical results on multiple benchmarks demonstrate that GraphSmile can handle complex emotional and sentimental patterns.
 arXiv  Detail & Related papers  (2024-07-31T11:47:36Z)
- Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation [12.455034591553506]
 Multimodal Emotion Recognition in Conversation (MERC) can be applied to public opinion monitoring, intelligent dialogue robots, and other fields.
Previous work ignored the inter-modal alignment process and the intra-modal noise information before multimodal fusion.
We have developed a novel approach called Masked Graph Learning with Recurrent Alignment (MGLRA) to tackle this problem.
 arXiv  Detail & Related papers  (2024-07-23T02:23:51Z)
- AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
 We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features.
Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics.
 arXiv  Detail & Related papers  (2024-04-12T11:31:18Z)
- Joint Multimodal Transformer for Emotion Recognition in the Wild [49.735299182004404]
 Multimodal emotion recognition (MMER) systems typically outperform unimodal systems.
This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention.
 arXiv  Detail & Related papers  (2024-03-15T17:23:38Z)
- Improving Anomaly Segmentation with Multi-Granularity Cross-Domain Alignment [17.086123737443714]
 Anomaly segmentation plays a pivotal role in identifying atypical objects in images, crucial for hazard detection in autonomous driving systems.
While existing methods demonstrate noteworthy results on synthetic data, they often fail to consider the disparity between synthetic and real-world data domains.
We introduce the Multi-Granularity Cross-Domain Alignment framework, tailored to harmonize features across domains at both the scene and individual sample levels.
 arXiv  Detail & Related papers  (2023-08-16T22:54:49Z)
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
 Multimodal entity linking task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
 arXiv  Detail & Related papers  (2023-07-19T02:11:19Z)
- MIR-GAN: Refining Frame-Level Modality-Invariant Representations with Adversarial Network for Audio-Visual Speech Recognition [23.042478625584653]
 We propose an adversarial network to refine frame-level modality-invariant representations (MIR-GAN).
 arXiv  Detail & Related papers  (2023-06-18T14:02:20Z)
- Fusion with Hierarchical Graphs for Mulitmodal Emotion Recognition [7.147235324895931]
 This paper proposes a novel hierarchical graph network (HFGCN) model that learns more informative multimodal representations.
Specifically, the proposed model fuses multimodality inputs using a two-stage graph construction approach and encodes the modality dependencies into the conversation representation.
Experiments showed the effectiveness of our proposed model for more accurate AER, which yielded state-of-the-art results on two public datasets.
 arXiv  Detail & Related papers  (2021-09-15T08:21:01Z)
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
 Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input because of the known information imbalance among modalities.
 arXiv  Detail & Related papers  (2021-07-28T23:33:42Z)
- Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-Identification [208.1227090864602]
 Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem.
Existing VI-ReID methods tend to learn global representations, which have limited discriminability and weak robustness to noisy images.
We propose a novel dynamic dual-attentive aggregation (DDAG) learning method by mining both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID.
 arXiv  Detail & Related papers  (2020-07-18T03:08:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.