TACOformer: Token-channel compounded Cross Attention for Multimodal Emotion Recognition
- URL: http://arxiv.org/abs/2306.13592v2
- Date: Mon, 21 Aug 2023 16:37:46 GMT
- Title: TACOformer: Token-channel compounded Cross Attention for Multimodal Emotion Recognition
- Authors: Xinda Li
- Abstract summary: We propose a comprehensive perspective of multimodal fusion that integrates channel-level and token-level cross-modal interactions.
Specifically, we introduce a unified cross attention module called Token-chAnnel COmpound (TACO) Cross Attention.
We also propose a 2D position encoding method to preserve information about the spatial distribution of EEG signal channels.
- Score: 0.951828574518325
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, emotion recognition based on physiological signals has
emerged as a field of intensive research. The utilization of multi-modal, multi-channel
physiological signals has significantly improved the performance of emotion
recognition systems, due to their complementarity. However, effectively
integrating emotion-related semantic information from different modalities and
capturing inter-modal dependencies remains a challenging issue. Many existing
multimodal fusion methods ignore either token-to-token or channel-to-channel
correlations of multichannel signals from different modalities, which limits
the classification capability of the models to some extent. In this paper, we
propose a comprehensive perspective of multimodal fusion that integrates
channel-level and token-level cross-modal interactions. Specifically, we
introduce a unified cross attention module called Token-chAnnel COmpound (TACO)
Cross Attention to perform multimodal fusion, which simultaneously models
channel-level and token-level dependencies between modalities. Additionally, we
propose a 2D position encoding method to preserve information about the spatial
distribution of EEG signal channels. We then use two transformer encoders ahead
of the fusion module to capture long-term temporal dependencies from the EEG
signal and the peripheral physiological signal, respectively.
Subject-independent experiments on the emotional datasets DEAP and DREAMER
demonstrate that the proposed model achieves state-of-the-art performance.
Related papers
- Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation [12.455034591553506]
Multimodal Emotion Recognition in Conversation (MERC) can be applied to public opinion monitoring, intelligent dialogue robots, and other fields.
Previous work ignored the inter-modal alignment process and the intra-modal noise information before multimodal fusion.
We have developed a novel approach called Masked Graph Learning with Recurrent Alignment (MGLRA) to tackle this problem.
arXiv Detail & Related papers (2024-07-23T02:23:51Z) - Multimodal Physiological Signals Representation Learning via Multiscale Contrasting for Depression Recognition [18.65975882665568]
Depression recognition based on physiological signals such as functional near-infrared spectroscopy (fNIRS) and electroencephalogram (EEG) has made considerable progress.
In this paper, we introduce a multimodal physiological signal representation learning framework via multiscale contrasting for depression recognition (MRLM).
To enhance the learning of semantic representation associated with stimulation tasks, a semantic contrast module is proposed.
arXiv Detail & Related papers (2024-06-22T09:28:02Z) - Modality Prompts for Arbitrary Modality Salient Object Detection [57.610000247519196]
This paper delves into the task of arbitrary modality salient object detection (AM SOD).
It aims to detect salient objects from arbitrary modalities, e.g., RGB images, RGB-D images, and RGB-D-T images.
A novel modality-adaptive Transformer (MAT) is proposed to investigate two fundamental challenges of AM SOD.
arXiv Detail & Related papers (2024-05-06T11:02:02Z) - MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications.
Recent methods have focused on exploiting advances in self-supervised learning (SSL) for pre-training strong multimodal encoders.
We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z) - Joint Multimodal Transformer for Emotion Recognition in the Wild [49.735299182004404]
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems.
This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention.
arXiv Detail & Related papers (2024-03-15T17:23:38Z) - Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach that mines cross-modal semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z) - Transformer-based Self-supervised Multimodal Representation Learning for
Wearable Emotion Recognition [2.4364387374267427]
We propose a novel self-supervised learning (SSL) framework for wearable emotion recognition.
Our method achieved state-of-the-art results in various emotion classification tasks.
arXiv Detail & Related papers (2023-03-29T19:45:55Z) - Group Gated Fusion on Attention-based Bidirectional Alignment for
Multimodal Emotion Recognition [63.07844685982738]
This paper presents a new model named Gated Bidirectional Alignment Network (GBAN), which consists of an attention-based bidirectional alignment network over LSTM hidden states.
We empirically show that the attention-aligned representations significantly outperform the last hidden states of the LSTM.
The proposed GBAN model outperforms existing state-of-the-art multimodal approaches on the IEMOCAP dataset.
arXiv Detail & Related papers (2022-01-17T09:46:59Z) - Deep Multimodal Fusion by Channel Exchanging [87.40768169300898]
This paper proposes a parameter-free multimodal fusion framework that dynamically exchanges channels between sub-networks of different modalities; a minimal sketch of this exchanging idea appears after this list.
The validity of the exchanging process is guaranteed by sharing convolutional filters while keeping separate BN layers across modalities, which, as an added benefit, allows the multimodal architecture to be almost as compact as a unimodal network.
arXiv Detail & Related papers (2020-11-10T09:53:20Z) - Low Rank Fusion based Transformers for Multimodal Sequences [9.507869508188266]
We present two methods for multimodal sentiment and emotion recognition, with results on the CMU-MOSEI, CMU-MOSI, and IEMOCAP datasets.
We show that our models have fewer parameters, train faster, and perform comparably to many larger fusion-based architectures.
arXiv Detail & Related papers (2020-07-04T08:05:40Z)