Is Cross-Attention Preferable to Self-Attention for Multi-Modal Emotion
Recognition?
- URL: http://arxiv.org/abs/2202.09263v1
- Date: Fri, 18 Feb 2022 15:44:14 GMT
- Title: Is Cross-Attention Preferable to Self-Attention for Multi-Modal Emotion
Recognition?
- Authors: Vandana Rajan, Alessio Brutti, Andrea Cavallaro
- Abstract summary: Cross-modal attention is seen as an effective mechanism for multi-modal fusion.
We implement and compare a cross-attention and a self-attention model.
We compare the models using different modality combinations for a 7-class emotion classification task.
- Score: 36.67937514793215
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans express their emotions via facial expressions, voice intonation and
word choices. To infer the nature of the underlying emotion, recognition models
may use a single modality, such as vision, audio, or text, or a combination of
modalities. Generally, models that fuse complementary information from multiple
modalities outperform their uni-modal counterparts. However, a successful model
that fuses modalities requires components that can effectively aggregate
task-relevant information from each modality. As cross-modal attention is seen
as an effective mechanism for multi-modal fusion, in this paper we quantify the
gain that such a mechanism brings compared to the corresponding self-attention
mechanism. To this end, we implement and compare a cross-attention and a
self-attention model. In addition to attention, each model uses convolutional
layers for local feature extraction and recurrent layers for global sequential
modelling. We compare the models using different modality combinations for a
7-class emotion classification task using the IEMOCAP dataset. Experimental
results indicate that, although both models improve upon the state of the art in
terms of weighted and unweighted accuracy for tri- and bi-modal configurations,
their performance is generally statistically comparable. The code to replicate
the experiments is available at https://github.com/smartcameras/SelfCrossAttn
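A minimal sketch may help make the contrast concrete. As the abstract describes, both models share the same front end (convolutional layers for local feature extraction, recurrent layers for global sequential modelling) and differ only in whether the attention queries and the keys/values come from the same modality. The PyTorch code below is an illustrative sketch of that pipeline for two modalities; the layer sizes, the GRU choice, and all names are assumptions rather than the authors' exact configuration (see the linked repository for the reference implementation).

```python
# Hypothetical sketch of the two fusion variants compared in the paper.
# Dimensions, GRU choice, and head counts are assumptions, not the authors' setup.
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Conv1d for local features, bidirectional GRU for global sequential modelling."""
    def __init__(self, in_dim, hid_dim=128):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, hid_dim, kernel_size=3, padding=1)
        self.rnn = nn.GRU(hid_dim, hid_dim, batch_first=True, bidirectional=True)

    def forward(self, x):               # x: (batch, time, in_dim)
        h = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        out, _ = self.rnn(h)            # (batch, time, 2 * hid_dim)
        return out

class AttentionFusion(nn.Module):
    """Fuses two modality sequences with either self- or cross-attention."""
    def __init__(self, dim=256, heads=4, mode="cross", num_classes=7):
        super().__init__()
        self.mode = mode
        self.attn_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, a, b):            # a, b: (batch, time, dim)
        if self.mode == "cross":
            # each modality queries the other one
            a_att, _ = self.attn_a(query=a, key=b, value=b)
            b_att, _ = self.attn_b(query=b, key=a, value=a)
        else:
            # each modality attends to itself; fusion happens only at the classifier
            a_att, _ = self.attn_a(query=a, key=a, value=a)
            b_att, _ = self.attn_b(query=b, key=b, value=b)
        pooled = torch.cat([a_att.mean(dim=1), b_att.mean(dim=1)], dim=-1)
        return self.classifier(pooled)  # logits for the 7 emotion classes

# Usage with made-up audio/text feature sizes
audio_enc, text_enc = ModalityEncoder(40), ModalityEncoder(300)
fusion = AttentionFusion(mode="cross")          # or mode="self"
audio = torch.randn(8, 120, 40)                 # (batch, frames, acoustic dims)
text = torch.randn(8, 30, 300)                  # (batch, tokens, embedding dims)
logits = fusion(audio_enc(audio), text_enc(text))
```

In the self-attention variant the modalities interact only at the final classifier, whereas in the cross-attention variant each modality's queries attend over the other modality's sequence; that single difference is the gain the paper quantifies.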
Related papers
- Towards a Generalist and Blind RGB-X Tracker [91.36268768952755]
We develop a single-model tracker that can remain blind to any modality X at inference time.
Our training process is extremely simple, integrating a multi-label classification loss with a routing function.
Our generalist and blind tracker achieves performance competitive with well-established modality-specific models.
arXiv Detail & Related papers (2024-05-28T03:00:58Z) - U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation
Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - Cross-Language Speech Emotion Recognition Using Multimodal Dual
Attention Transformers [5.538923337818467]
State-of-the-art systems are unable to achieve improved performance in cross-language settings.
We propose a Multimodal Dual Attention Transformer model to improve cross-language SER.
arXiv Detail & Related papers (2023-06-23T22:38:32Z) - Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion [54.33764537135906]
VideoQA Transformer models demonstrate competitive performance on standard benchmarks.
Do these models capture the rich multimodal structures and dynamics from video and text jointly?
Are they achieving high scores by exploiting biases and spurious features?
arXiv Detail & Related papers (2023-06-15T06:45:46Z) - A Self-Adjusting Fusion Representation Learning Model for Unaligned
Text-Audio Sequences [16.38826799727453]
How to integrate relevant information of each modality to learn fusion representations has been one of the central challenges in multimodal learning.
In this paper, a Self-Adjusting Fusion Representation Learning Model is proposed to learn robust crossmodal fusion representations directly from the unaligned text and audio sequences.
Experimental results show that our model significantly improves performance on all metrics for unaligned text-audio sequences.
arXiv Detail & Related papers (2022-11-12T13:05:28Z) - Multimodal End-to-End Group Emotion Recognition using Cross-Modal
Attention [0.0]
Classifying group-level emotions is a challenging task due to the complexity of video.
Our model achieves a best validation accuracy of 60.37%, which is approximately 8.5% higher than the VGAF dataset baseline.
arXiv Detail & Related papers (2021-11-10T19:19:26Z) - Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
arXiv Detail & Related papers (2021-06-30T22:44:12Z) - Does my multimodal model learn cross-modal interactions? It's harder to
tell than you might think! [26.215781778606168]
Cross-modal modeling seems crucial in multimodal tasks, such as visual question answering.
We propose a new diagnostic tool, empirical multimodally-additive function projection (EMAP), for isolating whether or not cross-modal interactions improve performance for a given model on a given task.
For seven image+text classification tasks (on each of which we set new state-of-the-art benchmarks), we find that, in many cases, removing cross-modal interactions results in little to no performance degradation.
arXiv Detail & Related papers (2020-10-13T17:45:28Z)
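A brief note on the EMAP diagnostic mentioned in the last entry: it replaces a multimodal model's predictions with their closest additive approximation, obtained by averaging predictions over mismatched modality pairings, and then checks how much accuracy is lost. The NumPy sketch below is one reading of that projection, under the assumption that a full table of pairwise predictions `model_logits[i, j]` (text i paired with image j) has already been computed; the function and variable names are illustrative, not taken from the paper's code.

```python
# Hedged sketch of empirical multimodally-additive function projection (EMAP).
# Assumes model_logits[i, j] holds the model's logits for text i paired with image j
# over all N x N pairings; names and setup are illustrative, not from the paper's code.
import numpy as np

def emap_logits(model_logits: np.ndarray) -> np.ndarray:
    """Project N x N x C pairwise logits onto the closest additive (text-only + image-only) function."""
    text_term = model_logits.mean(axis=1)          # (N, C): average over all images for each text
    image_term = model_logits.mean(axis=0)         # (N, C): average over all texts for each image
    global_term = model_logits.mean(axis=(0, 1))   # (C,):   grand mean over all pairings
    # EMAP prediction for the matched pair (i, i)
    return text_term + image_term - global_term    # (N, C)

# Toy check with a purely additive model: the projection reproduces it exactly.
N, C = 100, 7
pairwise = np.random.randn(N, C)[:, None, :] + np.random.randn(N, C)[None, :, :]
projected = emap_logits(pairwise)
matched = np.array([pairwise[i, i] for i in range(N)])
print(np.abs(projected - matched).max())  # ~0 for an additive model
```

If the projected predictions score about as well as the original ones on the matched pairs, the model's performance does not rely on cross-modal interactions, which mirrors the finding reported above for several image+text benchmarks.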
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.