Gaussian Adaptive Attention is All You Need: Robust Contextual
Representations Across Multiple Modalities
- URL: http://arxiv.org/abs/2401.11143v3
- Date: Wed, 31 Jan 2024 01:22:43 GMT
- Title: Gaussian Adaptive Attention is All You Need: Robust Contextual
Representations Across Multiple Modalities
- Authors: Georgios Ioannides, Aman Chadha, Aaron Elkins
- Abstract summary: We propose the Multi-Head Gaussian Adaptive Attention Mechanism (GAAM), a novel probabilistic attention framework.
GAAM integrates learnable mean and variance into its attention mechanism, implemented in a Multi-Headed framework.
We introduce the Importance Factor (IF), a new learning-based metric that enhances the explainability of models trained with GAAM-based methods.
- Score: 1.03590082373586
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose the Multi-Head Gaussian Adaptive Attention Mechanism (GAAM), a
novel probabilistic attention framework, and the Gaussian Adaptive Transformer
(GAT), designed to enhance information aggregation across multiple modalities,
including Speech, Text and Vision. GAAM integrates learnable mean and variance
into its attention mechanism, implemented in a Multi-Headed framework enabling
it to collectively model any Probability Distribution for dynamic recalibration
of feature significance. This method demonstrates significant improvements,
especially with highly non-stationary data, surpassing the state-of-the-art
attention techniques in model performance (up to approximately +20% in
accuracy) by identifying key elements within the feature space. GAAM's
compatibility with dot-product-based attention models and relatively low number
of parameters showcases its adaptability and potential to boost existing
attention frameworks. Empirically, GAAM exhibits superior adaptability and
efficacy across a diverse range of tasks, including emotion recognition in
speech, image classification, and text classification, thereby establishing its
robustness and versatility in handling multi-modal data. Furthermore, we
introduce the Importance Factor (IF), a new learning-based metric that enhances
the explainability of models trained with GAAM-based methods. Overall, GAAM
represents an advancement towards development of better performing and more
explainable attention models across multiple modalities.
Related papers
- An Information Compensation Framework for Zero-Shot Skeleton-based Action Recognition [49.45660055499103]
Zero-shot human skeleton-based action recognition aims to construct a model that can recognize actions outside the categories seen during training.
Previous research has focused on aligning sequences' visual and semantic spatial distributions.
We introduce a new loss function sampling method to obtain a tight and robust representation.
arXiv Detail & Related papers (2024-06-02T06:53:01Z) - MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications.
Recent methods have focused on exploiting advances of self-supervised learning (SSL) for pre-training of strong multimodal encoders.
We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z) - Learning Language-guided Adaptive Hyper-modality Representation for
Multimodal Sentiment Analysis [22.012103941836838]
We present Adaptive Language-guided Multimodal Transformer (ALMT)
ALMT incorporates an Adaptive Hyper-modality Learning (AHL) module to learn an irrelevance/conflict-suppressing representation.
ALMT achieves state-of-the-art performance on several popular datasets.
arXiv Detail & Related papers (2023-10-09T15:43:07Z) - Enhanced LFTSformer: A Novel Long-Term Financial Time Series Prediction Model Using Advanced Feature Engineering and the DS Encoder Informer Architecture [0.8532753451809455]
This study presents a groundbreaking model for forecasting long-term financial time series, termed the Enhanced LFTSformer.
The model distinguishes itself through several significant innovations.
Systematic experimentation on a range of benchmark stock market datasets demonstrates that the Enhanced LFTSformer outperforms traditional machine learning models.
arXiv Detail & Related papers (2023-10-03T08:37:21Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation
Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - DAT++: Spatially Dynamic Vision Transformer with Deformable Attention [87.41016963608067]
We present Deformable Attention Transformer ( DAT++), a vision backbone efficient and effective for visual recognition.
DAT++ achieves state-of-the-art results on various visual recognition benchmarks, with 85.9% ImageNet accuracy, 54.5 and 47.0 MS-COCO instance segmentation mAP, and 51.5 ADE20K semantic segmentation mIoU.
arXiv Detail & Related papers (2023-09-04T08:26:47Z) - Cross-Language Speech Emotion Recognition Using Multimodal Dual
Attention Transformers [5.538923337818467]
State-of-the-art systems are unable to achieve improved performance in cross-language settings.
We propose a Multimodal Dual Attention Transformer model to improve cross-language SER.
arXiv Detail & Related papers (2023-06-23T22:38:32Z) - IMKGA-SM: Interpretable Multimodal Knowledge Graph Answer Prediction via
Sequence Modeling [3.867363075280544]
Multimodal knowledge graph link prediction aims to improve the accuracy and efficiency of link prediction tasks for multimodal data.
New model is developed, namely Interpretable Multimodal Knowledge Graph Answer Prediction via Sequence Modeling (IMKGA-SM)
Model achieves much better performance than SOTA baselines on multimodal link prediction datasets of different sizes.
arXiv Detail & Related papers (2023-01-06T10:08:11Z) - Correlation Information Bottleneck: Towards Adapting Pretrained
Multimodal Models for Robust Visual Question Answering [63.87200781247364]
Correlation Information Bottleneck (CIB) seeks a tradeoff between compression and redundancy in representations.
We derive a tight theoretical upper bound for the mutual information between multimodal inputs and representations.
arXiv Detail & Related papers (2022-09-14T22:04:10Z) - Trusted Multi-View Classification [76.73585034192894]
We propose a novel multi-view classification method, termed trusted multi-view classification.
It provides a new paradigm for multi-view learning by dynamically integrating different views at an evidence level.
The proposed algorithm jointly utilizes multiple views to promote both classification reliability and robustness.
arXiv Detail & Related papers (2021-02-03T13:30:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.