Related papers: Gaussian Adaptive Attention is All You Need: Robust Contextual Representations Across Multiple Modalities

Gaussian Adaptive Attention is All You Need: Robust Contextual Representations Across Multiple Modalities

URL: http://arxiv.org/abs/2401.11143v3
Date: Wed, 31 Jan 2024 01:22:43 GMT
Title: Gaussian Adaptive Attention is All You Need: Robust Contextual Representations Across Multiple Modalities
Authors: Georgios Ioannides, Aman Chadha, Aaron Elkins
Abstract summary: We propose the Multi-Head Gaussian Adaptive Attention Mechanism (GAAM), a novel probabilistic attention framework. GAAM integrates learnable mean and variance into its attention mechanism, implemented in a Multi-Headed framework. We introduce the Importance Factor (IF), a new learning-based metric that enhances the explainability of models trained with GAAM-based methods.
Score: 1.03590082373586
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We propose the Multi-Head Gaussian Adaptive Attention Mechanism (GAAM), a novel probabilistic attention framework, and the Gaussian Adaptive Transformer (GAT), designed to enhance information aggregation across multiple modalities, including Speech, Text and Vision. GAAM integrates learnable mean and variance into its attention mechanism, implemented in a Multi-Headed framework enabling it to collectively model any Probability Distribution for dynamic recalibration of feature significance. This method demonstrates significant improvements, especially with highly non-stationary data, surpassing the state-of-the-art attention techniques in model performance (up to approximately +20% in accuracy) by identifying key elements within the feature space. GAAM's compatibility with dot-product-based attention models and relatively low number of parameters showcases its adaptability and potential to boost existing attention frameworks. Empirically, GAAM exhibits superior adaptability and efficacy across a diverse range of tasks, including emotion recognition in speech, image classification, and text classification, thereby establishing its robustness and versatility in handling multi-modal data. Furthermore, we introduce the Importance Factor (IF), a new learning-based metric that enhances the explainability of models trained with GAAM-based methods. Overall, GAAM represents an advancement towards development of better performing and more explainable attention models across multiple modalities.

Related papers

AdaptiSent: Context-Aware Adaptive Attention for Multimodal Aspect-Based Sentiment Analysis [1.0858565995100635]
We introduce AdaptiSent, a new framework for Multimodal Aspect-Based Sentiment Analysis (MABSA)<n>Our model integrates dynamic modality weighting and context-adaptive attention, enhancing the extraction of sentiment and aspect-related information.<n>Results from standard Twitter datasets show that AdaptiSent surpasses existing models in precision, recall, and F1 score.
arXiv Detail & Related papers (2025-07-17T00:06:43Z)
Learning to Fuse: Modality-Aware Adaptive Scheduling for Robust Multimodal Foundation Models [0.0]
Modality-Aware Adaptive Fusion Scheduling (MA-AFS) learns to dynamically modulate the contribution of each modality on a per-instance basis.<n>Our work highlights the importance of adaptive fusion and opens a promising direction toward reliable and uncertainty-aware multimodal learning.
arXiv Detail & Related papers (2025-06-15T05:57:45Z)
UNEM: UNrolled Generalized EM for Transductive Few-Shot Learning [35.62208317531141]
We advocate and introduce the unrolling paradigm, also referred to as "learning to optimize" Our unrolling approach covers various statistical feature distributions and pre-training paradigms. We report comprehensive experiments, which cover a breadth of fine-grained downstream image classification tasks.
arXiv Detail & Related papers (2024-12-21T19:01:57Z)
GAMED: Knowledge Adaptive Multi-Experts Decoupling for Multimodal Fake News Detection [18.157900272828602]
Multimodal fake news detection often involves modelling heterogeneous data sources, such as vision and language. This paper develops a significantly novel approach, GAMED, for multimodal modelling. It focuses on generating distinctive and discriminative features through modal decoupling to enhance cross-modal synergies.
arXiv Detail & Related papers (2024-12-11T19:12:22Z)
MCSFF: Multi-modal Consistency and Specificity Fusion Framework for Entity Alignment [7.109735168520378]
Multi-modal entity alignment (MMEA) is essential for enhancing knowledge graphs and improving question-answering systems. Existing methods often focus on integrating modalities through their complementarity but overlook the specificity of each modality. We propose the Multi-modal Consistency and Specificity Fusion Framework (MCSFF), which innovatively integrates both complementary and specific aspects of modalities.
arXiv Detail & Related papers (2024-10-18T16:35:25Z)
MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct [148.39859547619156]
We propose MMEvol, a novel multimodal instruction data evolution framework. MMEvol iteratively improves data quality through a refined combination of fine-grained perception, cognitive reasoning, and interaction evolution. Our approach reaches state-of-the-art (SOTA) performance in nine tasks using significantly less data compared to state-of-the-art models.
arXiv Detail & Related papers (2024-09-09T17:44:00Z)
AdapMTL: Adaptive Pruning Framework for Multitask Learning Model [5.643658120200373]
AdapMTL is an adaptive pruning framework for multitask models. It balances sparsity allocation and accuracy performance across multiple tasks. It showcases superior performance compared to state-of-the-art pruning methods.
arXiv Detail & Related papers (2024-08-07T17:19:15Z)
Pre-training Feature Guided Diffusion Model for Speech Enhancement [37.88469730135598]
Speech enhancement significantly improves the clarity and intelligibility of speech in noisy environments. We introduce a novel pretraining feature-guided diffusion model tailored for efficient speech enhancement.
arXiv Detail & Related papers (2024-06-11T18:22:59Z)
Modality Prompts for Arbitrary Modality Salient Object Detection [57.610000247519196]
This paper delves into the task of arbitrary modality salient object detection (AM SOD) It aims to detect salient objects from arbitrary modalities, eg RGB images, RGB-D images, and RGB-D-T images. A novel modality-adaptive Transformer (MAT) will be proposed to investigate two fundamental challenges of AM SOD.
arXiv Detail & Related papers (2024-05-06T11:02:02Z)
MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications. Recent methods have focused on exploiting advances of self-supervised learning (SSL) for pre-training of strong multimodal encoders. We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z)
When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique. Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z)
Learning Language-guided Adaptive Hyper-modality Representation for Multimodal Sentiment Analysis [22.012103941836838]
We present Adaptive Language-guided Multimodal Transformer (ALMT) ALMT incorporates an Adaptive Hyper-modality Learning (AHL) module to learn an irrelevance/conflict-suppressing representation. ALMT achieves state-of-the-art performance on several popular datasets.
arXiv Detail & Related papers (2023-10-09T15:43:07Z)
Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks. Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment. We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
Trusted Multi-View Classification with Dynamic Evidential Fusion [73.35990456162745]
We propose a novel multi-view classification algorithm, termed trusted multi-view classification (TMC) TMC provides a new paradigm for multi-view learning by dynamically integrating different views at an evidence level. Both theoretical and experimental results validate the effectiveness of the proposed model in accuracy, robustness and trustworthiness.
arXiv Detail & Related papers (2022-04-25T03:48:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.