AGSP-DSA: An Adaptive Graph Signal Processing Framework for Robust Multimodal Fusion with Dynamic Semantic Alignment
- URL: http://arxiv.org/abs/2601.18589v1
- Date: Mon, 26 Jan 2026 15:35:03 GMT
- Title: AGSP-DSA: An Adaptive Graph Signal Processing Framework for Robust Multimodal Fusion with Dynamic Semantic Alignment
- Authors: KV Karthikeya, Ashok Kumar Das, Shantanu Pal, Vivekananda Bhat K, Arun Sekar Rajasekaran
- Abstract summary: We introduce an Adaptive Graph Signal Processing with Dynamic Semantic Alignment (AGSP-DSA) framework to perform robust multimodal data fusion over heterogeneous sources. Experimental results on three benchmark datasets, CMU-MOSEI, AVE, and MM-IMDB, show that AGSP-DSA achieves state-of-the-art performance.
- Score: 18.39945426205332
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: In this paper, we introduce an Adaptive Graph Signal Processing with Dynamic Semantic Alignment (AGSP-DSA) framework for robust multimodal data fusion over heterogeneous sources, including text, audio, and images. The proposed approach uses a dual-graph construction to learn both intra-modal and inter-modal relations, spectral graph filtering to boost informative signals, and multi-scale Graph Convolutional Networks (GCNs) for effective node embedding. A semantic-aware attention mechanism lets each modality contribute dynamically according to its contextual relevance. Experimental results on three benchmark datasets, CMU-MOSEI, AVE, and MM-IMDB, show that AGSP-DSA achieves state-of-the-art performance. More precisely, it reaches 95.3% accuracy, 0.936 F1-score, and 0.924 mAP on CMU-MOSEI, improving over MM-GNN by 2.6% in accuracy. It attains 93.4% accuracy and 0.911 F1-score on AVE, and 91.8% accuracy and 0.886 F1-score on MM-IMDB, demonstrating strong generalization and robustness in the missing-modality setting. These findings verify the effectiveness of AGSP-DSA in advancing multimodal learning for sentiment analysis, event recognition, and multimedia classification.
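The pipeline the abstract describes (per-modality graphs, spectral filtering over their Laplacians, and attention-weighted combination of modality embeddings) can be made concrete with a short sketch. The PyTorch snippet below is a minimal illustration, not the authors' implementation: the class names, the polynomial filter order, and the mean-pooling step are all assumptions made for this example.

```python
# Hypothetical sketch of two ingredients named in the abstract: a learnable
# polynomial spectral filter h(L)x = sum_k L^k x W_k over a modality graph,
# and semantic-aware attention that weights each modality embedding by a
# learned relevance score. Not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

def normalized_laplacian(adj: torch.Tensor) -> torch.Tensor:
    """L = I - D^{-1/2} A D^{-1/2} for a dense adjacency matrix."""
    d_inv_sqrt = adj.sum(-1).clamp(min=1e-6).pow(-0.5)
    norm_adj = d_inv_sqrt.unsqueeze(-1) * adj * d_inv_sqrt.unsqueeze(-2)
    return torch.eye(adj.size(-1)) - norm_adj

class SpectralFilterGCN(nn.Module):
    """One graph layer applying a polynomial spectral filter of a given order."""
    def __init__(self, dim: int, order: int = 2):
        super().__init__()
        self.order = order
        self.weight = nn.Parameter(torch.randn(order + 1, dim, dim) * 0.02)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        lap = normalized_laplacian(adj)
        out, lx = torch.zeros_like(x), x
        for k in range(self.order + 1):
            out = out + lx @ self.weight[k]  # accumulate the L^k x W_k term
            lx = lap @ lx                    # next power of L applied to x
        return F.relu(out)

class SemanticAttentionFusion(nn.Module):
    """Scores each modality embedding and returns their convex combination."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, modality_embs: torch.Tensor) -> torch.Tensor:
        # modality_embs: (num_modalities, dim). A missing modality could be
        # handled by masking its score to -inf before the softmax.
        alpha = F.softmax(self.score(modality_embs), dim=0)
        return (alpha * modality_embs).sum(0)

if __name__ == "__main__":
    nodes, dim = 8, 16
    adj = (torch.rand(nodes, nodes) > 0.7).float()
    adj = ((adj + adj.T) > 0).float()  # symmetrize the random graph
    layer, fuse = SpectralFilterGCN(dim), SemanticAttentionFusion(dim)
    # One pooled embedding per modality (text/audio/image stand-ins).
    embs = torch.stack([layer(torch.randn(nodes, dim), adj).mean(0)
                        for _ in range(3)])
    print(fuse(embs).shape)  # torch.Size([16])
```

The masking note in `SemanticAttentionFusion` is one plausible way to obtain the missing-modality robustness the abstract reports; the paper's actual mechanism is not detailed in this listing.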
Related papers
- TwoHead-SwinFPN: A Unified DL Architecture for Synthetic Manipulation, Detection and Localization in Identity Documents [0.4881924950569192]
TwoHead-SwinFPN is a unified deep learning architecture that simultaneously performs binary classification and precise localization of manipulated regions in ID documents. Our approach integrates a Swin Transformer backbone with a Feature Pyramid Network (FPN) and a UNet-style decoder, enhanced with a Convolutional Block Attention Module (CBAM) for improved feature representation. Experiments on the FantasyIDiap dataset demonstrate superior performance, with 84.31% accuracy and 90.78% AUC for classification, and a 57.24% mean Dice score for localization.
arXiv Detail & Related papers (2026-01-19T09:50:51Z)
- Advanced Data Collection Techniques in Cloud Security: A Multi-Modal Deep Learning Autoencoder Approach [0.0]
This research presents an innovative approach to cloud security that integrates numerous data sources and modalities with multi-modal deep learning autoencoders. The proposed design combines the best features of six deep learning models: Multi-Modal Deep Learning Autoencoder (MMDLA), Anomaly Detection using Adaptive Metric Learning (ADAM), ADADELTA, ADAGRAD, RMSPROP, and Stacked Graph Transformer (SGT).
arXiv Detail & Related papers (2025-11-26T17:10:54Z)
- FAIM: Frequency-Aware Interactive Mamba for Time Series Classification [87.84511960413715]
Time series classification (TSC) is crucial in numerous real-world applications, such as environmental monitoring, medical diagnosis, and posture recognition. We propose FAIM, a lightweight Frequency-Aware Interactive Mamba model. We show that FAIM consistently outperforms existing state-of-the-art (SOTA) methods, achieving a superior trade-off between accuracy and efficiency.
arXiv Detail & Related papers (2025-11-26T08:36:33Z)
- G-MSGINet: A Grouped Multi-Scale Graph-Involution Network for Contactless Fingerprint Recognition [20.458766184257147]
G-MSGINet is a unified framework for robust contactless fingerprint recognition. It jointly performs minutiae localization and identity embedding directly from raw input images. Extensive experiments on three benchmark datasets show that G-MSGINet consistently achieves minutiae F1-scores of $0.83\pm0.02$ and Rank-1 identification accuracies between 97.0% and 99.1%.
arXiv Detail & Related papers (2025-05-13T05:24:24Z)
- Multimodal Sentiment Analysis on CMU-MOSEI Dataset using Transformer-based Models [0.0]
This project performs multimodal sentiment analysis on the CMU-MOSEI dataset. We use transformer-based models with early fusion to integrate the text, audio, and visual modalities (a minimal early-fusion sketch follows this entry). The model achieves strong performance, with 97.87% 7-class accuracy and a 0.9682 F1-score on the test set.
arXiv Detail & Related papers (2025-05-09T15:10:57Z)
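Early fusion, as mentioned in the entry above, means concatenating aligned per-timestep features from all modalities before a single encoder sees them. The snippet below is a hypothetical sketch of that idea, not the project's code; the feature widths, layer counts, and pooling are illustrative assumptions.

```python
# Hypothetical early-fusion transformer for CMU-MOSEI-style inputs: aligned
# per-timestep text/audio/visual features are concatenated, projected, and
# encoded jointly. All dimensions here are illustrative assumptions.
import torch
import torch.nn as nn

class EarlyFusionTransformer(nn.Module):
    def __init__(self, d_text=300, d_audio=74, d_visual=35,
                 d_model=128, n_classes=7):
        super().__init__()
        self.proj = nn.Linear(d_text + d_audio + d_visual, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, text, audio, visual):
        # Early fusion: concatenate per-timestep features, then encode once.
        fused = torch.cat([text, audio, visual], dim=-1)  # (B, T, d_t+d_a+d_v)
        h = self.encoder(self.proj(fused))                # (B, T, d_model)
        return self.head(h.mean(dim=1))                   # (B, n_classes)

model = EarlyFusionTransformer()
logits = model(torch.randn(2, 50, 300),   # text features
               torch.randn(2, 50, 74),    # audio features
               torch.randn(2, 50, 35))    # visual features
print(logits.shape)  # torch.Size([2, 7])
```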
- Multimodality Helps Few-shot 3D Point Cloud Semantic Segmentation [61.91492500828508]
Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal support samples. We introduce a multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality. We propose a simple yet effective Test-time Adaptive Cross-modal (TACC) technique to mitigate training bias.
arXiv Detail & Related papers (2024-10-29T19:28:41Z)
- GCM-Net: Graph-enhanced Cross-Modal Infusion with a Metaheuristic-Driven Network for Video Sentiment and Emotion Analysis [2.012311338995539]
This paper presents a novel framework that leverages multi-modal contextual information from utterances and applies metaheuristic algorithms for utterance-level sentiment and emotion prediction.
To show the effectiveness of our approach, we have conducted extensive evaluations on three prominent multimodal benchmark datasets.
arXiv Detail & Related papers (2024-10-02T10:07:48Z)
- NativE: Multi-modal Knowledge Graph Completion in the Wild [51.80447197290866]
We propose NativE, a comprehensive framework to achieve multi-modal knowledge graph completion (MMKGC) in the wild.
NativE introduces a relation-guided dual adaptive fusion module that enables adaptive fusion for arbitrary modalities.
We construct a new benchmark called WildKGC with five datasets to evaluate our method.
arXiv Detail & Related papers (2024-03-28T03:04:00Z)
- FLASH: Federated Learning Across Simultaneous Heterogeneities [55.0981921695672]
FLASH (Federated Learning Across Simultaneous Heterogeneities) is a lightweight and flexible client selection algorithm. It outperforms state-of-the-art FL frameworks under extensive sources of heterogeneity, achieving substantial and consistent improvements over state-of-the-art baselines.
arXiv Detail & Related papers (2024-02-13T20:04:39Z)
- Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations, we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise (a minimal adapter-and-fusion sketch follows this entry).
arXiv Detail & Related papers (2022-12-01T17:31:42Z)
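To make the adapter-plus-layer-wise-fusion idea from the entry above concrete, here is a hedged sketch. The module names, hidden sizes, and the gated additive fusion rule are assumptions for illustration, not the paper's architecture.

```python
# Hypothetical bottleneck adapter and one layer-wise fusion step, in the
# spirit of the entry above; sizes and fusion rule are assumptions.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, h):
        # Residual form lets the pretrained model stay (mostly) frozen.
        return h + self.up(torch.relu(self.down(h)))

class LayerwiseFusion(nn.Module):
    """Injects aligned audio-visual features into one layer's hidden states."""
    def __init__(self, dim=768, d_av=100):
        super().__init__()
        self.proj = nn.Linear(d_av, dim)
        self.gate = nn.Linear(dim + d_av, dim)

    def forward(self, h_text, av):
        # Gated additive fusion: the gate decides, per position and channel,
        # how much audio-visual signal to mix into the text representation.
        g = torch.sigmoid(self.gate(torch.cat([h_text, av], dim=-1)))
        return h_text + g * self.proj(av)

h = torch.randn(2, 20, 768)    # hidden states of one (frozen) BERT layer
av = torch.randn(2, 20, 100)   # aligned audio-visual features
h = LayerwiseFusion()(h, av)   # fuse at this layer
h = Adapter()(h)               # then adapt for the downstream task
print(h.shape)                 # torch.Size([2, 20, 768])
```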
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)