MAESTRO: Adaptive Sparse Attention and Robust Learning for Multimodal Dynamic Time Series
- URL: http://arxiv.org/abs/2509.25278v1
- Date: Mon, 29 Sep 2025 03:07:06 GMT
- Title: MAESTRO: Adaptive Sparse Attention and Robust Learning for Multimodal Dynamic Time Series
- Authors: Payal Mohapatra, Yueyuan Sui, Akash Pandey, Stephen Xia, Qi Zhu
- Abstract summary: We introduce MAESTRO, a novel framework that overcomes key limitations of existing multimodal learning approaches. At its core, MAESTRO facilitates dynamic intra- and cross-modal interactions based on task relevance. We evaluate MAESTRO against 10 baselines on four diverse datasets spanning three applications.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: From clinical healthcare to daily living, continuous sensor monitoring across multiple modalities has shown great promise for real-world intelligent decision-making but also faces various challenges. In this work, we introduce MAESTRO, a novel framework that overcomes key limitations of existing multimodal learning approaches: (1) reliance on a single primary modality for alignment, (2) pairwise modeling of modalities, and (3) assumption of complete modality observations. These limitations hinder the applicability of these approaches in real-world multimodal time-series settings, where primary modality priors are often unclear, the number of modalities can be large (making pairwise modeling impractical), and sensor failures often result in arbitrary missing observations. At its core, MAESTRO facilitates dynamic intra- and cross-modal interactions based on task relevance, and leverages symbolic tokenization and adaptive attention budgeting to construct long multimodal sequences, which are processed via sparse cross-modal attention. The resulting cross-modal tokens are routed through a sparse Mixture-of-Experts (MoE) mechanism, enabling black-box specialization under varying modality combinations. We evaluate MAESTRO against 10 baselines on four diverse datasets spanning three applications, and observe average relative improvements of 4% and 8% over the best existing multimodal and multivariate approaches, respectively, under complete observations. Under partial observations -- with up to 40% of missing modalities -- MAESTRO achieves an average 9% improvement. Further analysis also demonstrates the robustness and efficiency of MAESTRO's sparse, modality-aware design for learning from dynamic time series.
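To make the pipeline described in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of the three stages it names: symbolic tokenization of each modality, sparse cross-modal attention over the concatenated multimodal token sequence, and routing of the resulting tokens through a sparse Mixture-of-Experts. The class names, sizes, nearest-codeword tokenizer, top-k attention mask, and top-1 gating are illustrative assumptions for exposition, not the authors' implementation of MAESTRO.

```python
# Hypothetical sketch (PyTorch): symbolic tokenization -> long multimodal
# sequence -> sparse cross-modal attention -> sparse MoE routing.
# Sizes, vocabulary, top-k budget, and top-1 gating are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SymbolicTokenizer(nn.Module):
    """Discretizes windowed time-series segments into symbols, then embeds them."""

    def __init__(self, window: int, vocab_size: int, dim: int):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(vocab_size, window))
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_windows, window) raw segments of one modality
        dists = torch.cdist(x, self.codebook.unsqueeze(0).expand(x.size(0), -1, -1))
        symbols = dists.argmin(dim=-1)            # nearest codeword = symbol id
        return self.embed(symbols)                # (batch, num_windows, dim)


class SparseCrossModalAttention(nn.Module):
    """Self-attention over the concatenated multimodal sequence; each query
    keeps only its top-k keys as a crude stand-in for attention budgeting."""

    def __init__(self, dim: int, top_k: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.top_k = top_k

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        q, k, v = self.qkv(tokens).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5
        kth = scores.topk(self.top_k, dim=-1).values[..., -1:]
        scores = scores.masked_fill(scores < kth, float("-inf"))
        return F.softmax(scores, dim=-1) @ v


class SparseMoE(nn.Module):
    """Routes every cross-modal token to its single best expert (top-1 gating)."""

    def __init__(self, dim: int, num_experts: int, num_classes: int):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        weights = F.softmax(self.gate(tokens), dim=-1)
        top_w, top_idx = weights.max(dim=-1)      # top-1 expert per token
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                   # tokens routed to expert e
            if mask.any():
                out[mask] = expert(tokens[mask]) * top_w[mask].unsqueeze(-1)
        return self.head(out.mean(dim=1))         # pooled logits per sequence


# Toy usage: three modalities with different numbers of windows are tokenized
# and concatenated into one long multimodal sequence before attention + MoE.
batch, window, dim = 4, 16, 32
modalities = [torch.randn(batch, n, window) for n in (20, 12, 8)]
tokenizers = [SymbolicTokenizer(window, vocab_size=64, dim=dim) for _ in modalities]
tokens = torch.cat([tok(x) for tok, x in zip(tokenizers, modalities)], dim=1)
tokens = SparseCrossModalAttention(dim, top_k=8)(tokens)
logits = SparseMoE(dim, num_experts=4, num_classes=5)(tokens)
print(logits.shape)  # torch.Size([4, 5])
```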
Related papers
- Amplifying Prominent Representations in Multimodal Learning via Variational Dirichlet Process [55.91649771370862]
The Dirichlet process (DP) mixture model is a powerful non-parametric method that can amplify the most prominent features. We propose a new DP-driven multimodal learning framework that automatically achieves an optimal balance between prominent intra-modal representation learning and cross-modal alignment.
arXiv Detail & Related papers (2025-10-23T16:53:24Z)
- MGTS-Net: Exploring Graph-Enhanced Multimodal Fusion for Augmented Time Series Forecasting [1.7077661158850292]
We propose MGTS-Net, a Multimodal Graph-enhanced Network for Time Series forecasting. The model consists of three core components: (1) a Multimodal Feature Extraction layer (MFE), (2) a Multimodal Feature Fusion layer (MFF), and (3) a Multi-Scale Prediction layer (MSP).
arXiv Detail & Related papers (2025-10-18T04:47:10Z)
- Attention-Driven Multimodal Alignment for Long-term Action Quality Assessment [5.262258418692889]
Long-term action quality assessment (AQA) focuses on evaluating the quality of human activities in videos lasting up to several minutes. The Long-term Multimodal Attention Consistency Network (LMAC-Net) introduces a multimodal attention consistency mechanism to explicitly align multimodal features. Experiments conducted on the RG and Fis-V datasets demonstrate that LMAC-Net significantly outperforms existing methods.
arXiv Detail & Related papers (2025-07-29T15:58:39Z)
- MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance across the MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z)
- Reducing Unimodal Bias in Multi-Modal Semantic Segmentation with Multi-Scale Functional Entropy Regularization [66.10528870853324]
Fusing and balancing multi-modal inputs from novel sensors for dense prediction tasks is critically important. One major limitation is the tendency of multi-modal frameworks to over-rely on easily learnable modalities. We propose a plug-and-play regularization term based on functional entropy, which introduces no additional parameters.
arXiv Detail & Related papers (2025-05-10T12:58:15Z)
- RingMoE: Mixture-of-Modality-Experts Multi-Modal Foundation Models for Universal Remote Sensing Image Interpretation [24.48561340129571]
RingMoE is a unified RS foundation model with 14.7 billion parameters, pre-trained on 400 million multi-modal RS images from nine satellites. It has been deployed and trialed in multiple sectors, including emergency response, land management, marine sciences, and urban planning.
arXiv Detail & Related papers (2025-04-04T04:47:54Z)
- Continual Multimodal Contrastive Learning [99.53621521696051]
Multimodal Contrastive Learning (MCL) advances in aligning different modalities and generating multimodal representations in a joint space. However, a critical yet often overlooked challenge remains: multimodal data is rarely collected in a single process, and training from scratch is computationally expensive. In this paper, we formulate CMCL through two specialized principles of stability and plasticity. We theoretically derive a novel optimization-based method, which projects updated gradients from dual sides onto subspaces where any gradient is prevented from interfering with the previously learned knowledge.
arXiv Detail & Related papers (2025-03-19T07:57:08Z)
- Asynchronous Multimodal Video Sequence Fusion via Learning Modality-Exclusive and -Agnostic Representations [19.731611716111566]
We propose a Multimodal fusion approach for learning modality-Exclusive and modality-Agnostic representations.
We introduce a predictive self-attention module to capture reliable context dynamics within modalities.
A hierarchical cross-modal attention module is designed to explore valuable element correlations among modalities.
A double-discriminator strategy is presented to ensure the production of distinct representations in an adversarial manner.
arXiv Detail & Related papers (2024-07-06T04:36:48Z)
- MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications.
Recent methods have focused on exploiting advances of self-supervised learning (SSL) for pre-training of strong multimodal encoders.
We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical Fusion for Multimodal Affect Recognition [69.32305810128994]
Incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition.
We propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model.
HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention.
arXiv Detail & Related papers (2023-05-23T01:24:15Z)