Improving Multimodal Brain Encoding Model with Dynamic Subject-awareness Routing
- URL: http://arxiv.org/abs/2510.04670v2
- Date: Fri, 10 Oct 2025 06:31:12 GMT
- Title: Improving Multimodal Brain Encoding Model with Dynamic Subject-awareness Routing
- Authors: Xuanhua Yin, Runkai Zhao, Weidong Cai
- Abstract summary: AFIRE (Agnostic Framework for Multimodal fMRI Response Encoding) standardizes time-aligned post-fusion tokens from varied encoders. MIND combines token-dependent Top-K sparse routing with a subject prior to personalize expert usage.
- Score: 8.942649901923332
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Naturalistic fMRI encoding must handle multimodal inputs, shifting fusion styles, and pronounced inter-subject variability. We introduce AFIRE (Agnostic Framework for Multimodal fMRI Response Encoding), an agnostic interface that standardizes time-aligned post-fusion tokens from varied encoders, and MIND, a plug-and-play Mixture-of-Experts decoder with subject-aware dynamic gating. Trained end-to-end for whole-brain prediction, AFIRE decouples the decoder from upstream fusion, while MIND combines token-dependent Top-K sparse routing with a subject prior to personalize expert usage without sacrificing generality. Experiments across multiple multimodal backbones and subjects show consistent improvements over strong baselines, enhanced cross-subject generalization, and interpretable expert patterns that correlate with content type. The framework offers a simple attachment point for new encoders and datasets, enabling robust, plug-and-improve performance for naturalistic neuroimaging studies.
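The abstract's central mechanism is MIND's subject-aware sparse routing: each time-aligned post-fusion token is scored against a bank of experts, a learned per-subject prior biases those scores, and only the Top-K experts contribute to the whole-brain prediction. The sketch below is a minimal, hypothetical PyTorch illustration of that idea; the module names, the additive form of the subject prior, and all hyperparameters are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SubjectAwareMoEDecoder(nn.Module):
    """Sketch of a Mixture-of-Experts decoder with subject-aware Top-K routing.

    Hypothetical reconstruction from the abstract: token-dependent gating
    scores are biased by a per-subject prior, then sparsified to Top-K experts.
    """

    def __init__(self, d_model=768, n_experts=8, top_k=2, n_subjects=4, n_voxels=1000):
        super().__init__()
        self.top_k = top_k
        # Token-dependent routing scores over experts.
        self.gate = nn.Linear(d_model, n_experts)
        # Per-subject bias over experts (assumed to be additive on the gate logits).
        self.subject_prior = nn.Embedding(n_subjects, n_experts)
        # Each expert maps a fused token to a whole-brain (voxel-wise) prediction.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, n_voxels))
            for _ in range(n_experts)
        ])

    def forward(self, tokens, subject_id):
        # tokens: (B, T, d_model) time-aligned post-fusion tokens from any upstream encoder
        # subject_id: (B,) integer subject index
        logits = self.gate(tokens) + self.subject_prior(subject_id).unsqueeze(1)  # (B, T, E)
        topk_vals, topk_idx = logits.topk(self.top_k, dim=-1)                     # (B, T, K)
        weights = F.softmax(topk_vals, dim=-1)  # renormalize over the selected experts

        # Dense evaluation of all experts for clarity; a real sparse MoE would
        # dispatch each token only to its Top-K experts.
        expert_outs = torch.stack([e(tokens) for e in self.experts], dim=-2)      # (B, T, E, V)
        gather_idx = topk_idx.unsqueeze(-1).expand(-1, -1, -1, expert_outs.size(-1))
        selected = expert_outs.gather(-2, gather_idx)                             # (B, T, K, V)
        return (weights.unsqueeze(-1) * selected).sum(dim=-2)                     # (B, T, V)


# Example usage with made-up shapes:
decoder = SubjectAwareMoEDecoder()
tokens = torch.randn(2, 30, 768)                 # 30 fused tokens per clip
pred = decoder(tokens, torch.tensor([0, 3]))     # two different subjects
print(pred.shape)                                # torch.Size([2, 30, 1000])
```

For readability, the sketch evaluates every expert densely and then gathers the Top-K outputs; the point it illustrates is that routing depends both on the token content and on which subject is being predicted.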
Related papers
- MCA: Modality Composition Awareness for Robust Composed Multimodal Retrieval [34.21875369884307]
Multimodal large language models (MLLMs) enable a unified encoder that directly processes composed inputs. While flexible and advanced, unified encoders trained with conventional contrastive learning are prone to learning modality shortcuts. We propose a modality composition awareness framework to mitigate this issue.
arXiv Detail & Related papers (2025-10-17T11:20:35Z) - Fusion to Enhance: Fusion Visual Encoder to Enhance Multimodal Language Model [1.3663057923522652]
We introduce Fusion to Enhance (FtZ), a novel vision tower framework. FtZ moves beyond the single-encoder design by composing a semantically powerful anchor encoder with a perception-rich augmenting encoder. This work shows that composing heterogeneous expert encoders is an efficient and effective path to overcoming the visual perception bottleneck in current MLLMs.
arXiv Detail & Related papers (2025-08-31T02:22:57Z) - FindRec: Stein-Guided Entropic Flow for Multi-Modal Sequential Recommendation [57.577843653775]
We propose FindRec (Flexible unified information disentanglement for multi-modal sequential Recommendation). A Stein kernel-based Integrated Information Coordination Module (IICM) theoretically guarantees distribution consistency between multimodal features and ID streams. A cross-modal expert routing mechanism adaptively filters and combines multimodal features based on their contextual relevance.
arXiv Detail & Related papers (2025-07-07T04:09:45Z) - MoCA: Multi-modal Cross-masked Autoencoder for Digital Health Measurements [2.8493802389913694]
We propose the Multi-modal Cross-masked Autoencoder (MoCA), a self-supervised learning framework that combines transformer architecture with masked autoencoder (MAE) methodology. MoCA demonstrates strong performance boosts across reconstruction and downstream classification tasks on diverse benchmark datasets. Our approach offers a novel solution for leveraging unlabeled multi-modal wearable data while handling missing modalities, with broad applications across digital health domains.
arXiv Detail & Related papers (2025-06-02T21:07:25Z) - StitchFusion: Weaving Any Visual Modalities to Enhance Multimodal Semantic Segmentation [63.31007867379312]
We propose StitchFusion, a framework that integrates large-scale pre-trained models directly as encoders and feature fusers. We introduce a multi-directional adapter module (MultiAdapter) to enable cross-modal information transfer during encoding. Our model achieves state-of-the-art performance on four multi-modal segmentation datasets with minimal additional parameters.
arXiv Detail & Related papers (2024-08-02T15:41:16Z) - Tokenization, Fusion, and Augmentation: Towards Fine-grained Multi-modal Entity Representation [51.80447197290866]
Multi-modal knowledge graph completion (MMKGC) aims to discover unobserved knowledge from given knowledge graphs. Existing MMKGC methods usually extract multi-modal features with pre-trained models. We introduce MyGO, a novel framework to tokenize, fuse, and augment the fine-grained multi-modal representations of entities.
arXiv Detail & Related papers (2024-04-15T05:40:41Z) - MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications.
Recent methods have focused on exploiting advances of self-supervised learning (SSL) for pre-training of strong multimodal encoders.
We propose a different perspective on the problem and investigate how adapting SSL-pre-trained disjoint unimodal encoders can advance multimodal DFER performance.
arXiv Detail & Related papers (2024-04-13T13:39:26Z) - Federated Modality-specific Encoders and Multimodal Anchors for Personalized Brain Tumor Segmentation [29.584319651813754]
Federated modality-specific encoders and multimodal anchors (FedMEMA) are proposed.
FedMEMA employs an exclusive encoder for each modality to account for the inter-modal heterogeneity.
FedMEMA is validated on the BraTS 2020 benchmark for multimodal brain tumor segmentation.
arXiv Detail & Related papers (2024-03-18T14:02:53Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network.
A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features.
Experimental results on four benchmark datasets demonstrate that the proposed approach achieves state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-05-05T02:27:25Z)