A Multimodal Seq2Seq Transformer for Predicting Brain Responses to Naturalistic Stimuli
- URL: http://arxiv.org/abs/2507.18104v2
- Date: Fri, 25 Jul 2025 00:49:55 GMT
- Title: A Multimodal Seq2Seq Transformer for Predicting Brain Responses to Naturalistic Stimuli
- Authors: Qianyi He, Yuan Chang Leong
- Abstract summary: The Algonauts 2025 Challenge called on the community to develop encoding models that predict whole-brain fMRI responses to naturalistic multimodal movies. We propose a sequence-to-sequence Transformer that autoregressively predicts fMRI activity from visual, auditory, and language inputs.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Algonauts 2025 Challenge called on the community to develop encoding models that predict whole-brain fMRI responses to naturalistic multimodal movies. In this submission, we propose a sequence-to-sequence Transformer that autoregressively predicts fMRI activity from visual, auditory, and language inputs. Stimulus features were extracted using pretrained models including VideoMAE, HuBERT, Qwen, and BridgeTower. The decoder integrates information from prior brain states and current stimuli via dual cross-attention mechanisms that attend to both perceptual information extracted from the stimulus as well as narrative information provided by high-level summaries of the content. One core innovation of our approach is the use of sequences of multimodal context to predict sequences of brain activity, enabling the model to capture long-range temporal structure in both stimuli and neural responses. Another is the combination of a shared encoder with partial subject-specific decoder, which leverages common representational structure across subjects while accounting for individual variability. Our model achieves strong performance on both in-distribution and out-of-distribution data, demonstrating the effectiveness of temporally-aware, multimodal sequence modeling for brain activity prediction. The code is available at https://github.com/Angelneer926/Algonauts_challenge.
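To make the architecture described in the abstract concrete, the following PyTorch-style sketch shows one way the decoder's dual cross-attention and the shared-encoder/subject-specific-readout split could be wired. All module names, dimensions, and the exact attention layout are illustrative assumptions rather than the authors' implementation; the released code at the GitHub link above is authoritative.

```python
# Minimal sketch of a dual-cross-attention seq2seq decoder for fMRI prediction.
# Module names, dimensions, and wiring are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


class DualCrossAttentionBlock(nn.Module):
    """Decoder block: causal self-attention over prior brain states, then cross-attention
    to perceptual stimulus features and to narrative summary embeddings."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_percept = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_narrative = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, x, percept, narrative, causal_mask):
        h = self.norms[0](x)
        x = x + self.self_attn(h, h, h, attn_mask=causal_mask, need_weights=False)[0]
        h = self.norms[1](x)
        x = x + self.cross_percept(h, percept, percept, need_weights=False)[0]
        h = self.norms[2](x)
        x = x + self.cross_narrative(h, narrative, narrative, need_weights=False)[0]
        return x + self.ffn(self.norms[3](x))


class Seq2SeqEncodingModel(nn.Module):
    """Shared stimulus encoder plus a partially subject-specific decoder:
    only the final linear readout differs across subjects."""

    def __init__(self, feat_dim, summary_dim, n_voxels, n_subjects, d_model=512, n_layers=4):
        super().__init__()
        self.stim_proj = nn.Linear(feat_dim, d_model)        # fused VideoMAE/HuBERT/Qwen/BridgeTower features
        self.summary_proj = nn.Linear(summary_dim, d_model)  # embeddings of high-level narrative summaries
        self.bold_proj = nn.Linear(n_voxels, d_model)        # previous fMRI TRs as decoder input
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, 8, dim_feedforward=4 * d_model, batch_first=True),
            n_layers,
        )
        self.blocks = nn.ModuleList([DualCrossAttentionBlock(d_model) for _ in range(n_layers)])
        self.heads = nn.ModuleList([nn.Linear(d_model, n_voxels) for _ in range(n_subjects)])

    def forward(self, stim_feats, summaries, prev_bold, subject_id):
        percept = self.encoder(self.stim_proj(stim_feats))    # (B, T, d_model)
        narrative = self.summary_proj(summaries)              # (B, S, d_model)
        x = self.bold_proj(prev_bold)                         # (B, T, d_model)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        for blk in self.blocks:
            x = blk(x, percept, narrative, mask)
        return self.heads[subject_id](x)                      # predicted fMRI sequence (B, T, n_voxels)
```

Under this sketch, training would feed ground-truth previous TRs (teacher forcing) while inference would roll the decoder out autoregressively, consistent with the sequence-to-sequence formulation described in the abstract.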
Related papers
- TRIBE: TRImodal Brain Encoder for whole-brain fMRI response prediction [7.864304771129752]
TRIBE is the first deep neural network trained to predict brain responses to stimuli across multiple modalities. Our model can precisely model the spatial and temporal fMRI responses to videos. Our approach paves the way towards building an integrative model of representations in the human brain.
arXiv Detail & Related papers (2025-07-29T20:52:31Z)
- Probing Multimodal Fusion in the Brain: The Dominance of Audiovisual Streams in Naturalistic Encoding [1.2233362977312945]
We develop brain encoding models using state-of-the-art visual (X-CLIP) and auditory (Whisper) feature extractors, and rigorously evaluate them on both in-distribution (ID) and diverse out-of-distribution (OOD) data (a minimal sketch of this encoding-model recipe appears after this list).
arXiv Detail & Related papers (2025-07-25T08:12:26Z)
- SIM: Surface-based fMRI Analysis for Inter-Subject Multimodal Decoding from Movie-Watching Experiments [9.786770726122436]
Current AI frameworks for brain decoding and encoding typically train and test models within the same datasets. A key obstacle to model generalisation is the degree of variability in inter-subject cortical organisation. In this paper we address this through the use of surface vision transformers, which build a generalisable model of cortical functional dynamics.
arXiv Detail & Related papers (2025-01-27T20:05:17Z)
- Multimodal Latent Language Modeling with Next-Token Diffusion [111.93906046452125]
Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video). We propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers.
arXiv Detail & Related papers (2024-12-11T18:57:32Z)
- MindFormer: Semantic Alignment of Multi-Subject fMRI for Brain Decoding [50.55024115943266]
We introduce MindFormer, a novel method for the semantic alignment of multi-subject fMRI signals.
The model is specifically designed to generate fMRI-conditioned feature vectors that can be used to condition a Stable Diffusion model for fMRI-to-image generation or a large language model (LLM) for fMRI-to-text generation.
Our experimental results demonstrate that MindFormer generates semantically consistent images and text across different subjects.
arXiv Detail & Related papers (2024-05-28T00:36:25Z)
- Animate Your Thoughts: Decoupled Reconstruction of Dynamic Natural Vision from Slow Brain Activity [13.04953215936574]
We propose a two-stage model named Mind-Animator to reconstruct human dynamic vision from brain activity. During the fMRI-to-feature stage, we decouple semantic, structure, and motion features from fMRI. In the feature-to-video stage, these features are integrated into videos using an inflated Stable Diffusion model.
arXiv Detail & Related papers (2024-05-06T08:56:41Z)
- MindBridge: A Cross-Subject Brain Decoding Framework [60.58552697067837]
Brain decoding aims to reconstruct stimuli from acquired brain signals.
Currently, brain decoding is confined to a per-subject-per-model paradigm.
We present MindBridge, which achieves cross-subject brain decoding with a single model.
arXiv Detail & Related papers (2024-04-11T15:46:42Z)
- Dynamics Based Neural Encoding with Inter-Intra Region Connectivity [2.3825930751052358]
We propose the first large-scale study comparing video understanding models on visual cortex recordings elicited by video stimuli. We provide key insights into how video understanding models predict visual cortex responses. We also propose a novel neural encoding scheme built on top of the best-performing video understanding models.
arXiv Detail & Related papers (2024-02-19T20:29:49Z)
- Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks [55.36987468073152]
This paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism.
The DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders.
Our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA.
arXiv Detail & Related papers (2023-11-09T05:24:20Z)
- Multimodal Neurons in Pretrained Text-Only Transformers [52.20828443544296]
We identify "multimodal neurons" that convert visual representations into corresponding text.
We show that multimodal neurons operate on specific visual concepts across inputs, and have a systematic causal effect on image captioning.
arXiv Detail & Related papers (2023-08-03T05:27:12Z)
- Multimodal foundation models are better simulators of the human brain [65.10501322822881]
We present a newly-designed multimodal foundation model pre-trained on 15 million image-text pairs.
We find that both visual and lingual encoders trained multimodally are more brain-like compared with unimodal ones.
arXiv Detail & Related papers (2022-08-17T12:36:26Z)
- A shared neural encoding model for the prediction of subject-specific fMRI response [17.020869686284165]
We propose a shared convolutional neural encoding method that accounts for individual-level differences.
Our method leverages multi-subject data to improve the prediction of subject-specific responses evoked by visual or auditory stimuli.
arXiv Detail & Related papers (2020-06-29T04:10:14Z)
- M2Net: Multi-modal Multi-channel Network for Overall Survival Time Prediction of Brain Tumor Patients [151.4352001822956]
Early and accurate prediction of overall survival (OS) time can help to obtain better treatment planning for brain tumor patients.
Existing prediction methods rely on radiomic features at the local lesion area of a magnetic resonance (MR) volume.
We propose an end-to-end OS time prediction model, namely the Multi-modal Multi-channel Network (M2Net).
arXiv Detail & Related papers (2020-06-01T05:21:37Z)
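Several of the encoding-model papers above (e.g., the X-CLIP/Whisper study and the shared neural encoding model) follow the same core recipe as the Algonauts evaluation: regress voxel responses on pretrained stimulus features and score predictions with voxelwise Pearson correlation on held-out ID and OOD data. The sketch below is a minimal, generic version of that recipe using numpy and scikit-learn; all variable names and shapes are illustrative assumptions, not any paper's actual code.

```python
# Minimal sketch of a voxelwise linear encoding model and its evaluation,
# the common recipe behind the encoding-model papers listed above.
# Shapes and variable names are illustrative assumptions, not any paper's code.
import numpy as np
from sklearn.linear_model import RidgeCV


def fit_encoding_model(X_train, Y_train, alphas=(1.0, 10.0, 100.0, 1000.0)):
    """Fit ridge regression from stimulus features to all voxels jointly.
    X_train: (n_TRs, n_features) features from a pretrained model (e.g., X-CLIP, Whisper).
    Y_train: (n_TRs, n_voxels) fMRI responses."""
    model = RidgeCV(alphas=alphas)
    model.fit(X_train, Y_train)
    return model


def voxelwise_pearson(Y_true, Y_pred):
    """Pearson correlation computed independently for every voxel (column)."""
    Yt = Y_true - Y_true.mean(axis=0)
    Yp = Y_pred - Y_pred.mean(axis=0)
    num = (Yt * Yp).sum(axis=0)
    den = np.sqrt((Yt ** 2).sum(axis=0) * (Yp ** 2).sum(axis=0)) + 1e-8
    return num / den


# Usage: evaluate on held-out in-distribution (ID) and out-of-distribution (OOD) movies.
# model = fit_encoding_model(X_id_train, Y_id_train)
# r_id  = voxelwise_pearson(Y_id_test,  model.predict(X_id_test))
# r_ood = voxelwise_pearson(Y_ood_test, model.predict(X_ood_test))
```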