Stacked Regression using Off-the-shelf, Stimulus-tuned and Fine-tuned Neural Networks for Predicting fMRI Brain Responses to Movies (Algonauts 2025 Report)
- URL: http://arxiv.org/abs/2510.06235v1
- Date: Thu, 02 Oct 2025 15:24:16 GMT
- Title: Stacked Regression using Off-the-shelf, Stimulus-tuned and Fine-tuned Neural Networks for Predicting fMRI Brain Responses to Movies (Algonauts 2025 Report)
- Authors: Robert Scholz, Kunal Bagga, Christine Ahrends, Carlo Alberto Barbano, et al.
- Abstract summary: We present our submission to the Algonauts 2025 Challenge. The goal is to predict fMRI brain responses to movie stimuli. Our approach integrates multimodal representations from large language models, video encoders, audio models, and vision-language models.
- Score: 1.7266027274320124
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present our submission to the Algonauts 2025 Challenge, where the goal is to predict fMRI brain responses to movie stimuli. Our approach integrates multimodal representations from large language models, video encoders, audio models, and vision-language models, combining both off-the-shelf and fine-tuned variants. To improve performance, we enhanced textual inputs with detailed transcripts and summaries, and we explored stimulus-tuning and fine-tuning strategies for language and vision models. Predictions from individual models were combined using stacked regression, yielding solid results. Our submission, under the team name Seinfeld, ranked 10th. We make all code and resources publicly available, contributing to ongoing efforts in developing multimodal encoding models for brain activity.
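The combination step named in the abstract, stacked regression, can be illustrated as a small two-level pipeline: per-modality ridge models are fit first, and a second regression is then fit on their out-of-fold predictions. The sketch below is only a schematic of that idea, assuming scikit-learn; the feature dimensions, regularization grid, and cross-validation setup are placeholders, not the authors' actual configuration.

```python
# Minimal sketch of stacked regression over multiple feature spaces.
# Assumes scikit-learn/NumPy; all shapes and alphas are illustrative placeholders.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n_train, n_test, n_voxels = 800, 200, 200

# Hypothetical per-modality features (e.g. language, video, audio embeddings).
feature_sets = {
    "language": (rng.standard_normal((n_train, 256)), rng.standard_normal((n_test, 256))),
    "video":    (rng.standard_normal((n_train, 512)), rng.standard_normal((n_test, 512))),
    "audio":    (rng.standard_normal((n_train, 128)), rng.standard_normal((n_test, 128))),
}
y_train = rng.standard_normal((n_train, n_voxels))  # fMRI responses (TRs x voxels)

alphas = np.logspace(1, 5, 9)
level1_oof, level1_test = [], []
for name, (X_tr, X_te) in feature_sets.items():
    model = RidgeCV(alphas=alphas)
    # Out-of-fold predictions keep the second level from overfitting to any one model.
    level1_oof.append(cross_val_predict(model, X_tr, y_train, cv=5))
    model.fit(X_tr, y_train)
    level1_test.append(model.predict(X_te))

# Second level: regress the measured fMRI on the stacked first-level predictions.
stacker = RidgeCV(alphas=alphas)
stacker.fit(np.hstack(level1_oof), y_train)
y_test_pred = stacker.predict(np.hstack(level1_test))
print(y_test_pred.shape)  # (n_test, n_voxels)
```

Fitting the second level on out-of-fold rather than in-sample predictions keeps the stacking weights from simply favoring whichever first-level model overfits the training movies most.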
Related papers
- Can World Models Benefit VLMs for World Dynamics? [59.73433292793044]
We investigate the capabilities that emerge when world model priors are transferred into Vision-Language Models.
We name our best-performing variant Dynamic Vision Aligner (DyVA).
We find DyVA to surpass both open-source and proprietary baselines, achieving state-of-the-art or comparable performance.
arXiv Detail & Related papers (2025-10-01T13:07:05Z)
- TRIBE: TRImodal Brain Encoder for whole-brain fMRI response prediction [7.864304771129752]
TRIBE is the first deep neural network trained to predict brain responses to stimuli across multiple modalities.
Our model can precisely model the spatial and temporal fMRI responses to videos.
Our approach paves the way towards building an integrative model of representations in the human brain.
arXiv Detail & Related papers (2025-07-29T20:52:31Z)
- Predicting Brain Responses To Natural Movies With Multimodal LLMs [0.881196878143281]
We present MedARC's team solution to the Algonauts 2025 challenge.
Our pipeline leveraged rich multimodal representations from various state-of-the-art pretrained models across video (V-JEPA2), speech (Whisper), text (Llama 3.2), vision-text (InternVL3), and vision-text-audio (Qwen2.5-Omni).
Our final submission achieved a mean Pearson's correlation of 0.2085 on the test split of withheld out-of-distribution movies, placing our team in fourth place for the competition.
arXiv Detail & Related papers (2025-07-26T13:57:08Z)
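The leaderboard number mentioned here, mean Pearson's correlation on withheld movies, is typically computed voxel by voxel and then averaged. Below is a minimal, unofficial sketch of that kind of scoring; the array names and shapes are illustrative, not the challenge's evaluation code.

```python
# Voxelwise Pearson correlation of the kind used to rank encoding-model submissions.
# Unofficial sketch with placeholder data, not the Algonauts evaluation code.
import numpy as np

def voxelwise_pearson(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    """Correlate predicted and measured responses independently for each voxel.

    y_true, y_pred: arrays of shape (n_timepoints, n_voxels).
    Returns one correlation coefficient per voxel.
    """
    yt = y_true - y_true.mean(axis=0)
    yp = y_pred - y_pred.mean(axis=0)
    num = (yt * yp).sum(axis=0)
    denom = np.sqrt((yt ** 2).sum(axis=0) * (yp ** 2).sum(axis=0))
    return num / denom

# The reported score is then the mean over voxels (and, in practice, subjects).
rng = np.random.default_rng(1)
y_true = rng.standard_normal((300, 1000))
y_pred = y_true + rng.standard_normal((300, 1000))
print(voxelwise_pearson(y_true, y_pred).mean())
```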
- A Multimodal Seq2Seq Transformer for Predicting Brain Responses to Naturalistic Stimuli [0.0]
The Algonauts 2025 Challenge called on the community to develop encoding models that predict whole-brain fMRI responses to naturalistic multimodal movies.
We propose a sequence-to-sequence Transformer that autoregressively predicts fMRI activity from visual, auditory, and language inputs.
arXiv Detail & Related papers (2025-07-24T05:29:37Z)
- MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO [87.52631406241456]
Recent text-to-image systems face limitations in handling multimodal inputs and complex reasoning tasks.
We introduce MindOmni, a unified multimodal large language model that addresses these challenges by incorporating reasoning generation through reinforcement learning.
arXiv Detail & Related papers (2025-05-19T12:17:04Z)
- Modelling Multimodal Integration in Human Concept Processing with Vision-Language Models [7.511284868070148]
We investigate whether integration of visuo-linguistic information leads to representations that are more aligned with human brain activity.
Our findings indicate an advantage of multimodal models in predicting human brain activations.
arXiv Detail & Related papers (2024-07-25T10:08:37Z)
- Revealing Vision-Language Integration in the Brain with Multimodal Networks [21.88969136189006]
We use (multi)modal deep neural networks (DNNs) to probe for sites of multimodal integration in the human brain by predicting stereoencephalography (SEEG) recordings taken while human subjects watched movies.
We operationalize sites of multimodal integration as regions where a multimodal vision-language model predicts recordings better than unimodal language, unimodal vision, or linearly-integrated language-vision models.
arXiv Detail & Related papers (2024-06-20T16:43:22Z)
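The operational test described in that summary, multimodal features outperforming unimodal and linearly combined ones, can be written down compactly. The sketch below is a hypothetical illustration with random placeholder features and a single recording channel, not the paper's analysis code.

```python
# Illustrative comparison: flag a channel as a candidate multimodal-integration site
# when vision-language features predict it better than unimodal or concatenated ones.
# All features here are random placeholders; the ridge/CV setup is an assumption.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_predict

def cv_correlation(X, y, alphas=np.logspace(0, 4, 9)):
    """Cross-validated Pearson correlation between predicted and recorded activity."""
    pred = cross_val_predict(RidgeCV(alphas=alphas), X, y, cv=5)
    y_c, p_c = y - y.mean(), pred - pred.mean()
    return float((y_c * p_c).sum() / np.sqrt((y_c ** 2).sum() * (p_c ** 2).sum()))

rng = np.random.default_rng(2)
n = 400
X_vision, X_language = rng.standard_normal((n, 64)), rng.standard_normal((n, 64))
X_multimodal = rng.standard_normal((n, 64))   # e.g. a joint vision-language embedding
y_electrode = rng.standard_normal(n)          # one recording channel

scores = {
    "vision": cv_correlation(X_vision, y_electrode),
    "language": cv_correlation(X_language, y_electrode),
    "concat": cv_correlation(np.hstack([X_vision, X_language]), y_electrode),
    "multimodal": cv_correlation(X_multimodal, y_electrode),
}
is_integration_site = scores["multimodal"] > max(scores["vision"], scores["language"], scores["concat"])
print(scores, is_integration_site)
```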
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
- Brain encoding models based on multimodal transformers can transfer across language and vision [60.72020004771044]
We used representations from multimodal transformers to train encoding models that can transfer across fMRI responses to stories and movies.
We found that encoding models trained on brain responses to one modality can successfully predict brain responses to the other modality.
arXiv Detail & Related papers (2023-05-20T17:38:44Z)
- PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)