Multimodal Recurrent Ensembles for Predicting Brain Responses to Naturalistic Movies (Algonauts 2025)
- URL: http://arxiv.org/abs/2507.17897v2
- Date: Fri, 25 Jul 2025 15:38:12 GMT
- Title: Multimodal Recurrent Ensembles for Predicting Brain Responses to Naturalistic Movies (Algonauts 2025)
- Authors: Semih Eren, Deniz Kucukahmetler, Nico Scherf
- Abstract summary: We present a hierarchical multimodal recurrent ensemble that maps pretrained video, audio, and language embeddings to fMRI time series. Training relies on a composite MSE-correlation loss and a curriculum that gradually shifts emphasis from early sensory to late association regions. The approach establishes a simple, extensible baseline for future multimodal brain-encoding benchmarks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurately predicting distributed cortical responses to naturalistic stimuli requires models that integrate visual, auditory and semantic information over time. We present a hierarchical multimodal recurrent ensemble that maps pretrained video, audio, and language embeddings to fMRI time series recorded while four subjects watched almost 80 hours of movies provided by the Algonauts 2025 challenge. Modality-specific bidirectional RNNs encode temporal dynamics; their hidden states are fused and passed to a second recurrent layer, and lightweight subject-specific heads output responses for 1000 cortical parcels. Training relies on a composite MSE-correlation loss and a curriculum that gradually shifts emphasis from early sensory to late association regions. Averaging 100 model variants further boosts robustness. The resulting system ranked third on the competition leaderboard, achieving an overall Pearson r = 0.2094 and the highest single-parcel peak score (mean r = 0.63) among all participants, with particularly strong gains for the most challenging subject (Subject 5). The approach establishes a simple, extensible baseline for future multimodal brain-encoding benchmarks.
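The abstract gives enough architectural detail for a rough sketch of the pipeline. Below is a minimal PyTorch sketch, assuming GRUs for the recurrent layers; the embedding dimensions, hidden size, loss weight alpha, and the per-parcel curriculum weights parcel_w are illustrative placeholders, not values reported in the paper.

```python
import torch
import torch.nn as nn

class MultimodalRecurrentEncoder(nn.Module):
    """Hierarchical sketch: per-modality bidirectional RNNs, a second
    recurrent layer over the fused hidden states, and lightweight
    subject-specific heads over 1000 cortical parcels."""

    def __init__(self, dims=None, hidden=256, n_parcels=1000, n_subjects=4):
        super().__init__()
        # Illustrative embedding sizes; the paper does not specify them.
        dims = dims or {"video": 768, "audio": 512, "text": 1024}
        # One bidirectional GRU per modality (GRU is an assumption;
        # the abstract only says "bidirectional RNNs").
        self.encoders = nn.ModuleDict({
            m: nn.GRU(d, hidden, batch_first=True, bidirectional=True)
            for m, d in dims.items()
        })
        # Second recurrent layer over the concatenated modality states.
        self.fusion = nn.GRU(2 * hidden * len(dims), hidden,
                             batch_first=True, bidirectional=True)
        # Lightweight per-subject linear readouts.
        self.heads = nn.ModuleList(
            [nn.Linear(2 * hidden, n_parcels) for _ in range(n_subjects)]
        )

    def forward(self, feats, subject):
        # feats: {modality: (batch, time, dim)} pretrained embeddings.
        states = [self.encoders[m](x)[0] for m, x in feats.items()]
        fused, _ = self.fusion(torch.cat(states, dim=-1))
        return self.heads[subject](fused)  # (batch, time, n_parcels)


def composite_loss(pred, target, parcel_w=None, alpha=0.5, eps=1e-8):
    """MSE plus (1 - Pearson r), averaged over parcels. parcel_w is a
    per-parcel weight through which a curriculum could shift emphasis
    from sensory to association regions over training."""
    if parcel_w is None:
        parcel_w = torch.ones(pred.shape[-1], device=pred.device)
    mse = (parcel_w * (pred - target).pow(2).mean(dim=(0, 1))).mean()
    p = pred - pred.mean(dim=1, keepdim=True)    # center over time
    t = target - target.mean(dim=1, keepdim=True)
    r = (p * t).sum(dim=1) / (p.norm(dim=1) * t.norm(dim=1) + eps)
    corr = (parcel_w * (1.0 - r).mean(dim=0)).mean()
    return alpha * mse + (1.0 - alpha) * corr
```

Under this reading, the ensembling step reduces to a simple mean of the parcel-wise predictions of many independently trained variants (the paper averages 100).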
Related papers
- Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline [58.585692088008905]
MM-Lifelong is a dataset designed for Multimodal Lifelong Understanding. Comprising 181.1 hours of footage, it is structured across Day, Week, and Month scales to capture varying temporal densities.
arXiv Detail & Related papers (2026-03-05T18:52:12Z) - OmniRet: Efficient and High-Fidelity Omni Modality Retrieval [51.80205678389465]
We present OmniRet, the first retrieval model capable of handling complex, composed queries spanning three key modalities: text, vision, and audio. Our model demonstrates significant improvements on composed-query, audio, and video retrieval tasks, while achieving on-par performance with state-of-the-art models on others.
arXiv Detail & Related papers (2026-03-02T17:19:55Z) - Digital FAST: An AI-Driven Multimodal Framework for Rapid and Early Stroke Screening [0.7136933021609076]
This study presents a fast, non-invasive multimodal deep learning framework for automatic binary stroke screening based on data collected during the F.A.S.T. assessment. The proposed approach integrates complementary information from facial expressions, speech signals, and upper-body movements to enhance diagnostic robustness.
arXiv Detail & Related papers (2026-01-17T03:35:39Z) - TRIBE: TRImodal Brain Encoder for whole-brain fMRI response prediction [7.864304771129752]
TRIBE is the first deep neural network trained to predict brain responses to stimuli across multiple modalities. It precisely captures the spatial and temporal fMRI responses to videos. Our approach paves the way towards building an integrative model of representations in the human brain.
arXiv Detail & Related papers (2025-07-29T20:52:31Z) - Predicting Brain Responses To Natural Movies With Multimodal LLMs [0.881196878143281]
We present MedARC's team solution to the Algonauts 2025 challenge. Our pipeline leveraged rich multimodal representations from various state-of-the-art pretrained models across video (V-JEPA2), speech (Whisper), text (Llama 3.2), vision-text (InternVL3), and vision-text-audio (Qwen2.5-Omni). Our final submission achieved a mean Pearson's correlation of 0.2085 on the test split of withheld out-of-distribution movies, placing our team in fourth place for the competition.
arXiv Detail & Related papers (2025-07-26T13:57:08Z) - The ISLab Solution to the Algonauts Challenge 2025: A Multimodal Deep Learning Approach to Brain Response Prediction [7.293664607999047]
We present a network-specific approach for predicting brain responses to complex multimodal movies. We grouped the seven functional networks into four clusters and trained separate multi-subject, multi-layer perceptron (MLP) models for each.
arXiv Detail & Related papers (2025-07-25T10:21:06Z) - A Multimodal Seq2Seq Transformer for Predicting Brain Responses to Naturalistic Stimuli [0.0]
The Algonauts 2025 Challenge called on the community to develop encoding models that predict whole-brain fMRI responses to naturalistic multimodal movies. We propose a sequence-to-sequence Transformer that autoregressively predicts fMRI activity from visual, auditory, and language inputs.
arXiv Detail & Related papers (2025-07-24T05:29:37Z) - MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning [54.47710436807661]
MORSE-500 is a video benchmark composed of 500 fully scripted clips with embedded questions spanning six complementary reasoning categories. Each instance is generated using deterministic Python scripts (Manim, Matplotlib, MoviePy), generative video models, and real footage. Unlike static benchmarks that become obsolete once saturated, MORSE-500 is built to evolve.
arXiv Detail & Related papers (2025-06-05T19:12:45Z) - Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in The DCASE 2025 Challenge [102.84031769492708]
This task defines three QA subsets to test audio-language models on interactive question-answering over diverse acoustic scenes. Preliminary results on the development set are compared, showing strong variation across models and subsets. This challenge aims to advance the audio understanding and reasoning capabilities of audio-language models toward human-level acuity.
arXiv Detail & Related papers (2025-05-12T09:04:16Z) - RingMoE: Mixture-of-Modality-Experts Multi-Modal Foundation Models for Universal Remote Sensing Image Interpretation [24.48561340129571]
RingMoE is a unified RS foundation model with 14.7 billion parameters, pre-trained on 400 million multi-modal RS images from nine satellites. It has been deployed and trialed in multiple sectors, including emergency response, land management, marine sciences, and urban planning.
arXiv Detail & Related papers (2025-04-04T04:47:54Z) - The MuSe 2024 Multimodal Sentiment Analysis Challenge: Social Perception and Humor Recognition [64.5207572897806]
The Multimodal Sentiment Analysis Challenge (MuSe) 2024 addresses two contemporary multimodal affect and sentiment analysis problems.
In the Social Perception Sub-Challenge (MuSe-Perception), participants will predict 16 different social attributes of individuals.
The Cross-Cultural Humor Detection Sub-Challenge (MuSe-Humor) dataset expands upon the Passau Spontaneous Football Coach Humor dataset.
arXiv Detail & Related papers (2024-06-11T22:26:20Z) - Decoding speech perception from non-invasive brain recordings [48.46819575538446]
We introduce a model trained with contrastive learning to decode self-supervised representations of perceived speech from non-invasive recordings.
Our model can identify, from 3 seconds of MEG signals, the corresponding speech segment with up to 41% accuracy out of more than 1,000 distinct possibilities.
arXiv Detail & Related papers (2022-08-25T10:01:43Z) - Hybrid Multimodal Fusion for Dimensional Emotion Recognition [20.512310175499664]
We present in detail our solutions for the MuSe-Stress and MuSe-Physio sub-challenges of the Multimodal Sentiment Analysis Challenge (MuSe) 2021.
For the MuSe-Stress sub-challenge, we highlight three aspects of our solution: 1) audio-visual features and bio-signal features are used for emotional state recognition.
For the MuSe-Physio sub-challenge, we first extract the audio-visual features and the bio-signal features from multiple modalities.
arXiv Detail & Related papers (2021-10-16T06:57:18Z) - Deep Recurrent Encoder: A scalable end-to-end network to model brain signals [122.1055193683784]
We propose an end-to-end deep learning architecture trained to predict the brain responses of multiple subjects at once.
We successfully test this approach on a large cohort of magnetoencephalography (MEG) recordings acquired during a one-hour reading task.
arXiv Detail & Related papers (2021-03-03T11:39:17Z) - Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)