Predicting Brain Responses To Natural Movies With Multimodal LLMs
- URL: http://arxiv.org/abs/2507.19956v1
- Date: Sat, 26 Jul 2025 13:57:08 GMT
- Title: Predicting Brain Responses To Natural Movies With Multimodal LLMs
- Authors: Cesar Kadir Torrico Villanueva, Jiaxin Cindy Tu, Mihir Tripathy, Connor Lane, Rishab Iyer, Paul S. Scotti,
- Abstract summary: We present MedARC's team solution to the Algonauts 2025 challenge. Our pipeline leveraged rich multimodal representations from various state-of-the-art pretrained models across video (V-JEPA2), speech (Whisper), text (Llama 3.2), vision-text (InternVL3), and vision-text-audio (Qwen2.5-Omni). Our final submission achieved a mean Pearson's correlation of 0.2085 on the test split of withheld out-of-distribution movies, placing our team in fourth place for the competition.
- Score: 0.881196878143281
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present MedARC's team solution to the Algonauts 2025 challenge. Our pipeline leveraged rich multimodal representations from various state-of-the-art pretrained models across video (V-JEPA2), speech (Whisper), text (Llama 3.2), vision-text (InternVL3), and vision-text-audio (Qwen2.5-Omni). These features extracted from the models were linearly projected to a latent space, temporally aligned to the fMRI time series, and finally mapped to cortical parcels through a lightweight encoder comprising a shared group head plus subject-specific residual heads. We trained hundreds of model variants across hyperparameter settings, validated them on held-out movies and assembled ensembles targeted to each parcel in each subject. Our final submission achieved a mean Pearson's correlation of 0.2085 on the test split of withheld out-of-distribution movies, placing our team in fourth place for the competition. We further discuss a last-minute optimization that would have raised us to second place. Our results highlight how combining features from models trained in different modalities, using a simple architecture consisting of shared-subject and single-subject components, and conducting comprehensive model selection and ensembling improves generalization of encoding models to novel movie stimuli. All code is available on GitHub.
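To make the described encoder concrete, here is a minimal PyTorch sketch, not the authors' released code: module names, dimensions, and the exact form of the residual heads are assumptions, but the structure follows the abstract (a linear projection into a latent space, a shared group head over cortical parcels, a subject-specific residual head, and parcel-wise Pearson evaluation).

```python
# Minimal sketch (assumptions, not the authors' released code) of the
# shared-group-head + subject-specific-residual encoder and the
# parcel-wise Pearson evaluation described in the abstract.
import torch
import torch.nn as nn

class SharedPlusResidualEncoder(nn.Module):
    def __init__(self, feature_dim, latent_dim, n_parcels, subject_ids):
        super().__init__()
        # Linear projection of pretrained-model features into a shared latent space.
        self.project = nn.Linear(feature_dim, latent_dim)
        # Shared "group" head mapping the latent space to all cortical parcels.
        self.group_head = nn.Linear(latent_dim, n_parcels)
        # Subject-specific residual heads added on top of the group prediction.
        self.residual_heads = nn.ModuleDict(
            {sid: nn.Linear(latent_dim, n_parcels) for sid in subject_ids}
        )

    def forward(self, features, subject_id):
        # features: (n_timepoints, feature_dim), already aligned to the fMRI TRs.
        z = self.project(features)
        return self.group_head(z) + self.residual_heads[subject_id](z)

def parcelwise_pearson(pred, target, eps=1e-8):
    """Pearson's r per parcel over time; the challenge score is the mean across parcels."""
    p = pred - pred.mean(dim=0, keepdim=True)
    t = target - target.mean(dim=0, keepdim=True)
    return (p * t).sum(dim=0) / (p.norm(dim=0) * t.norm(dim=0) + eps)

# Example usage with made-up shapes: 4 subjects, 1000 parcels, 512-d features.
model = SharedPlusResidualEncoder(512, 256, 1000, ["sub-01", "sub-02", "sub-03", "sub-04"])
pred = model(torch.randn(300, 512), "sub-01")
```

Keeping most capacity in the shared head while letting small subject-specific residuals absorb individual differences is one plausible reading of the "shared-subject and single-subject components" the abstract credits for better generalization.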
Related papers
- Stacked Regression using Off-the-shelf, Stimulus-tuned and Fine-tuned Neural Networks for Predicting fMRI Brain Responses to Movies (Algonauts 2025 Report) [1.7266027274320124]
We present our submission to the Algonauts 2025 Challenge. The goal is to predict fMRI brain responses to movie stimuli. Our approach integrates multimodal representations from large language models, video encoders, audio models, and vision-language models.
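Below is a hypothetical sketch of stacked regression over several feature spaces in the spirit of this entry's title; it is not the authors' code, and the per-space ridge encoders, cross-validated stacking, and all names and shapes are assumptions.

```python
# Hypothetical stacked-regression sketch: one ridge encoder per feature space,
# combined by a second-level regressor fit on out-of-fold predictions.
# All names, shapes, and hyperparameters are assumptions.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

def fit_stacked_encoder(feature_spaces, y, alpha=100.0):
    # feature_spaces: list of (n_timepoints, n_features_i) arrays from
    # different pretrained models; y: (n_timepoints,) response of one parcel.
    base_models, oof_preds = [], []
    for X in feature_spaces:
        # Out-of-fold predictions keep the stacking layer from overfitting.
        oof_preds.append(cross_val_predict(Ridge(alpha=alpha), X, y, cv=5))
        base_models.append(Ridge(alpha=alpha).fit(X, y))
    stacker = Ridge(alpha=1.0).fit(np.column_stack(oof_preds), y)
    return base_models, stacker

def predict_stacked(base_models, stacker, feature_spaces):
    preds = np.column_stack([m.predict(X) for m, X in zip(base_models, feature_spaces)])
    return stacker.predict(preds)
```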
arXiv Detail & Related papers (2025-10-02T15:24:16Z)
- Multimodal Recurrent Ensembles for Predicting Brain Responses to Naturalistic Movies (Algonauts 2025) [0.0]
We present a hierarchical multimodal recurrent ensemble that maps pretrained video, audio, and language embeddings to fMRI time series. Training relies on a composite MSE-correlation loss and a curriculum that gradually shifts emphasis from early sensory robustness to late association regions. The approach establishes a simple, naturalistic baseline for future multimodal brain-encoding benchmarks.
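A minimal sketch of a composite MSE-correlation loss of the kind mentioned here follows; the convex weighting `alpha` and exact formulation are assumptions, not taken from the paper.

```python
# Hypothetical composite MSE-correlation loss: penalize squared error while
# rewarding per-parcel correlation. `alpha` and the formulation are assumptions.
import torch

def mse_corr_loss(pred, target, alpha=0.5, eps=1e-8):
    # pred, target: (n_timepoints, n_parcels) predicted and measured fMRI responses.
    mse = torch.mean((pred - target) ** 2)
    p = pred - pred.mean(dim=0, keepdim=True)
    t = target - target.mean(dim=0, keepdim=True)
    corr = ((p * t).sum(dim=0) / (p.norm(dim=0) * t.norm(dim=0) + eps)).mean()
    # Minimize squared error while pushing mean per-parcel correlation up.
    return alpha * mse + (1.0 - alpha) * (1.0 - corr)
```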
arXiv Detail & Related papers (2025-07-23T19:48:27Z)
- UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks [3.466119510238668]
Real-world user-generated videos, especially on platforms like TikTok, often feature rich and intertwined audio-visual content. Existing video captioning benchmarks and models remain predominantly visual-centric, overlooking the crucial role of audio in conveying scene dynamics, speaker intent, and narrative context. To address these challenges, we introduce UGC-VideoCap, a new benchmark and model framework specifically designed for detailed omnimodal captioning of short-form user-generated videos.
arXiv Detail & Related papers (2025-07-15T14:08:29Z)
- VideoMolmo: Spatio-Temporal Grounding Meets Pointing [66.19964563104385]
VideoMolmo is a model tailored for fine-grained pointing in video sequences. A novel temporal mask fusion employs SAM2 for bidirectional point propagation. To evaluate the generalization of VideoMolmo, we introduce VPoMolS-temporal, a challenging out-of-distribution benchmark spanning five real-world scenarios.
arXiv Detail & Related papers (2025-06-05T17:59:29Z)
- Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis [13.702423348269155]
Video-Text to Speech (VTTS) is a speech generation task conditioned on both the corresponding text and video of talking people. We introduce Visatronic, a unified multimodal decoder-only transformer model that embeds visual, textual, and speech inputs into a shared subspace. We show that Visatronic achieves a 4.5% WER, outperforming prior SOTA methods trained only on LRS3.
arXiv Detail & Related papers (2024-11-26T18:57:29Z)
- xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations [120.52120919834988]
xGen-VideoSyn-1 is a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions.
VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens.
The DiT model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios.
arXiv Detail & Related papers (2024-08-22T17:55:22Z)
- RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [93.18404922542702]
We present a novel video generative model designed to address long-term spatial and temporal dependencies.
Our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks.
Our model synthesizes high-fidelity video clips at a resolution of $256\times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
arXiv Detail & Related papers (2024-01-11T16:48:44Z)
- Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action [46.76487873983082]
Unified-IO 2 is the first autoregressive multimodal model capable of understanding and generating image, text, audio, and action.
We train our model from scratch on a large multimodal pre-training corpus from diverse sources.
With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark.
arXiv Detail & Related papers (2023-12-28T17:57:06Z)
- mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video [89.19867891570945]
mPLUG-2 is a new unified paradigm with modularized design for multi-modal pretraining.
It shares common universal modules for modality collaboration and disentangles modality-specific modules to deal with modality entanglement.
Different modules can be flexibly selected for different understanding and generation tasks across all modalities, including text, image, and video.
arXiv Detail & Related papers (2023-02-01T12:40:03Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self-cross-integration for different sources (video and dense captions), and gates that pass on more relevant information.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
- HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training [75.55823420847759]
We present HERO, a novel framework for large-scale video+language omni-representation learning.
HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer.
HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions.
arXiv Detail & Related papers (2020-05-01T03:49:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented (including all content) and is not responsible for any consequences arising from its use.