Probing Multimodal Fusion in the Brain: The Dominance of Audiovisual Streams in Naturalistic Encoding
- URL: http://arxiv.org/abs/2507.19052v1
- Date: Fri, 25 Jul 2025 08:12:26 GMT
- Title: Probing Multimodal Fusion in the Brain: The Dominance of Audiovisual Streams in Naturalistic Encoding
- Authors: Hamid Abdollahi, Amir Hossein Mansouri Majoumerd, Amir Hossein Bagheri Baboukani, Amir Abolfazl Suratgar, Mohammad Bagher Menhaj
- Abstract summary: We develop brain encoding models using state-of-the-art visual (X-CLIP) and auditory (Whisper) feature extractors. We rigorously evaluate them on both in-distribution (ID) and diverse out-of-distribution (OOD) data.
- Score: 1.2233362977312945
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Predicting brain activity in response to naturalistic, multimodal stimuli is a key challenge in computational neuroscience. While encoding models are becoming more powerful, their ability to generalize to truly novel contexts remains a critical, often untested, question. In this work, we developed brain encoding models using state-of-the-art visual (X-CLIP) and auditory (Whisper) feature extractors and rigorously evaluated them on both in-distribution (ID) and diverse out-of-distribution (OOD) data. Our results reveal a fundamental trade-off between model complexity and generalization: a higher-capacity attention-based model excelled on ID data, but a simpler linear model was more robust, outperforming a competitive baseline by 18% on the OOD set. Intriguingly, we found that linguistic features did not improve predictive accuracy, suggesting that for familiar languages, neural encoding may be dominated by the continuous visual and auditory streams over redundant textual information. Spatially, our approach showed marked performance gains in the auditory cortex, underscoring the benefit of high-fidelity speech representations. Collectively, our findings demonstrate that rigorous OOD testing is essential for building robust neuro-AI models and provide nuanced insights into how model architecture, stimulus characteristics, and sensory hierarchies shape the neural encoding of our rich, multimodal world.
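To make the linear baseline concrete, below is a minimal sketch of a voxel-wise ridge encoding model: concatenated visual and auditory stimulus features are mapped to fMRI responses, and generalization is scored by per-voxel Pearson correlation on a held-out block. The random arrays stand in for X-CLIP and Whisper embeddings (feature extraction is omitted), and all shapes, variable names, and the regularization grid are illustrative assumptions rather than the authors' exact pipeline.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Stand-ins for stimulus features: in the paper these would come from
# X-CLIP (visual) and Whisper (auditory). All shapes are illustrative.
rng = np.random.default_rng(0)
n_tr, d_vis, d_aud, n_vox = 1000, 512, 384, 2000
X_vis = rng.standard_normal((n_tr, d_vis))
X_aud = rng.standard_normal((n_tr, d_aud))
X = np.hstack([X_vis, X_aud])           # fused audiovisual design matrix
Y = rng.standard_normal((n_tr, n_vox))  # fMRI responses (TRs x voxels)

# Chronological split so the held-out block is temporally out-of-sample.
split = int(0.8 * n_tr)
X_tr, X_te, Y_tr, Y_te = X[:split], X[split:], Y[:split], Y[split:]

# One ridge map for all voxels; the penalty is picked by cross-validation.
model = RidgeCV(alphas=np.logspace(-1, 4, 6)).fit(X_tr, Y_tr)
Y_hat = model.predict(X_te)

def voxelwise_r(y_true, y_pred):
    """Per-voxel Pearson r, the usual encoding-model score."""
    yt = y_true - y_true.mean(axis=0)
    yp = y_pred - y_pred.mean(axis=0)
    return (yt * yp).sum(axis=0) / (
        np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0))

print("median voxel r:", np.median(voxelwise_r(Y_te, Y_hat)))
```

Evaluating the same fitted model on responses to a different stimulus set would turn this held-out score into the OOD test the abstract describes.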
Related papers
- A Multimodal Seq2Seq Transformer for Predicting Brain Responses to Naturalistic Stimuli [0.0]
The Algonauts 2025 Challenge called on the community to develop encoding models that predict whole-brain fMRI responses to naturalistic multimodal movies. We propose a sequence-to-sequence Transformer that autoregressively predicts fMRI activity from visual, auditory, and language inputs.
arXiv Detail & Related papers (2025-07-24T05:29:37Z)
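As a hedged illustration of the sequence-to-sequence idea in the entry above, this sketch encodes a window of fused stimulus features with a Transformer and decodes fMRI time points autoregressively under a causal mask (teacher forcing). Module names, dimensions, and depths are assumptions for illustration, not the challenge submission's architecture.

```python
import torch
import torch.nn as nn

class Seq2SeqBrainEncoder(nn.Module):
    """Encode stimulus features, decode fMRI TRs autoregressively (sketch)."""
    def __init__(self, d_stim=896, d_model=256, n_vox=1000):
        super().__init__()
        self.stim_proj = nn.Linear(d_stim, d_model)  # fused features -> model dim
        self.bold_proj = nn.Linear(n_vox, d_model)   # previous TRs -> model dim
        self.core = nn.Transformer(d_model=d_model, nhead=4,
                                   num_encoder_layers=2, num_decoder_layers=2,
                                   batch_first=True)
        self.readout = nn.Linear(d_model, n_vox)     # model dim -> voxels

    def forward(self, stim, prev_bold):
        # Causal mask: TR t may only attend to TRs before it (teacher forcing).
        tgt_mask = self.core.generate_square_subsequent_mask(prev_bold.size(1))
        h = self.core(self.stim_proj(stim), self.bold_proj(prev_bold),
                      tgt_mask=tgt_mask)
        return self.readout(h)

model = Seq2SeqBrainEncoder()
stim = torch.randn(2, 20, 896)   # batch x TRs x fused multimodal features
bold = torch.randn(2, 20, 1000)  # batch x TRs x voxels (shifted targets)
print(model(stim, bold).shape)   # torch.Size([2, 20, 1000])
```

At inference time one would decode TR by TR, feeding each prediction back as the next decoder input.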
- Meta-Learning an In-Context Transformer Model of Human Higher Visual Cortex [5.283925904540581]
BraInCoRL uses in-context learning to predict voxelwise neural responses from few-shot examples. We show that BraInCoRL consistently outperforms existing voxelwise encoder designs in a low-data regime. BraInCoRL facilitates better interpretability of neural signals in higher visual cortex by attending to semantically relevant stimuli.
arXiv Detail & Related papers (2025-05-21T17:59:41Z)
- Mind the Gap: Aligning the Brain with Language Models Requires a Nonlinear and Multimodal Approach [4.1606197342190105]
Self-supervised language and audio models effectively predict brain responses to speech. Traditional prediction models rely on linear mappings from unimodal features. Here, we introduce a nonlinear, multimodal prediction model that combines audio and linguistic features from pre-trained models.
arXiv Detail & Related papers (2025-02-18T11:33:28Z)
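A minimal sketch of the nonlinear, multimodal mapping the entry above argues for, as opposed to a linear map from unimodal features: audio and linguistic embeddings are concatenated and passed through a small MLP before a voxel readout. Layer sizes, the GELU choice, and all names here are illustrative assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

class NonlinearFusionEncoder(nn.Module):
    """Fuse audio and language features nonlinearly before the voxel readout."""
    def __init__(self, d_audio=768, d_lang=768, d_hidden=512, n_vox=1000):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(d_audio + d_lang, d_hidden),
            nn.GELU(),  # the nonlinearity a linear mapping lacks
            nn.Linear(d_hidden, d_hidden),
            nn.GELU(),
        )
        self.readout = nn.Linear(d_hidden, n_vox)

    def forward(self, audio_feat, lang_feat):
        fused = torch.cat([audio_feat, lang_feat], dim=-1)  # multimodal input
        return self.readout(self.fuse(fused))

enc = NonlinearFusionEncoder()
pred = enc(torch.randn(8, 768), torch.randn(8, 768))  # 8 time points
print(pred.shape)  # torch.Size([8, 1000])
```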
- NeuralOOD: Improving Out-of-Distribution Generalization Performance with Brain-machine Fusion Learning Framework [13.25912138698749]
We propose a novel Brain-machine Fusion Learning framework to fuse visual knowledge from a CV model with prior cognitive knowledge from the human brain.
We employ a pre-trained visual neural encoding model to predict functional Magnetic Resonance Imaging (fMRI) responses from visual features.
Our model outperforms the DINOv2 and baseline models on the ImageNet-1k validation dataset as well as six curated OOD datasets.
arXiv Detail & Related papers (2024-08-27T10:54:37Z)
- Du-IN: Discrete units-guided mask modeling for decoding speech from Intracranial Neural signals [5.283718601431859]
Invasive brain-computer interfaces with Electrocorticography (ECoG) have shown promise for high-performance speech decoding in medical applications.
We developed the Du-IN model, which extracts contextual embeddings based on region-level tokens through discrete codex-guided mask modeling.
Our model achieves state-of-the-art performance on the 61-word classification task, surpassing all baselines.
arXiv Detail & Related papers (2024-05-19T06:00:36Z)
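For readers unfamiliar with the "discrete codex-guided mask modeling" in the Du-IN entry above, here is a generic HuBERT-style sketch under stated assumptions: region-level neural embeddings are assigned to their nearest codebook entry, some time steps are masked, and a small Transformer is trained to classify the discrete unit of each masked step. This illustrates the general technique only, not Du-IN's actual architecture; all sizes and names are made up for the example.

```python
import torch
import torch.nn as nn

class MaskedUnitModel(nn.Module):
    """Sketch: predict discrete units of masked region-level neural tokens."""
    def __init__(self, d=128, n_codes=512):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, d)        # the discrete "codex"
        self.mask_token = nn.Parameter(torch.zeros(d))  # replaces masked steps
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, n_codes)               # unit classifier

    def forward(self, tokens, mask):
        # tokens: (batch, time, d) embeddings; mask: (batch, time) bool.
        # Targets: index of the nearest codebook entry for each time step.
        with torch.no_grad():
            dists = ((tokens.unsqueeze(2) - self.codebook.weight) ** 2).sum(-1)
            targets = dists.argmin(-1)
        # Replace masked steps with the learned mask token, then encode.
        x = torch.where(mask.unsqueeze(-1), self.mask_token, tokens)
        logits = self.head(self.encoder(x))
        # Loss only on masked positions, as in mask modeling.
        return nn.functional.cross_entropy(logits[mask], targets[mask])

model = MaskedUnitModel()
tokens = torch.randn(2, 64, 128)
mask = torch.rand(2, 64) < 0.15
print(model(tokens, mask))  # scalar training loss
```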
- On the Robustness of Aspect-based Sentiment Analysis: Rethinking Model, Data, and Training [109.9218185711916]
Aspect-based sentiment analysis (ABSA) aims at automatically inferring the specific sentiment polarities toward certain aspects of products or services behind social media texts or reviews.
We propose to enhance the ABSA robustness by systematically rethinking the bottlenecks from all possible angles, including model, data, and training.
arXiv Detail & Related papers (2023-04-19T11:07:43Z)
- Multimodal foundation models are better simulators of the human brain [65.10501322822881]
We present a newly-designed multimodal foundation model pre-trained on 15 million image-text pairs.
We find that both visual and language encoders trained multimodally are more brain-like than their unimodal counterparts.
arXiv Detail & Related papers (2022-08-17T12:36:26Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
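For a concrete sense of how "self-supervised models of audio" feed such analyses, the sketch below pulls layer-wise wav2vec 2.0 activations with torchaudio; each layer's activations can then be regressed against cortical responses, with different layers typically explaining different stages of the auditory hierarchy. The specific bundle and usage are a plausible setup, not necessarily the models used in that paper.

```python
import torch
import torchaudio

# A pretrained self-supervised speech model (weights download on first use).
bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

# Five seconds of placeholder audio at the model's expected sample rate.
waveform = torch.randn(1, int(bundle.sample_rate * 5))

with torch.inference_mode():
    # extract_features returns one (batch, frames, channels) tensor per
    # transformer layer; each layer can serve as the design matrix of a
    # separate encoding model for a different cortical stage.
    feats, _ = model.extract_features(waveform)

print(len(feats), feats[0].shape)  # 12 layers for the BASE model
```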
- Great Truths are Always Simple: A Rather Simple Knowledge Encoder for Enhancing the Commonsense Reasoning Capacity of Pre-Trained Models [89.98762327725112]
Commonsense reasoning in natural language is a desired ability of artificial intelligence systems.
For solving complex commonsense reasoning tasks, a typical solution is to enhance pre-trained language models (PTMs) with a knowledge-aware graph neural network (GNN) encoder.
Despite their effectiveness, these approaches are built on heavy architectures and cannot clearly explain how external knowledge resources improve the reasoning capacity of PTMs.
arXiv Detail & Related papers (2022-05-04T01:27:36Z)
- WenLan 2.0: Make AI Imagine via a Multimodal Foundation Model [74.4875156387271]
We develop a novel foundation model pre-trained on large-scale multimodal (visual and textual) data.
We show that state-of-the-art results can be obtained on a wide range of downstream tasks.
arXiv Detail & Related papers (2021-10-27T12:25:21Z)
- Rethinking Generalization of Neural Models: A Named Entity Recognition Case Study [81.11161697133095]
We take the NER task as a testbed to analyze the generalization behavior of existing models from different perspectives.
Experiments with in-depth analyses diagnose the bottleneck of existing neural NER models.
As a by-product of this paper, we have open-sourced a project that provides a comprehensive summary of recent NER papers.
arXiv Detail & Related papers (2020-01-12T04:33:53Z)