Brain encoding models based on multimodal transformers can transfer
across language and vision
- URL: http://arxiv.org/abs/2305.12248v1
- Date: Sat, 20 May 2023 17:38:44 GMT
- Title: Brain encoding models based on multimodal transformers can transfer
across language and vision
- Authors: Jerry Tang, Meng Du, Vy A. Vo, Vasudev Lal, Alexander G. Huth
- Abstract summary: We used representations from multimodal transformers to train encoding models that can transfer across fMRI responses to stories and movies.
We found that encoding models trained on brain responses to one modality can successfully predict brain responses to the other modality.
- Score: 60.72020004771044
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Encoding models have been used to assess how the human brain represents
concepts in language and vision. While language and vision rely on similar
concept representations, current encoding models are typically trained and
tested on brain responses to each modality in isolation. Recent advances in
multimodal pretraining have produced transformers that can extract aligned
representations of concepts in language and vision. In this work, we used
representations from multimodal transformers to train encoding models that can
transfer across fMRI responses to stories and movies. We found that encoding
models trained on brain responses to one modality can successfully predict
brain responses to the other modality, particularly in cortical regions that
represent conceptual meaning. Further analysis of these encoding models
revealed shared semantic dimensions that underlie concept representations in
language and vision. Comparing encoding models trained using representations
from multimodal and unimodal transformers, we found that multimodal
transformers learn more aligned representations of concepts in language and
vision. Our results demonstrate how multimodal transformers can provide
insights into the brain's capacity for multimodal processing.
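As a concrete illustration of the cross-modal transfer analysis described in the abstract, the sketch below fits a voxelwise ridge regression encoding model on multimodal transformer features for one modality (stories) and evaluates its predictions on the other (movies). This is a minimal sketch under stated assumptions, not the authors' released pipeline; the array names and the ridge penalty are illustrative placeholders.

```python
# Minimal sketch of a cross-modal encoding-model analysis. Assumes features
# have already been extracted from a multimodal transformer for the story
# (language) and movie (vision) stimuli and aligned to the fMRI TRs.
# All names below are illustrative placeholders, not the authors' code:
#   X_story : (n_story_TRs, n_features)   transformer features for the story
#   Y_story : (n_story_TRs, n_voxels)     fMRI responses to the story
#   X_movie : (n_movie_TRs, n_features)   transformer features for the movie
#   Y_movie : (n_movie_TRs, n_voxels)     fMRI responses to the movie
import numpy as np
from sklearn.linear_model import Ridge


def fit_encoding_model(X, Y, alpha=1000.0):
    """Fit a voxelwise ridge regression from stimulus features to BOLD responses."""
    return Ridge(alpha=alpha).fit(X, Y)


def voxelwise_correlation(Y_true, Y_pred):
    """Pearson correlation between measured and predicted responses, per voxel."""
    zt = (Y_true - Y_true.mean(0)) / Y_true.std(0)
    zp = (Y_pred - Y_pred.mean(0)) / Y_pred.std(0)
    return (zt * zp).mean(0)


# Train on responses to one modality, then test both within and across modalities:
# model = fit_encoding_model(X_story, Y_story)
# within_r = voxelwise_correlation(Y_story_heldout, model.predict(X_story_heldout))
# transfer_r = voxelwise_correlation(Y_movie, model.predict(X_movie))
```

Voxels where `transfer_r` is high without ever seeing training data from that modality are the ones the paper interprets as carrying modality-general conceptual representations.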
Related papers
- Vision-Language Integration in Multimodal Video Transformers (Partially) Aligns with the Brain [5.496000639803771]
We present a promising approach for probing a pre-trained multimodal video transformer model by leveraging neuroscientific evidence of multimodal information processing in the brain.
We find evidence that vision enhances masked prediction performance during language processing, supporting the idea that cross-modal representations in models can benefit the individual modalities.
We show that the brain alignment of the pre-trained joint representation can be improved by fine-tuning using a task that requires vision-language inferences.
arXiv Detail & Related papers (2023-11-13T21:32:37Z)
- A Multimodal Visual Encoding Model Aided by Introducing Verbal Semantic Information [5.142858130898767]
Previous visual encoding models did not incorporate verbal semantic information, contradicting biological findings.
This paper proposes a multimodal visual information encoding network model based on stimulus images and associated textual information.
Experimental results demonstrate that the proposed multimodal visual information encoding network model outperforms previous models.
arXiv Detail & Related papers (2023-08-29T09:21:48Z)
- Multimodal Neurons in Pretrained Text-Only Transformers [52.20828443544296]
We identify "multimodal neurons" that convert visual representations into corresponding text.
We show that multimodal neurons operate on specific visual concepts across inputs, and have a systematic causal effect on image captioning.
arXiv Detail & Related papers (2023-08-03T05:27:12Z)
- PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z)
- Instruction-Following Agents with Multimodal Transformer [95.70039658112873]
We propose a simple yet effective model for robots to solve instruction-following tasks in vision-based environments.
Our method consists of a multimodal transformer that encodes visual observations and language instructions.
We show that this unified transformer model outperforms all state-of-the-art pre-trained or trained-from-scratch methods in both single-task and multi-task settings.
arXiv Detail & Related papers (2022-10-24T17:46:47Z)
- Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training [62.215025958347105]
We propose a self-supervised learning paradigm with multi-modal masked autoencoders.
We learn cross-modal domain knowledge by reconstructing missing pixels and tokens from randomly masked images and texts.
arXiv Detail & Related papers (2022-09-15T07:26:43Z)
- Multimodal foundation models are better simulators of the human brain [65.10501322822881]
We present a newly-designed multimodal foundation model pre-trained on 15 million image-text pairs.
We find that both visual and linguistic encoders trained multimodally are more brain-like than their unimodal counterparts.
arXiv Detail & Related papers (2022-08-17T12:36:26Z)
- Visio-Linguistic Brain Encoding [3.944020612420711]
We systematically explore the efficacy of image Transformers and multi-modal Transformers for brain encoding.
We find that VisualBERT, a multi-modal Transformer, significantly outperforms previously proposed unimodal CNNs.
The supremacy of visio-linguistic models raises the question of whether the responses elicited in the visual regions are affected implicitly by linguistic processing.
arXiv Detail & Related papers (2022-04-18T11:28:18Z)
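Expanding on the feature-space comparisons above (for example, multimodal transformer features versus unimodal CNN features as predictors of brain responses), the sketch below scores each candidate feature space by its cross-validated voxelwise encoding performance. It is a hypothetical illustration, not any paper's exact protocol: the array names, the ridge penalty, and the use of plain KFold splits (rather than the contiguous held-out runs typically used for fMRI time series) are all assumptions.

```python
# Minimal sketch of comparing two candidate feature spaces for brain encoding.
# Assumes both feature matrices describe the same stimuli and are aligned to
# the same fMRI responses; names are illustrative placeholders.
#   X_unimodal   : (n_TRs, n_feat_a)   e.g. unimodal CNN features
#   X_multimodal : (n_TRs, n_feat_b)   e.g. multimodal transformer features
#   Y            : (n_TRs, n_voxels)   fMRI responses
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold


def cv_voxelwise_r(X, Y, alpha=1000.0, n_splits=5):
    """Mean cross-validated correlation between measured and predicted responses."""
    r = np.zeros(Y.shape[1])
    for train, test in KFold(n_splits=n_splits).split(X):
        pred = Ridge(alpha=alpha).fit(X[train], Y[train]).predict(X[test])
        zt = (Y[test] - Y[test].mean(0)) / Y[test].std(0)
        zp = (pred - pred.mean(0)) / pred.std(0)
        r += (zt * zp).mean(0)
    return r / n_splits


# Voxels with a positive difference are better predicted by multimodal features:
# delta_r = cv_voxelwise_r(X_multimodal, Y) - cv_voxelwise_r(X_unimodal, Y)
```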
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.