Visio-Linguistic Brain Encoding
- URL: http://arxiv.org/abs/2204.08261v1
- Date: Mon, 18 Apr 2022 11:28:18 GMT
- Title: Visio-Linguistic Brain Encoding
- Authors: Subba Reddy Oota, Jashn Arora, Vijay Rowtula, Manish Gupta, Raju S. Bapi
- Abstract summary: We systematically explore the efficacy of image Transformers and multi-modal Transformers for brain encoding.
We find that VisualBERT, a multi-modal Transformer, significantly outperforms previously proposed single-mode CNNs.
The supremacy of visio-linguistic models raises the question of whether the responses elicited in the visual regions are affected implicitly by linguistic processing.
- Score: 3.944020612420711
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Enabling effective brain-computer interfaces requires understanding how the
human brain encodes stimuli across modalities such as visual, language (or
text), etc. Brain encoding aims at constructing fMRI brain activity given a
stimulus. There exists a plethora of neural encoding models which study brain
encoding for single-mode stimuli: visual (pretrained CNNs) or text (pretrained
language models). A few recent papers have also obtained separate visual and text
representation models and performed late fusion using simple heuristics.
However, previous work has failed to explore: (a) the effectiveness of image
Transformer models for encoding visual stimuli, and (b) co-attentive
multi-modal modeling for visual and text reasoning. In this paper, we
systematically explore the efficacy of image Transformers (ViT, DEiT, and BEiT)
and multi-modal Transformers (VisualBERT, LXMERT, and CLIP) for brain encoding.
Extensive experiments on two popular datasets, BOLD5000 and Pereira, provide
the following insights. (1) To the best of our knowledge, we are the first to
investigate the effectiveness of image and multi-modal Transformers for brain
encoding. (2) We find that VisualBERT, a multi-modal Transformer, significantly
outperforms previously proposed single-mode CNNs and image Transformers, as well as
other previously proposed multi-modal models, thereby establishing a new
state of the art. The supremacy of visio-linguistic models raises the question
of whether the responses elicited in the visual regions are implicitly affected
by linguistic processing even when subjects passively view images. Future fMRI tasks
can verify this computational insight in an appropriate experimental setting.
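For readers unfamiliar with the encoding setup, the sketch below illustrates the general pipeline such work follows: stimulus features are extracted from a pretrained model (CLIP is used here, as one of the multi-modal Transformers the paper evaluates), and a per-voxel regularised linear regression is fit to predict fMRI responses, evaluated with voxel-wise Pearson correlation. This is a minimal sketch under illustrative assumptions; the checkpoint, file names, feature layers, and regression settings are placeholders and not the paper's exact configuration.

```python
# Minimal sketch of a voxel-wise brain-encoding pipeline, assuming CLIP image
# features and ridge regression; the paper's exact feature layers, dataset
# preprocessing (BOLD5000 / Pereira) and hyperparameters may differ.
import numpy as np
import torch
from PIL import Image
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_features(paths):
    # Encode each stimulus image into a CLIP image embedding (one row per image).
    feats = []
    with torch.no_grad():
        for p in paths:
            inputs = processor(images=Image.open(p).convert("RGB"), return_tensors="pt")
            feats.append(model.get_image_features(**inputs).squeeze(0).numpy())
    return np.stack(feats)

# Hypothetical inputs: paths to the presented stimulus images and a matrix of
# preprocessed voxel responses with shape (n_stimuli, n_voxels). In practice
# there are thousands of stimuli per subject.
stimulus_paths = ["stimuli/img_0001.png", "stimuli/img_0002.png"]  # placeholder paths
X = image_features(stimulus_paths)
Y = np.load("voxel_responses.npy")  # placeholder file of fMRI responses

# Fit one regularised linear map per voxel and predict held-out responses.
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)
encoder = RidgeCV(alphas=np.logspace(-2, 4, 7)).fit(X_tr, Y_tr)
pred = encoder.predict(X_te)

# Voxel-wise Pearson correlation between predicted and measured responses,
# a standard evaluation for encoding models.
r = [np.corrcoef(pred[:, v], Y_te[:, v])[0, 1] for v in range(Y_te.shape[1])]
print("mean voxel correlation:", float(np.mean(r)))
```

In the paper, representations from multi-modal Transformers such as VisualBERT take the place of the image-only features in this sketch, and encoding performance is compared across models and brain regions.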
Related papers
- Modality-Agnostic fMRI Decoding of Vision and Language [4.837421245886033]
We introduce and use a new large-scale fMRI dataset (8,500 trials per subject) of people watching both images and text descriptions.
This novel dataset enables the development of modality-agnostic decoders: a single decoder that can predict which stimulus a subject is seeing.
We train and evaluate such decoders to map brain signals onto stimulus representations from a large range of publicly available vision, language and multimodal (vision+language) models.
arXiv Detail & Related papers (2024-03-18T13:30:03Z)
- A Multimodal Visual Encoding Model Aided by Introducing Verbal Semantic Information [5.142858130898767]
Previous visual encoding models did not incorporate verbal semantic information, which is inconsistent with biological findings.
This paper proposes a multimodal visual information encoding network model based on stimulus images and associated textual information.
Experimental results demonstrate that the proposed multimodal visual information encoding network model outperforms previous models.
arXiv Detail & Related papers (2023-08-29T09:21:48Z)
- Multimodal Neurons in Pretrained Text-Only Transformers [52.20828443544296]
We identify "multimodal neurons" that convert visual representations into corresponding text.
We show that multimodal neurons operate on specific visual concepts across inputs, and have a systematic causal effect on image captioning.
arXiv Detail & Related papers (2023-08-03T05:27:12Z)
- Brain encoding models based on multimodal transformers can transfer across language and vision [60.72020004771044]
We used representations from multimodal transformers to train encoding models that can transfer across fMRI responses to stories and movies.
We found that encoding models trained on brain responses to one modality can successfully predict brain responses to the other modality.
arXiv Detail & Related papers (2023-05-20T17:38:44Z)
- Brain Captioning: Decoding human brain activity into images and text [1.5486926490986461]
We present an innovative method for decoding brain activity into meaningful images and captions.
Our approach takes advantage of cutting-edge image captioning models and incorporates a unique image reconstruction pipeline.
We evaluate our methods using quantitative metrics for both generated captions and images.
arXiv Detail & Related papers (2023-05-19T09:57:19Z)
- BrainCLIP: Bridging Brain and Visual-Linguistic Representation Via CLIP for Generic Natural Visual Stimulus Decoding [51.911473457195555]
BrainCLIP is a task-agnostic fMRI-based brain decoding model.
It bridges the modality gap between brain activity, image, and text.
BrainCLIP can reconstruct visual stimuli with high semantic fidelity.
arXiv Detail & Related papers (2023-02-25T03:28:54Z)
- Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training [62.215025958347105]
We propose a self-supervised learning paradigm with multi-modal masked autoencoders.
We learn cross-modal domain knowledge by reconstructing missing pixels and tokens from randomly masked images and texts.
arXiv Detail & Related papers (2022-09-15T07:26:43Z)
- Multimodal foundation models are better simulators of the human brain [65.10501322822881]
We present a newly-designed multimodal foundation model pre-trained on 15 million image-text pairs.
We find that both visual and linguistic encoders trained multimodally are more brain-like than unimodal ones.
arXiv Detail & Related papers (2022-08-17T12:36:26Z)
- VL-BEiT: Generative Vision-Language Pretraining [107.25298505511184]
We introduce a vision-language foundation model called VL-BEiT, which is a bidirectional multimodal Transformer learned by generative pretraining.
Specifically, we perform masked vision-language modeling on image-text pairs, masked language modeling on texts, and masked image modeling on images.
arXiv Detail & Related papers (2022-06-02T16:14:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.