Related papers: MindGPT: Interpreting What You See with Non-invasive Brain Recordings

MindGPT: Interpreting What You See with Non-invasive Brain Recordings

URL: http://arxiv.org/abs/2309.15729v1
Date: Wed, 27 Sep 2023 15:35:20 GMT
Title: MindGPT: Interpreting What You See with Non-invasive Brain Recordings
Authors: Jiaxuan Chen, Yu Qi, Yueming Wang, Gang Pan
Abstract summary: We introduce a non-invasive neural decoder, termed as MindGPT, which interprets perceived visual stimuli into natural languages from fMRI signals. Our experiments show that the generated word sequences truthfully represented the visual information conveyed in the seen stimuli.
Score: 24.63828455553959
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Decoding of seen visual contents with non-invasive brain recordings has important scientific and practical values. Efforts have been made to recover the seen images from brain signals. However, most existing approaches cannot faithfully reflect the visual contents due to insufficient image quality or semantic mismatches. Compared with reconstructing pixel-level visual images, speaking is a more efficient and effective way to explain visual information. Here we introduce a non-invasive neural decoder, termed as MindGPT, which interprets perceived visual stimuli into natural languages from fMRI signals. Specifically, our model builds upon a visually guided neural encoder with a cross-attention mechanism, which permits us to guide latent neural representations towards a desired language semantic direction in an end-to-end manner by the collaborative use of the large language model GPT. By doing so, we found that the neural representations of the MindGPT are explainable, which can be used to evaluate the contributions of visual properties to language semantics. Our experiments show that the generated word sequences truthfully represented the visual information (with essential details) conveyed in the seen stimuli. The results also suggested that with respect to language decoding tasks, the higher visual cortex (HVC) is more semantically informative than the lower visual cortex (LVC), and using only the HVC can recover most of the semantic information. The code of the MindGPT model will be publicly available at https://github.com/JxuanC/MindGPT.

Related papers

DecoFuse: Decomposing and Fusing the "What", "Where", and "How" for Brain-Inspired fMRI-to-Video Decoding [82.91021399231184]
Existing fMRI-to-video methods often focus on semantic content while overlooking spatial and motion information. We propose DecoFuse, a novel brain-inspired framework for decoding videos from fMRI signals. It first decomposes the video into three components - semantic, spatial, and motion - then decodes each component separately before fusing them to reconstruct the video.
arXiv Detail & Related papers (2025-04-01T05:28:37Z)
From Eye to Mind: brain2text Decoding Reveals the Neural Mechanisms of Visual Semantic Processing [0.3069335774032178]
We introduce a paradigm shift by directly decoding fMRI signals into textual descriptions of viewed natural images. Our novel deep learning model, trained without visual input, achieves state-of-the-art semantic decoding performance. Neuroanatomical analysis reveals the critical role of higher-level visual regions, including MT+, ventral stream visual cortex, and inferior parietal cortex.
arXiv Detail & Related papers (2025-03-15T07:28:02Z)
LaVCa: LLM-assisted Visual Cortex Captioning [2.8265531928694116]
Recent encoding models using deep neural networks (DNNs) have successfully predicted voxel-wise activity. We propose LLM-assisted Visual Cortex Captioning (LaVCa) to generate captions for images to which voxels are selective.
arXiv Detail & Related papers (2025-02-19T10:37:04Z)
Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence [69.86946427928511]
We investigate the internal mechanisms driving hallucination in large vision-language models (LVLMs) We introduce Vision-aware Head Divergence (VHD), a metric that quantifies the sensitivity of attention head outputs to visual context. We propose Vision-aware Head Reinforcement (VHR), a training-free approach to mitigate hallucination by enhancing the role of vision-aware attention heads.
arXiv Detail & Related papers (2024-12-18T15:29:30Z)
Decoding Visual Experience and Mapping Semantics through Whole-Brain Analysis Using fMRI Foundation Models [10.615012396285337]
We develop algorithms to enhance our understanding of visual processes by incorporating whole-brain activation maps. We first compare our method with state-of-the-art approaches to decoding visual processing and show improved predictive semantic accuracy by 43%.
arXiv Detail & Related papers (2024-11-11T16:51:17Z)
Using Multimodal Deep Neural Networks to Disentangle Language from Visual Aesthetics [8.749640179057469]
We use linear decoding over the learned representations of unimodal vision, unimodal language, and multimodal deep neural network (DNN) models to predict human beauty ratings of naturalistic images. We show that unimodal vision models (e.g. SimCLR) account for the vast majority of explainable variance in these ratings. Language-aligned vision models (e.g. SLIP) yield small gains relative to unimodal vision. Taken together, these results suggest that whatever words we may eventually find to describe our experience of beauty, the ineffable computations of feedforward perception may provide sufficient foundation for that experience.
arXiv Detail & Related papers (2024-10-31T03:37:21Z)
BrainDecoder: Style-Based Visual Decoding of EEG Signals [2.1611379205310506]
Decoding neural representations of visual stimuli from electroencephalography (EEG) offers valuable insights into brain activity and cognition. Recent advancements in deep learning have significantly enhanced the field of visual decoding of EEG. We present a novel visual decoding pipeline that emphasizes the reconstruction of the style, such as color and texture, of images viewed by the subject.
arXiv Detail & Related papers (2024-09-09T02:14:23Z)
Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs [52.497823009176074]
Large Vision-Language Models (LVLMs) often produce responses that misalign with factual information, a phenomenon known as hallucinations. We introduce Visual Description Grounded Decoding (VDGD), a training-free method designed to enhance visual perception and improve reasoning capabilities in LVLMs.
arXiv Detail & Related papers (2024-05-24T16:21:59Z)
Saliency Suppressed, Semantics Surfaced: Visual Transformations in Neural Networks and the Brain [0.0]
We take inspiration from neuroscience to shed light on how neural networks encode information at low (visual saliency) and high (semantic similarity) levels of abstraction. We find that ResNets are more sensitive to saliency information than ViTs, when trained with object classification objectives. We show that semantic encoding is a key factor in aligning AI with human visual perception, while saliency suppression is a non-brain-like strategy.
arXiv Detail & Related papers (2024-04-29T15:05:42Z)
Hierarchical Text-to-Vision Self Supervised Alignment for Improved Histopathology Representation Learning [64.1316997189396]
We present a novel language-tied self-supervised learning framework, Hierarchical Language-tied Self-Supervision (HLSS) for histopathology images. Our resulting model achieves state-of-the-art performance on two medical imaging benchmarks, OpenSRH and TCGA datasets.
arXiv Detail & Related papers (2024-03-21T17:58:56Z)
DeViL: Decoding Vision features into Language [53.88202366696955]
Post-hoc explanation methods have often been criticised for abstracting away the decision-making process of deep neural networks. In this work, we would like to provide natural language descriptions for what different layers of a vision backbone have learned. We train a transformer network to translate individual image features of any vision layer into a prompt that a separate off-the-shelf language model decodes into natural language.
arXiv Detail & Related papers (2023-09-04T13:59:55Z)
Multimodal Neurons in Pretrained Text-Only Transformers [52.20828443544296]
We identify "multimodal neurons" that convert visual representations into corresponding text. We show that multimodal neurons operate on specific visual concepts across inputs, and have a systematic causal effect on image captioning.
arXiv Detail & Related papers (2023-08-03T05:27:12Z)
Seeing in Words: Learning to Classify through Language Bottlenecks [59.97827889540685]
Humans can explain their predictions using succinct and intuitive descriptions. We show that a vision model whose feature representations are text can effectively classify ImageNet images.
arXiv Detail & Related papers (2023-06-29T00:24:42Z)
Controlled Caption Generation for Images Through Adversarial Attacks [85.66266989600572]
We study adversarial examples for vision and language models, which typically adopt a Convolutional Neural Network (i.e., CNN) for image feature extraction and a Recurrent Neural Network (RNN) for caption generation. In particular, we investigate attacks on the visual encoder's hidden layer that is fed to the subsequent recurrent network. We propose a GAN-based algorithm for crafting adversarial examples for neural image captioning that mimics the internal representation of the CNN.
arXiv Detail & Related papers (2021-07-07T07:22:41Z)
Neural encoding with visual attention [17.020869686284165]
We propose a novel approach to neural encoding by including a trainable soft-attention module. We find that attention locations estimated by the model on independent data agree well with the corresponding eye fixation patterns.
arXiv Detail & Related papers (2020-10-01T16:04:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.