LaVCa: LLM-assisted Visual Cortex Captioning
- URL: http://arxiv.org/abs/2502.13606v1
- Date: Wed, 19 Feb 2025 10:37:04 GMT
- Title: LaVCa: LLM-assisted Visual Cortex Captioning
- Authors: Takuya Matsuyama, Shinji Nishimoto, Yu Takagi,
- Abstract summary: Recent encoding models using deep neural networks (DNNs) have successfully predicted voxel-wise activity.
We propose LLM-assisted Visual Cortex Captioning (LaVCa) to generate captions for images to which voxels are selective.
- Score: 2.8265531928694116
- License:
- Abstract: Understanding the property of neural populations (or voxels) in the human brain can advance our comprehension of human perceptual and cognitive processing capabilities and contribute to developing brain-inspired computer models. Recent encoding models using deep neural networks (DNNs) have successfully predicted voxel-wise activity. However, interpreting the properties that explain voxel responses remains challenging because of the black-box nature of DNNs. As a solution, we propose LLM-assisted Visual Cortex Captioning (LaVCa), a data-driven approach that uses large language models (LLMs) to generate natural-language captions for images to which voxels are selective. By applying LaVCa for image-evoked brain activity, we demonstrate that LaVCa generates captions that describe voxel selectivity more accurately than the previously proposed method. Furthermore, the captions generated by LaVCa quantitatively capture more detailed properties than the existing method at both the inter-voxel and intra-voxel levels. Furthermore, a more detailed analysis of the voxel-specific properties generated by LaVCa reveals fine-grained functional differentiation within regions of interest (ROIs) in the visual cortex and voxels that simultaneously represent multiple distinct concepts. These findings offer profound insights into human visual representations by assigning detailed captions throughout the visual cortex while highlighting the potential of LLM-based methods in understanding brain representations. Please check out our webpage at https://sites.google.com/view/lavca-llm/
Related papers
- Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models [66.71616369573715]
Large Vision-Language Models (LVLMs) are prone to generating hallucinatory text responses that do not align with the given visual input.
We introduce self-correcting Decoding with Generative Feedback (DeGF), a novel training-free algorithm that incorporates feedback from text-to-image generative models into the decoding process.
arXiv Detail & Related papers (2025-02-10T03:43:55Z) - Brain Mapping with Dense Features: Grounding Cortical Semantic Selectivity in Natural Images With Vision Transformers [5.265058307999745]
We introduce BrainSAIL, a method for isolating neurally-activating visual concepts in images.
BrainSAIL exploits semantically consistent, dense spatial features from pre-trained vision models.
We validate BrainSAIL on cortical regions with known category selectivity.
arXiv Detail & Related papers (2024-10-07T17:59:45Z) - MindSemantix: Deciphering Brain Visual Experiences with a Brain-Language Model [45.18716166499859]
Deciphering the human visual experience through brain activities captured by fMRI represents a compelling and cutting-edge challenge.
We introduce MindSemantix, a novel multi-modal framework that enables LLMs to comprehend visually-evoked semantic content in brain activity.
MindSemantix generates high-quality captions that are deeply rooted in the visual and semantic information derived from brain activity.
arXiv Detail & Related papers (2024-05-29T06:55:03Z) - Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models [71.93366651585275]
Large language models (LLMs) have exhibited impressive performance in language comprehension and various reasoning tasks.
We propose Visualization-of-Thought (VoT) to elicit spatial reasoning of LLMs by visualizing their reasoning traces.
VoT significantly enhances the spatial reasoning abilities of LLMs.
arXiv Detail & Related papers (2024-04-04T17:45:08Z) - Hierarchical Text-to-Vision Self Supervised Alignment for Improved Histopathology Representation Learning [64.1316997189396]
We present a novel language-tied self-supervised learning framework, Hierarchical Language-tied Self-Supervision (HLSS) for histopathology images.
Our resulting model achieves state-of-the-art performance on two medical imaging benchmarks, OpenSRH and TCGA datasets.
arXiv Detail & Related papers (2024-03-21T17:58:56Z) - BrainSCUBA: Fine-Grained Natural Language Captions of Visual Cortex Selectivity [6.285481522918523]
We introduce a data-driven method that generates natural language descriptions for images predicted to maximally activate individual voxels of interest.
We validate our method through fine-grained voxel-level captioning across higher-order visual regions.
To demonstrate how our method enables scientific discovery, we perform exploratory investigations on the distribution of "person" representations in the brain.
arXiv Detail & Related papers (2023-10-06T17:59:53Z) - MindGPT: Interpreting What You See with Non-invasive Brain Recordings [24.63828455553959]
We introduce a non-invasive neural decoder, termed as MindGPT, which interprets perceived visual stimuli into natural languages from fMRI signals.
Our experiments show that the generated word sequences truthfully represented the visual information conveyed in the seen stimuli.
arXiv Detail & Related papers (2023-09-27T15:35:20Z) - DeViL: Decoding Vision features into Language [53.88202366696955]
Post-hoc explanation methods have often been criticised for abstracting away the decision-making process of deep neural networks.
In this work, we would like to provide natural language descriptions for what different layers of a vision backbone have learned.
We train a transformer network to translate individual image features of any vision layer into a prompt that a separate off-the-shelf language model decodes into natural language.
arXiv Detail & Related papers (2023-09-04T13:59:55Z) - Visual representations in the human brain are aligned with large language models [7.779248296336383]
We show that large language models (LLMs) are beneficial for modelling the complex visual information extracted by the brain from natural scenes.
We then train deep neural network models to transform image inputs into LLM representations.
arXiv Detail & Related papers (2022-09-23T17:34:33Z) - Low-Dimensional Structure in the Space of Language Representations is
Reflected in Brain Responses [62.197912623223964]
We show a low-dimensional structure where language models and translation models smoothly interpolate between word embeddings, syntactic and semantic tasks, and future word embeddings.
We find that this representation embedding can predict how well each individual feature space maps to human brain responses to natural language stimuli recorded using fMRI.
This suggests that the embedding captures some part of the brain's natural language representation structure.
arXiv Detail & Related papers (2021-06-09T22:59:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.