Decoding fMRI Data into Captions using Prefix Language Modeling
- URL: http://arxiv.org/abs/2501.02570v1
- Date: Sun, 05 Jan 2025 15:06:25 GMT
- Title: Decoding fMRI Data into Captions using Prefix Language Modeling
- Authors: Vyacheslav Shen, Kassymzhomart Kunanbayev, Dae-Shik Kim,
- Abstract summary: We present an alternative method for decoding brain signals into image captions by predicting a DINOv2 model's embedding of an image from the corresponding fMRI signal.
We also explore 3D Convolutional Neural Network mapping of fMRI signals to image embedding space for better accounting positional information of voxels.
- Score: 3.4328283704703866
- License:
- Abstract: With the advancements in Large Language and Latent Diffusion models, brain decoding has achieved remarkable results in recent years. The works on the NSD dataset, with stimuli images from the COCO dataset, leverage the embeddings from the CLIP model for image reconstruction and GIT for captioning. However, the current captioning approach introduces the challenge of potential data contamination given that the GIT model was trained on the COCO dataset. In this work, we present an alternative method for decoding brain signals into image captions by predicting a DINOv2 model's embedding of an image from the corresponding fMRI signal and then providing its [CLS] token as the prefix to the GPT-2 language model which decreases computational requirements considerably. Additionally, instead of commonly used Linear Regression, we explore 3D Convolutional Neural Network mapping of fMRI signals to image embedding space for better accounting positional information of voxels.
Related papers
- Cover-separable Fixed Neural Network Steganography via Deep Generative Models [37.08937194546323]
We propose a Cover-separable Fixed Neural Network Steganography, namely Cs-FNNS.
In Cs-FNNS, we propose a Steganographic Perturbation Search (SPS) algorithm to directly encode the secret data into an imperceptible perturbation.
We demonstrate the superior performance of the proposed method in terms of visual quality and undetectability.
arXiv Detail & Related papers (2024-07-16T05:47:06Z) - Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
arXiv Detail & Related papers (2024-05-21T18:02:07Z) - Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation
for Grounding-Based Vision and Language Models [16.4010094165575]
We propose a robust phrase grounding model trained with text-conditioned and text-unconditioned data augmentations.
Inspired by recent masked signal reconstruction, we propose to use pixel-level masking as a novel form of data augmentation.
Our method demonstrates advanced performance over the state-of-the-arts with various metrics.
arXiv Detail & Related papers (2023-11-05T01:14:02Z) - CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training [6.292642131180376]
In this paper, we tackle the lack of image-text data in chest X-ray by expanding image-label pair as image-text pair via general prompt.
We also design two contrastive losses, named ICL and TCL, for learning study-level characteristics of medical images and reports.
Our model outperforms the state-of-the-art models trained under the same conditions.
arXiv Detail & Related papers (2023-10-20T05:44:55Z) - Disruptive Autoencoders: Leveraging Low-level features for 3D Medical
Image Pre-training [51.16994853817024]
This work focuses on designing an effective pre-training framework for 3D radiology images.
We introduce Disruptive Autoencoders, a pre-training framework that attempts to reconstruct the original image from disruptions created by a combination of local masking and low-level perturbations.
The proposed pre-training framework is tested across multiple downstream tasks and achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-07-31T17:59:42Z) - Vision-Language Modelling For Radiological Imaging and Reports In The
Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z) - Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality.
arXiv Detail & Related papers (2022-07-26T19:35:49Z) - Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
We present in this paper a novel scheme based on prompt to train the UIC model, making best use of the powerful generalization ability.
arXiv Detail & Related papers (2022-05-26T03:13:43Z) - SD-RSIC: Summarization Driven Deep Remote Sensing Image Captioning [0.0]
We present a novel Summarization Driven Remote Sensing Image Captioning (SD-RSIC) approach.
The proposed approach consists of three main steps. The first step obtains the standard image captions by jointly exploiting convolutional neural networks (CNNs) with long short-term memory (LSTM) networks.
The second step summarizes the ground-truth captions of each training image into a single caption by exploiting sequence to sequence neural networks and eliminates the redundancy present in the training set.
The third step automatically defines the adaptive weights associated to each RS image to combine the standard captions with the summarized captions based on the semantic
arXiv Detail & Related papers (2020-06-15T14:29:12Z) - XGPT: Cross-modal Generative Pre-Training for Image Captioning [80.26456233277435]
XGPT is a new method of Cross-modal Generative Pre-Training for Image Captioning.
It is designed to pre-train text-to-image caption generators through three novel generation tasks.
XGPT can be fine-tuned without any task-specific architecture modifications.
arXiv Detail & Related papers (2020-03-03T12:13:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.