A Multimodal Visual Encoding Model Aided by Introducing Verbal Semantic
Information
- URL: http://arxiv.org/abs/2308.15142v1
- Date: Tue, 29 Aug 2023 09:21:48 GMT
- Title: A Multimodal Visual Encoding Model Aided by Introducing Verbal Semantic
Information
- Authors: Shuxiao Ma and Linyuan Wang and Bin Yan
- Abstract summary: Previous visual encoding models did not incorporate verbal semantic information, contradicting biological findings.
This paper proposes a multimodal visual information encoding network model based on stimulus images and associated textual information.
Experimental results demonstrate that the proposed multimodal visual information encoding network model outperforms previous models.
- Score: 5.142858130898767
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Biological research has revealed that the verbal semantic information in the
brain cortex, as an additional source, participates in nonverbal semantic
tasks, such as visual encoding. However, previous visual encoding models did
not incorporate verbal semantic information, contradicting this biological
finding. This paper proposes a multimodal visual information encoding network
model based on stimulus images and associated textual information in response
to this issue. Our visual information encoding network model takes stimulus
images as input and leverages textual information generated by an image-to-text
generation model as verbal semantic information. This approach injects new
information into the visual encoding model. Subsequently, a Transformer network
aligns image and text feature information, creating a multimodal feature space.
A convolutional network then maps from this multimodal feature space to voxel
space, constructing the multimodal visual information encoding network model.
Experimental results demonstrate that the proposed multimodal visual
information encoding network model outperforms previous models at the same
training cost. In voxel prediction of the left hemisphere of subject 1's brain,
the performance improves by approximately 15.87%, while in the right
hemisphere, the performance improves by about 4.6%. The multimodal visual
encoding network model exhibits superior encoding performance. Additionally,
ablation experiments indicate that our proposed model better simulates the
brain's visual information processing.
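As a rough illustration of the pipeline described in the abstract (features of the stimulus image and of a generated caption fused by a Transformer, then mapped to voxel space by a convolutional readout), here is a minimal PyTorch sketch. The backbones, feature dimensions, layer counts, and voxel count are placeholder assumptions, not the authors' configuration.

```python
# Minimal sketch of the multimodal visual encoding pipeline described above.
# All architectural details (feature extractors, dimensions, layer counts,
# voxel count) are illustrative placeholders, not the paper's configuration.
import torch
import torch.nn as nn

class MultimodalVoxelEncoder(nn.Module):
    def __init__(self, feat_dim=512, n_voxels=1024):  # n_voxels: placeholder
        super().__init__()
        # Projections standing in for pretrained image/text feature extractors.
        self.img_proj = nn.Linear(768, feat_dim)
        self.txt_proj = nn.Linear(768, feat_dim)
        # Transformer that aligns image and text tokens in a shared
        # multimodal feature space.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        # Convolutional readout mapping the fused features to voxel space.
        self.readout = nn.Sequential(
            nn.Conv1d(feat_dim, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(256, n_voxels),
        )

    def forward(self, img_tokens, txt_tokens):
        # img_tokens: (B, N_img, 768) features of the stimulus image
        # txt_tokens: (B, N_txt, 768) features of the generated caption
        x = torch.cat([self.img_proj(img_tokens),
                       self.txt_proj(txt_tokens)], dim=1)
        x = self.fusion(x)          # shared multimodal feature space
        x = x.transpose(1, 2)       # (B, feat_dim, N_tokens) for Conv1d
        return self.readout(x)      # (B, n_voxels) predicted voxel responses

model = MultimodalVoxelEncoder()
voxels = model(torch.randn(2, 49, 768), torch.randn(2, 20, 768))
print(voxels.shape)  # torch.Size([2, 1024])
```

In this sketch the readout pools over tokens before a linear layer; the actual model may map the multimodal features to voxels differently.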
Related papers
- LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models [60.67899965748755]
We present LLaVA-Read, a multimodal large language model that utilizes dual visual encoders along with a visual text encoder.
Our research suggests visual text understanding remains an open challenge and an efficient visual text encoder is crucial for future successful multimodal systems.
arXiv Detail & Related papers (2024-07-27T05:53:37Z)
- Enhancing Vision Models for Text-Heavy Content Understanding and Interaction [0.0]
We build a visual chat application integrating CLIP for image encoding and a model from the Massive Text Embedding Benchmark.
The project aims to enhance advanced vision models' capabilities in understanding complex, interconnected visual and textual data.
arXiv Detail & Related papers (2024-05-31T15:17:47Z)
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
- Modality-Agnostic fMRI Decoding of Vision and Language [4.837421245886033]
We introduce and use a new large-scale fMRI dataset (8,500 trials per subject) of people watching both images and text descriptions.
This novel dataset enables the development of modality-agnostic decoders: a single decoder that can predict which stimulus a subject is seeing.
We train and evaluate such decoders to map brain signals onto stimulus representations from a large range of publicly available vision, language and multimodal (vision+language) models.
arXiv Detail & Related papers (2024-03-18T13:30:03Z)
- Multimodal Neurons in Pretrained Text-Only Transformers [52.20828443544296]
We identify "multimodal neurons" that convert visual representations into corresponding text.
We show that multimodal neurons operate on specific visual concepts across inputs, and have a systematic causal effect on image captioning.
arXiv Detail & Related papers (2023-08-03T05:27:12Z)
- Brain encoding models based on multimodal transformers can transfer across language and vision [60.72020004771044]
We used representations from multimodal transformers to train encoding models that can transfer across fMRI responses to stories and movies.
We found that encoding models trained on brain responses to one modality can successfully predict brain responses to the other modality.
arXiv Detail & Related papers (2023-05-20T17:38:44Z)
- Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training [62.215025958347105]
We propose a self-supervised learning paradigm with multi-modal masked autoencoders.
We learn cross-modal domain knowledge by reconstructing missing pixels and tokens from randomly masked images and texts.
arXiv Detail & Related papers (2022-09-15T07:26:43Z)
- Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens (a minimal illustrative sketch of this retrieval-plus-attention idea appears after this list).
Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality.
arXiv Detail & Related papers (2022-07-26T19:35:49Z)
- Visio-Linguistic Brain Encoding [3.944020612420711]
We systematically explore the efficacy of image Transformers and multi-modal Transformers for brain encoding.
We find that VisualBERT, a multi-modal Transformer, significantly outperforms previously proposed single-mode CNNs.
The supremacy of visio-linguistic models raises the question of whether the responses elicited in the visual regions are affected implicitly by linguistic processing.
arXiv Detail & Related papers (2022-04-18T11:28:18Z)
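Two of the related papers above ("Towards Retrieval-Augmented Architectures for Image Captioning" and "Retrieval-Augmented Transformer for Image Captioning") describe a kNN memory that retrieves captions for visually similar images and lets the decoder attend over them. The sketch below, referenced in the second of those entries, illustrates that general idea only; the memory layout, embedding model, and dimensions are placeholder assumptions, not the papers' implementations.

```python
# Illustrative-only sketch of kNN retrieval followed by cross-attention
# over retrieved caption features; not the papers' actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KNNAugmentedAttention(nn.Module):
    def __init__(self, dim=512, k=4):
        super().__init__()
        self.k = k
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8,
                                                batch_first=True)

    def forward(self, decoder_states, query_img_emb,
                mem_img_embs, mem_caption_feats):
        # decoder_states:    (B, T, dim)   current decoder hidden states
        # query_img_emb:     (B, dim)      embedding of the query image
        # mem_img_embs:      (M, dim)      image embeddings in the external memory
        # mem_caption_feats: (M, L, dim)   caption features stored in the memory
        sims = (F.normalize(query_img_emb, dim=-1)
                @ F.normalize(mem_img_embs, dim=-1).T)        # (B, M)
        topk = sims.topk(self.k, dim=-1).indices              # (B, k)
        retrieved = mem_caption_feats[topk]                   # (B, k, L, dim)
        retrieved = retrieved.flatten(1, 2)                   # (B, k*L, dim)
        # Decoder states attend over the retrieved caption features.
        out, _ = self.cross_attn(decoder_states, retrieved, retrieved)
        return decoder_states + out

layer = KNNAugmentedAttention()
out = layer(torch.randn(2, 10, 512), torch.randn(2, 512),
            torch.randn(100, 512), torch.randn(100, 16, 512))
print(out.shape)  # torch.Size([2, 10, 512])
```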
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.