Aligned with LLM: a new multi-modal training paradigm for encoding fMRI
activity in visual cortex
- URL: http://arxiv.org/abs/2401.03851v1
- Date: Mon, 8 Jan 2024 12:30:23 GMT
- Title: Aligned with LLM: a new multi-modal training paradigm for encoding fMRI
activity in visual cortex
- Authors: Shuxiao Ma, Linyuan Wang, Senbao Hou, Bin Yan
- Abstract summary: Recently, there has been a surge in the popularity of pre-trained large language models (LLMs).
This paper proposes a new multi-modal training paradigm that aligns with LLMs for encoding fMRI activity in visual cortex.
- Score: 4.57590454144072
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, there has been a surge in the popularity of pre-trained large
language models (LLMs) (such as GPT-4), sweeping across the entire Natural
Language Processing (NLP) and Computer Vision (CV) communities. These LLMs have
demonstrated advanced multi-modal understanding capabilities and showcased
strong performance across various benchmarks. LLMs have started to embody
traits of artificial general intelligence, which offers valuable guidance for
enhancing brain-like characteristics within visual encoding models. Hence, this
paper proposes a new multi-modal training paradigm, aligned with LLMs, for
encoding fMRI activity in visual cortex. Based on this paradigm, we trained an
encoding model on fMRI data named the LLM-Visual Encoding Model (LLM-VEM).
Specifically, we utilize an LLM (miniGPT4) to generate descriptive text for all
stimulus images, forming a high-quality textual description set. Moreover, we
use the pre-trained text encoder (CLIP) to process these detailed descriptions,
obtaining the text embedding features. Next, we use a contrastive loss function
to minimize the distance between the image embedding features and the text
embedding features, completing the alignment of each stimulus image with its
textual description. With the assistance of the pre-trained LLM, this
alignment process facilitates better learning of the visual encoding model,
resulting in higher precision. The experimental results indicate that our
training paradigm significantly enhances the performance of the visual
encoding model.
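The image-text alignment step described in the abstract can be illustrated with a short sketch. The following is a minimal, illustrative example (not the authors' released code), assuming the miniGPT4 captions have already been generated and that a hypothetical visual encoding model produces one image embedding per stimulus; it uses OpenAI's CLIP text encoder and a symmetric contrastive (InfoNCE-style) loss to pull matching image and text embeddings together.

```python
# Minimal sketch of the alignment step, under assumed names:
# `image_features` would come from the visual encoding model being trained,
# and `captions` are the miniGPT4-generated descriptions of the stimulus images.
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

def alignment_loss(image_features: torch.Tensor,
                   captions: list[str],
                   temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss that pulls each image embedding toward the CLIP
    text embedding of its own caption and away from the other captions in the batch."""
    tokens = clip.tokenize(captions, truncate=True).to(device)
    with torch.no_grad():  # the pre-trained text encoder stays frozen
        text_features = clip_model.encode_text(tokens).float()

    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    logits = image_features @ text_features.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(len(captions), device=device)       # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

How this alignment term is weighted against the fMRI prediction objective is an implementation detail not specified in the abstract; the sketch only shows the contrastive part.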
Related papers
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z) - Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z) - Factorized Learning Assisted with Large Language Model for Gloss-free Sign Language Translation [28.648494997132925]
We propose Factorized Learning assisted with Large Language Model (FLa-LLM) for gloss-free Sign Language Translation (SLT).
We factorize the training process into two stages. In the visual initialing stage, we employ a lightweight translation model after the visual encoder to pre-train the visual encoder.
In the LLM fine-tuning stage, we freeze the acquired knowledge in the visual encoder and integrate it with a pre-trained LLM to inspire the LLM's translation potential.
arXiv Detail & Related papers (2024-03-19T09:00:23Z) - Enhancing Visual Document Understanding with Contrastive Learning in
Large Visual-Language Models [56.76307866160105]
We propose a contrastive learning framework, termed Document Object COntrastive learning (DoCo).
DoCo leverages an auxiliary multimodal encoder to obtain the features of document objects and align them to the visual features generated by the vision encoder of Large Visual-Language Models (LVLMs).
We demonstrate that the proposed DoCo serves as a plug-and-play pre-training method, which can be employed in the pre-training of various LVLMs without inducing any increase in computational complexity during the inference process.
arXiv Detail & Related papers (2024-02-29T10:17:27Z) - Incorporating Visual Experts to Resolve the Information Loss in
Multimodal Large Language Models [121.83413400686139]
This paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism.
We introduce a novel method that incorporates multi-task encoders and visual tools into the existing MLLMs training and inference pipeline.
arXiv Detail & Related papers (2024-01-06T02:02:34Z) - Frozen Transformers in Language Models Are Effective Visual Encoder Layers [26.759544759745648]
Large language models (LLMs) are surprisingly strong encoders for purely visual tasks in the absence of language.
Our work pushes the boundaries of leveraging LLMs for computer vision tasks.
We propose the information filtering hypothesis to explain the effectiveness of pre-trained LLMs in visual encoding.
arXiv Detail & Related papers (2023-10-19T17:59:05Z) - GPT4Image: Can Large Pre-trained Models Help Vision Models on Perception
Tasks? [51.22096780511165]
We present a new learning paradigm in which the knowledge extracted from large pre-trained models is utilized to help models like CNN and ViT learn enhanced representations.
We feed detailed descriptions into a pre-trained encoder to extract text embeddings with rich semantic information that encodes the content of images.
arXiv Detail & Related papers (2023-06-01T14:02:45Z) - LLM-CXR: Instruction-Finetuned LLM for CXR Image Understanding and Generation [51.08810811457617]
Vision-language alignment in LLMs is actively being researched to enable multimodal reasoning and visual I/O.
We develop a method for instruction-tuning an LLM only on text to gain vision-language capabilities for medical images.
Our model, LLM-CXR, trained with this approach, shows better image-text alignment in both CXR understanding and generation tasks.
arXiv Detail & Related papers (2023-05-19T07:44:39Z)