Identifying Multi-modal Knowledge Neurons in Pretrained Transformers via Two-stage Filtering
- URL: http://arxiv.org/abs/2503.22941v1
- Date: Sat, 29 Mar 2025 02:16:15 GMT
- Title: Identifying Multi-modal Knowledge Neurons in Pretrained Transformers via Two-stage Filtering
- Authors: Yugen Sato, Tomohiro Takagi
- Abstract summary: We propose a method to identify neurons associated with specific knowledge using MiniGPT-4, a Transformer-based MLLM. Experiments on the image caption generation task showed that our method is able to locate knowledge with higher accuracy than existing methods.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in large language models (LLMs) have led to the development of multimodal LLMs (MLLMs) in the fields of natural language processing (NLP) and computer vision. Although these models allow for integrated visual and language understanding, they present challenges such as opaque internal processing and the generation of hallucinations and misinformation. There is therefore a need for a method to clarify the location of knowledge in MLLMs. In this study, we propose a method to identify neurons associated with specific knowledge using MiniGPT-4, a Transformer-based MLLM. Specifically, we extract knowledge neurons through two stages: activation-difference filtering using inpainting and gradient-based filtering using GradCAM. Experiments on the image caption generation task with the MS COCO 2017 dataset, evaluated quantitatively with BLEU, ROUGE, and BERTScore and qualitatively with activation heatmaps, showed that our method locates knowledge more accurately than existing methods. This study contributes to the visualization and explainability of knowledge in MLLMs and shows the potential for future knowledge editing and control.
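The two-stage selection described above can be made concrete with a minimal sketch. The snippet below assumes pre-recorded FFN activations from the model for an original image and for an inpainted image with the target object removed, plus GradCAM-style per-neuron scores; the relative thresholds, tensor shapes, and the intersection of the two filters are illustrative assumptions, not the authors' implementation.

```python
import torch

def two_stage_filter(acts_orig, acts_inpaint, grad_scores,
                     act_ratio=0.5, grad_ratio=0.5):
    """Return (layer, neuron) pairs surviving both filtering stages.

    acts_orig, acts_inpaint: per-layer FFN activations of shape [tokens, neurons],
        recorded for the original and the inpainted image (hypothetical recordings).
    grad_scores: per-layer GradCAM-style attribution scores of shape [neurons].
    """
    selected = []
    for layer, (a, b, g) in enumerate(zip(acts_orig, acts_inpaint, grad_scores)):
        # Stage 1: keep neurons whose activation changes most when the
        # target object is removed by inpainting.
        diff = (a - b).abs().mean(dim=0)          # per-neuron activation gap
        stage1 = diff > act_ratio * diff.max()
        # Stage 2: keep neurons with large gradient-based attribution.
        stage2 = g > grad_ratio * g.max()
        for idx in torch.nonzero(stage1 & stage2).flatten().tolist():
            selected.append((layer, idx))
    return selected

# Toy usage with random tensors standing in for recorded activations and scores.
layers, tokens, neurons = 4, 16, 64
acts_o = [torch.rand(tokens, neurons) for _ in range(layers)]
acts_i = [torch.rand(tokens, neurons) for _ in range(layers)]
grads = [torch.rand(neurons) for _ in range(layers)]
print(two_stage_filter(acts_o, acts_i, grads)[:5])
```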
Related papers
- Detecting Knowledge Boundary of Vision Large Language Models by Sampling-Based Inference [78.08901120841833]
We propose a method to detect the knowledge boundary of Visual Large Language Models (VLLMs).
We show that our method successfully depicts a VLLM's knowledge boundary, based on which we can reduce indiscriminate retrieval while maintaining or improving performance.
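The summary does not spell out the sampling criterion, so the sketch below shows one plausible consistency-based reading of sampling-based boundary detection: sample several answers and treat low agreement as a signal to fall back to retrieval. The sample count, agreement threshold, and majority-vote criterion are assumptions for illustration, not the paper's method.

```python
import random
from collections import Counter

def beyond_knowledge_boundary(generate, question, n_samples=8, agree_threshold=0.6):
    """Hypothetical criterion: if sampled answers disagree, treat the question
    as lying outside the model's knowledge boundary and consult a retriever."""
    answers = [generate(question) for _ in range(n_samples)]  # temperature > 0 sampling
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / n_samples < agree_threshold            # True -> trigger retrieval

# Toy usage with a fake sampler standing in for a VLLM.
fake = lambda q: "Paris" if "France" in q else random.choice(["cat", "dog", "bird"])
print(beyond_knowledge_boundary(fake, "What is the capital of France?"))      # False
print(beyond_knowledge_boundary(fake, "What animal is shown in the image?"))  # usually True
```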
arXiv Detail & Related papers (2025-02-25T09:32:08Z) - Scaling Large Vision-Language Models for Enhanced Multimodal Comprehension In Biomedical Image Analysis [0.1984949535188529]
Vision language models (VLMs) address this by incorporating a pretrained vision backbone for processing images and a cross-modal projector. We developed intelligent assistants finetuned from LLaVA models to enhance multimodal understanding in low-dose radiation therapy.
arXiv Detail & Related papers (2025-01-26T02:48:01Z) - Beyond Binary: Towards Fine-Grained LLM-Generated Text Detection via Role Recognition and Involvement Measurement [51.601916604301685]
Large language models (LLMs) generate content that can undermine trust in online discourse.
Current methods often focus on binary classification, failing to address the complexities of real-world scenarios like human-LLM collaboration.
To move beyond binary classification and address these challenges, we propose a new paradigm for detecting LLM-generated content.
arXiv Detail & Related papers (2024-10-18T08:14:10Z) - RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, which shows their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z) - LLM4Brain: Training a Large Language Model for Brain Video Understanding [9.294352205183726]
We introduce an LLM-based approach for reconstructing visual-semantic information from fMRI signals elicited by video stimuli.
We employ fine-tuning techniques on an fMRI encoder equipped with adaptors to transform brain responses into latent representations aligned with the video stimuli.
In particular, we integrate self-supervised domain adaptation methods to enhance the alignment between visual-semantic information and brain responses.
arXiv Detail & Related papers (2024-09-26T15:57:08Z) - Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z) - How Multi-Modal LLMs Reshape Visual Deep Learning Testing? A Comprehensive Study Through the Lens of Image Mutation [23.18635769949329]
Visual deep learning (VDL) systems have shown significant success in real-world applications like image recognition, object detection, and autonomous driving. To evaluate the reliability of VDL, a mainstream approach is software testing, which requires diverse mutations over image semantics. The rapid development of multi-modal large language models (MLLMs) has introduced revolutionary image mutation potentials through instruction-driven methods.
arXiv Detail & Related papers (2024-04-22T07:41:41Z) - Backward Lens: Projecting Language Model Gradients into the Vocabulary Space [94.85922991881242]
We show that a gradient matrix can be cast as a low-rank linear combination of its forward and backward passes' inputs.
We then develop methods to project these gradients into vocabulary items and explore the mechanics of how new information is stored in the LMs' neurons.
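The low-rank claim can be seen from the standard backpropagation identity for a single linear layer (a textbook derivation, not the paper's notation). For $y_t = W x_t$ applied at token positions $t = 1, \dots, T$:
$$
\frac{\partial \mathcal{L}}{\partial W} \;=\; \sum_{t=1}^{T} \frac{\partial \mathcal{L}}{\partial y_t}\, x_t^{\top},
$$
a sum of $T$ rank-one outer products of the backward-pass vectors $\partial \mathcal{L} / \partial y_t$ and the forward-pass inputs $x_t$, so the rank of the gradient matrix is at most $T$.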
arXiv Detail & Related papers (2024-02-20T09:57:08Z) - LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge [58.82222646803248]
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals.
Most of the existing MLLMs mainly adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge.
We propose a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge in two levels.
arXiv Detail & Related papers (2023-11-20T15:56:44Z) - From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models [36.41816380074965]
We investigate the effectiveness of different vision encoders within Multimodal Large Language Models (MLLMs).
Our findings reveal that the shallow layer features of CLIP offer particular advantages for fine-grained tasks such as grounding and region understanding.
We propose a simple yet effective feature merging strategy, named COMM, that integrates CLIP and DINO with Multi-level features Merging.
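The merging strategy is named only at a high level here, so the following PyTorch sketch shows one generic way to fuse multi-level CLIP features with DINO features via learned per-level weights. The layer choices, softmax-weighted fusion, and all dimensions are illustrative assumptions, not COMM's published design.

```python
import torch
import torch.nn as nn

class MultiLevelMerge(nn.Module):
    """Fuse features from several CLIP layers and a DINO encoder into one
    visual token stream (hypothetical design for illustration)."""
    def __init__(self, clip_dims, dino_dim, out_dim):
        super().__init__()
        self.clip_proj = nn.ModuleList([nn.Linear(d, out_dim) for d in clip_dims])
        self.dino_proj = nn.Linear(dino_dim, out_dim)
        self.level_weights = nn.Parameter(torch.zeros(len(clip_dims) + 1))

    def forward(self, clip_feats, dino_feat):
        # clip_feats: list of [batch, tokens, dim_i] from different CLIP layers
        # dino_feat:  [batch, tokens, dino_dim]
        levels = [proj(f) for proj, f in zip(self.clip_proj, clip_feats)]
        levels.append(self.dino_proj(dino_feat))
        weights = torch.softmax(self.level_weights, dim=0)    # learned per-level mix
        return sum(w * lv for w, lv in zip(weights, levels))  # [batch, tokens, out_dim]

# Toy usage with random features.
merge = MultiLevelMerge(clip_dims=[768, 768], dino_dim=1024, out_dim=4096)
clip_feats = [torch.randn(1, 16, 768) for _ in range(2)]
dino_feat = torch.randn(1, 16, 1024)
print(merge(clip_feats, dino_feat).shape)  # torch.Size([1, 16, 4096])
```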
arXiv Detail & Related papers (2023-10-13T02:41:55Z) - Graph Neural Prompting with Large Language Models [32.97391910476073]
Graph Neural Prompting (GNP) is a novel plug-and-play method to assist pre-trained language models in learning beneficial knowledge from knowledge graphs.
Extensive experiments on multiple datasets demonstrate the superiority of GNP on both commonsense and biomedical reasoning tasks.
arXiv Detail & Related papers (2023-09-27T06:33:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.