MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model
- URL: http://arxiv.org/abs/2406.11193v2
- Date: Tue, 01 Oct 2024 17:04:22 GMT
- Title: MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model
- Authors: Jiahao Huo, Yibo Yan, Boren Hu, Yutao Yue, Xuming Hu
- Abstract summary: We identify domain-specific neurons in multimodal large language models.
We propose a three-stage mechanism for language model modules in MLLMs when handling projected image features.
- Score: 11.91010815015959
- License:
- Abstract: Projecting visual features into the word embedding space has become a significant fusion strategy adopted by Multimodal Large Language Models (MLLMs). However, its internal mechanisms have yet to be explored. Inspired by multilingual research, we identify domain-specific neurons in multimodal large language models. Specifically, we investigate the distribution of domain-specific neurons and the mechanism of how MLLMs process features from diverse domains. Furthermore, we propose a three-stage mechanism for language model modules in MLLMs when handling projected image features, and verify this hypothesis using the logit lens. Extensive experiments indicate that while current MLLMs exhibit Visual Question Answering (VQA) capability, they may not fully utilize domain-specific information. Properly manipulating domain-specific neurons changes accuracy by at most 10%, shedding light on the development of cross-domain, all-encompassing MLLMs in the future. The source code is available at https://github.com/Z1zs/MMNeuron.
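As a rough, hedged illustration of the pipeline described in the abstract (a minimal sketch, not the authors' released implementation; the function names, the use of FFN activation counts, and the selection ratio are assumptions), domain-specific neurons can be scored by how concentrated their activation probability is across domains, and intermediate hidden states can be probed with a logit-lens style decoder:

```python
import torch

def domain_specific_neurons(activation_counts, token_counts, top_ratio=0.01):
    """Hypothetical sketch: select FFN neurons whose activation probability is
    concentrated in few domains (low entropy across domains).

    activation_counts: [n_domains, n_neurons] counts of tokens on which each
        neuron activated (e.g. activation > 0), per domain.
    token_counts: [n_domains] number of tokens observed per domain.
    """
    p_act = activation_counts / token_counts.unsqueeze(1)       # [D, N] activation prob.
    p_dom = p_act / (p_act.sum(dim=0, keepdim=True) + 1e-9)     # normalize over domains
    entropy = -(p_dom * (p_dom + 1e-9).log()).sum(dim=0)        # [N]; low = domain-specific
    k = max(1, int(top_ratio * entropy.numel()))
    return torch.topk(-entropy, k).indices                      # indices of selected neurons

@torch.no_grad()
def logit_lens(hidden_state, final_norm, lm_head):
    """Sketch of a logit-lens probe: decode an intermediate hidden state
    directly through the model's final norm and unembedding head."""
    return lm_head(final_norm(hidden_state)).softmax(dim=-1)
```

Zeroing or rescaling the selected neurons' activations at inference time and re-running VQA evaluation is then one way to probe how much the model actually relies on them, consistent with the at-most-10% accuracy shift reported above.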
Related papers
- Virgo: A Preliminary Exploration on Reproducing o1-like MLLM [89.50691075011429]
Slow-thinking reasoning systems have garnered widespread attention by scaling the thinking time during inference.
There is also growing interest in adapting this capability to multimodal large language models (MLLMs).
In this paper, we explore a straightforward approach by fine-tuning a capable MLLM with a small amount of textual long-form thought data.
We find that these long-form reasoning processes, expressed in natural language, can be effectively transferred to MLLMs.
arXiv Detail & Related papers (2025-01-03T17:14:16Z)
- Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models [14.594698598522797]
Demonstrating feature universality allows discoveries about latent representations to generalize across several models.
We employ a method known as dictionary learning to transform LLM activations into more interpretable spaces spanned by neurons corresponding to individual features.
Our experiments reveal significant similarities in SAE feature spaces across various LLMs, providing new evidence for feature universality.
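As a hedged illustration of the dictionary-learning setup mentioned here (a toy sketch only; the layer choice, dictionary width, and L1 coefficient are assumptions rather than details from that paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder over LLM hidden states (dictionary learning).
    Dimensions and the L1 coefficient are illustrative assumptions."""

    def __init__(self, d_model=4096, d_dict=16384, l1_coeff=1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model, bias=False)
        self.l1_coeff = l1_coeff

    def forward(self, h):                       # h: [batch, d_model] activations
        f = F.relu(self.encoder(h))             # sparse feature activations
        h_hat = self.decoder(f)                 # reconstruction of the hidden state
        loss = F.mse_loss(h_hat, h) + self.l1_coeff * f.abs().mean()
        return f, h_hat, loss
```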
arXiv Detail & Related papers (2024-10-09T15:18:57Z)
- M2QA: Multi-domain Multilingual Question Answering [63.191474328757366]
Generalization and robustness to input variation are core desiderata of machine learning research.
We introduce M2QA, a multi-domain multilingual question answering benchmark.
M2QA includes 13,500 SQuAD 2.0-style question-answer instances in German, Turkish, and Chinese for the domains of product reviews, news, and creative writing.
arXiv Detail & Related papers (2024-07-01T08:48:49Z)
- Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora.
But can these models relate corresponding concepts across languages, effectively being crosslingual?
This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z)
- Probing Multimodal Large Language Models for Global and Local Semantic Representations [57.25949445963422]
We study which layers of Multimodal Large Language Models contribute most to encoding global image information.
In this study, we find that the intermediate layers of models can encode more global semantic information.
We find that the topmost layers may excessively focus on local information, leading to a diminished ability to encode global information.
arXiv Detail & Related papers (2024-02-27T08:27:15Z)
- Cross-Modal Projection in Multimodal LLMs Doesn't Really Project Visual Attributes to Textual Space [22.658906986091544]
Multimodal large language models (MLLMs) enable general-purpose conversations about images with the language modality.
As off-the-shelf MLLMs may have limited capabilities on images from domains like dermatology and agriculture, they must be fine-tuned to unlock domain-specific applications.
This study offers a potential reinterpretation of the role of cross-modal projections in MLLM architectures.
arXiv Detail & Related papers (2024-02-26T18:56:48Z)
- Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models [117.20416338476856]
Large language models (LLMs) demonstrate remarkable multilingual capabilities without being pre-trained on specially curated multilingual parallel corpora.
We propose a novel detection method, language activation probability entropy (LAPE), to identify language-specific neurons within LLMs.
Our findings indicate that LLMs' proficiency in processing a particular language is predominantly due to a small subset of neurons.
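For reference, the language activation probability entropy can be written as follows (a paraphrase of the idea, not that paper's exact notation), where $p_{i,j}$ is the probability that neuron $i$ activates on tokens of language $j$; the domain-level entropy sketched earlier mirrors this quantity with domains in place of languages:

$$\tilde{p}_{i,j} = \frac{p_{i,j}}{\sum_{k} p_{i,k}}, \qquad \mathrm{LAPE}_i = -\sum_{j} \tilde{p}_{i,j} \log \tilde{p}_{i,j}$$

Neurons with the lowest entropy, i.e. those that fire almost exclusively for one language, are treated as language-specific.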
arXiv Detail & Related papers (2024-02-26T09:36:05Z)
- Visual Question Answering Instruction: Unlocking Multimodal Large Language Model To Domain-Specific Visual Multitasks [0.8192907805418583]
We develop a method to transform domain-specific visual and vision-language datasets into a unified question answering format called Visual Question Answering Instruction (VQA-IN).
The proposed method achieves high scores on domain-specific visual tasks while maintaining its performance on vision-language tasks in a multitask manner.
arXiv Detail & Related papers (2024-02-13T10:40:53Z)
- The Curious Case of Nonverbal Abstract Reasoning with Multi-Modal Large Language Models [19.213774611556]
Multi-modal large language models (MLLMs) integrate verbal and visual information.
Despite the transformative potential of MLLMs, our understanding of their reasoning abilities remains limited.
In this study, we assess the nonverbal abstract reasoning abilities of open-source and closed-source MLLMs.
arXiv Detail & Related papers (2024-01-22T16:57:05Z)