A Concept-Based Explainability Framework for Large Multimodal Models
- URL: http://arxiv.org/abs/2406.08074v2
- Date: Sat, 23 Nov 2024 12:27:07 GMT
- Title: A Concept-Based Explainability Framework for Large Multimodal Models
- Authors: Jayneel Parekh, Pegah Khayatan, Mustafa Shukor, Alasdair Newson, Matthieu Cord
- Abstract summary: We propose a dictionary learning based approach, applied to the representation of tokens.
We show that these concepts are semantically well grounded in both vision and text.
We show that the extracted multimodal concepts are useful to interpret representations of test samples.
- Score: 52.37626977572413
- Abstract: Large multimodal models (LMMs) combine unimodal encoders and large language models (LLMs) to perform multimodal tasks. Despite recent advancements towards the interpretability of these models, understanding internal representations of LMMs remains largely a mystery. In this paper, we present a novel framework for the interpretation of LMMs. We propose a dictionary learning based approach, applied to the representation of tokens. The elements of the learned dictionary correspond to our proposed concepts. We show that these concepts are semantically well grounded in both vision and text, and thus refer to them as "multi-modal concepts". We qualitatively and quantitatively evaluate the learnt concepts. We show that the extracted multimodal concepts are useful to interpret representations of test samples. Finally, we evaluate the disentanglement between different concepts and the quality of grounding concepts visually and textually. Our implementation is publicly available.
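As a rough illustration of the approach (a minimal sketch, not the authors' implementation), the snippet below decomposes a matrix of token representations with off-the-shelf dictionary learning: the learned atoms play the role of concepts, and each token's sparse code gives its concept activations. All shapes, hyper-parameters, and the choice of sklearn's DictionaryLearning solver are assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)

# Stand-in for deep token representations extracted from an LMM layer:
# N tokens, each a d-dimensional hidden state (synthetic data here).
N, d, K = 500, 64, 10            # K = number of learned concepts (assumed)
token_reps = rng.normal(size=(N, d))

# Learn a dictionary whose K atoms play the role of concepts; the sparse
# codes give, for each token, how strongly every concept is activated.
dict_learner = DictionaryLearning(
    n_components=K,
    alpha=1.0,                    # sparsity weight, illustrative value
    transform_algorithm="lasso_lars",
    random_state=0,
)
codes = dict_learner.fit_transform(token_reps)   # (N, K) concept activations
concepts = dict_learner.components_              # (K, d) concept directions

# A test sample's representation can then be interpreted through its
# most activated concepts.
top = np.argsort(-np.abs(codes[0]))[:3]
print("Most activated concepts for the first token:", top)
```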
Related papers
- MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception [24.406224705072763]
Mutually Reinforced Multimodal Large Language Model (MR-MLLM) is a novel framework that enhances visual perception and multimodal comprehension.
First, a shared query fusion mechanism is proposed to harmonize detailed visual inputs from vision models with the linguistic depth of language models.
Second, we propose the perception-enhanced cross-modal integration method, incorporating novel modalities from vision perception outputs.
arXiv Detail & Related papers (2024-06-22T07:10:36Z)
- Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual tokens, which maps the visual features to probability distributions over Large Multi-modal Models' vocabulary.
We further explore the distribution of visual features in the semantic space within LMM and the possibility of using text embeddings to represent visual information.
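A minimal sketch of this mapping, under assumed dimensions and synthetic data (not the paper's code): a visual feature is scored against every row of a vocabulary embedding matrix and normalized with a softmax, giving a probability distribution over the LLM's tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 1000, 64          # assumed sizes, far smaller than a real LLM

vocab_embeddings = rng.normal(size=(vocab_size, d))  # stand-in LLM token embeddings
visual_feature = rng.normal(size=(d,))               # one projected visual feature

logits = vocab_embeddings @ visual_feature           # similarity to every vocab entry
logits -= logits.max()                               # numerical stability for softmax
probs = np.exp(logits) / np.exp(logits).sum()        # distribution over the vocabulary

top = np.argsort(-probs)[:5]
print("Most probable vocabulary indices for this visual token:", top)
```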
arXiv Detail & Related papers (2024-03-12T14:58:52Z)
- M^2ConceptBase: A Fine-Grained Aligned Concept-Centric Multimodal Knowledge Base [61.53959791360333]
We introduce M2ConceptBase, the first concept-centric multimodal knowledge base (MMKB).
We propose a context-aware approach to align concept-image and concept-description pairs using context information from image-text datasets.
Human studies confirm more than 95% alignment accuracy, underscoring its quality.
arXiv Detail & Related papers (2023-12-16T11:06:11Z)
- Interpreting Pretrained Language Models via Concept Bottlenecks [55.47515772358389]
Pretrained language models (PLMs) have made significant strides in various natural language processing tasks.
The lack of interpretability due to their "black-box" nature poses challenges for responsible implementation.
We propose a novel approach to interpreting PLMs by employing high-level, meaningful concepts that are easily understandable for humans.
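A toy sketch of the concept-bottleneck idea on synthetic data (illustrative only; the paper's actual architecture and training differ): PLM features are first mapped to human-interpretable concept scores, and the task prediction is made from those scores alone, so its weights can be read as per-concept contributions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

rng = np.random.default_rng(0)
N, d, n_concepts = 500, 64, 5

features = rng.normal(size=(N, d))                      # stand-in PLM embeddings
concepts = rng.binomial(1, 0.5, size=(N, n_concepts))   # toy human concept annotations
labels = (concepts.sum(axis=1) > 2).astype(int)         # toy task labels

# Stage 1: predict each interpretable concept from the PLM features.
concept_models = [Ridge().fit(features, concepts[:, j]) for j in range(n_concepts)]
concept_scores = np.column_stack([m.predict(features) for m in concept_models])

# Stage 2: predict the task label from the concept scores only, so the
# coefficients read directly as per-concept contributions to the decision.
task_model = LogisticRegression().fit(concept_scores, labels)
print("Per-concept contribution to the prediction:", task_model.coef_[0])
```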
arXiv Detail & Related papers (2023-11-08T20:41:18Z)
- Towards Concept-Aware Large Language Models [56.48016300758356]
Concepts play a pivotal role in various human cognitive functions, including learning, reasoning and communication.
There is very little work on endowing machines with the ability to form and reason with concepts.
In this work, we analyze how well contemporary large language models (LLMs) capture human concepts and their structure.
arXiv Detail & Related papers (2023-11-03T12:19:22Z)
- The Hidden Language of Diffusion Models [70.03691458189604]
We present Conceptor, a novel method to interpret the internal representation of a textual concept by a diffusion model.
We find surprising visual connections between concepts, that transcend their textual semantics.
We additionally discover concepts that rely on mixtures of exemplars, biases, renowned artistic styles, or a simultaneous fusion of multiple meanings.
arXiv Detail & Related papers (2023-06-01T17:57:08Z)
- GlanceNets: Interpretabile, Leak-proof Concept-based Models [23.7625973884849]
Concept-based models (CBMs) combine high-performance and interpretability by acquiring and reasoning with a vocabulary of high-level concepts.
We provide a clear definition of interpretability in terms of alignment between the model's representation and an underlying data generation process.
We introduce GlanceNets, a new CBM that exploits techniques from disentangled representation learning and open-set recognition to achieve alignment.
arXiv Detail & Related papers (2022-05-31T08:53:53Z)
- Discovering Latent Concepts Learned in BERT [21.760620298330235]
We study what latent concepts exist in the pre-trained BERT model.
We also release a novel BERT ConceptNet dataset (BCN) consisting of 174 concept labels and 1M annotated instances.
arXiv Detail & Related papers (2022-05-15T09:45:34Z)
- A First Look: Towards Explainable TextVQA Models via Visual and Textual Explanations [3.7638008383533856]
We propose MTXNet, an end-to-end trainable multimodal architecture to generate multimodal explanations.
We show that training with multimodal explanations surpasses unimodal baselines by up to 7% in CIDEr scores and 2% in IoU.
We also describe a real-world e-commerce application for using the generated multimodal explanations.
arXiv Detail & Related papers (2021-04-29T00:36:17Z)