A Concept-Based Explainability Framework for Large Multimodal Models
- URL: http://arxiv.org/abs/2406.08074v3
- Date: Sat, 30 Nov 2024 10:48:21 GMT
- Title: A Concept-Based Explainability Framework for Large Multimodal Models
- Authors: Jayneel Parekh, Pegah Khayatan, Mustafa Shukor, Alasdair Newson, Matthieu Cord,
- Abstract summary: We propose a dictionary learning based approach, applied to the representation of tokens.
We show that these concepts are well semantically grounded in both vision and text.
We show that the extracted multimodal concepts are useful to interpret representations of test samples.
- Score: 52.37626977572413
- License:
- Abstract: Large multimodal models (LMMs) combine unimodal encoders and large language models (LLMs) to perform multimodal tasks. Despite recent advancements towards the interpretability of these models, understanding internal representations of LMMs remains largely a mystery. In this paper, we present a novel framework for the interpretation of LMMs. We propose a dictionary learning based approach, applied to the representation of tokens. The elements of the learned dictionary correspond to our proposed concepts. We show that these concepts are well semantically grounded in both vision and text. Thus we refer to these as ``multi-modal concepts''. We qualitatively and quantitatively evaluate the results of the learnt concepts. We show that the extracted multimodal concepts are useful to interpret representations of test samples. Finally, we evaluate the disentanglement between different concepts and the quality of grounding concepts visually and textually. Our code is publicly available at https://github.com/mshukor/xl-vlms
Related papers
- V2C-CBM: Building Concept Bottlenecks with Vision-to-Concept Tokenizer [19.177297480709512]
Concept Bottleneck Models (CBMs) offer inherent interpretability by translating images into human-comprehensible concepts.
Recent approaches have leveraged the knowledge of large language models to construct concept bottlenecks.
In this study, we investigate to avoid these issues by constructing CBMs directly from multimodal models.
arXiv Detail & Related papers (2025-01-09T05:12:38Z) - Analyzing Fine-tuning Representation Shift for Multimodal LLMs Steering alignment [53.90425382758605]
We show how fine-tuning alters the internal structure of a model to specialize in new multimodal tasks.
Our work sheds light on how multimodal representations evolve through fine-tuning and offers a new perspective for interpreting model adaptation in multimodal tasks.
arXiv Detail & Related papers (2025-01-06T13:37:13Z) - Large Concept Models: Language Modeling in a Sentence Representation Space [62.73366944266477]
We present an attempt at an architecture which operates on an explicit higher-level semantic representation, which we name a concept.
Concepts are language- and modality-agnostic and represent a higher level idea or action in a flow.
We show that our model exhibits impressive zero-shot generalization performance to many languages.
arXiv Detail & Related papers (2024-12-11T23:36:20Z) - Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models [55.25892137362187]
We introduce a new dataset featuring Multimodal Multi-Grained Concept annotations (MMGiC) for MLLMs.
Our analyses reveal that multi-grained concept annotations integrate and complement each other, under our structured template and a general MLLM framework.
We further validate our hypothesis by investigating the fair comparison and effective collaboration between MMGiC and image--caption data on 12 multimodal comprehension and generation benchmarks.
arXiv Detail & Related papers (2024-12-08T13:45:44Z) - M^2ConceptBase: A Fine-Grained Aligned Concept-Centric Multimodal Knowledge Base [61.53959791360333]
We introduce M2ConceptBase, the first concept-centric multimodal knowledge base (MMKB)
We propose a context-aware approach to align concept-image and concept-description pairs using context information from image-text datasets.
Human studies confirm more than 95% alignment accuracy, underscoring its quality.
arXiv Detail & Related papers (2023-12-16T11:06:11Z) - The Hidden Language of Diffusion Models [70.03691458189604]
We present Conceptor, a novel method to interpret the internal representation of a textual concept by a diffusion model.
We find surprising visual connections between concepts, that transcend their textual semantics.
We additionally discover concepts that rely on mixtures of exemplars, biases, renowned artistic styles, or a simultaneous fusion of multiple meanings.
arXiv Detail & Related papers (2023-06-01T17:57:08Z) - A First Look: Towards Explainable TextVQA Models via Visual and Textual
Explanations [3.7638008383533856]
We propose MTXNet, an end-to-end trainable multimodal architecture to generate multimodal explanations.
We show that training with multimodal explanations surpasses unimodal baselines by up to 7% in CIDEr scores and 2% in IoU.
We also describe a real-world e-commerce application for using the generated multimodal explanations.
arXiv Detail & Related papers (2021-04-29T00:36:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.