Related papers: A Concept-Based Explainability Framework for Large Multimodal Models

A Concept-Based Explainability Framework for Large Multimodal Models

URL: http://arxiv.org/abs/2406.08074v1
Date: Wed, 12 Jun 2024 10:48:53 GMT
Title: A Concept-Based Explainability Framework for Large Multimodal Models
Authors: Jayneel Parekh, Pegah Khayatan, Mustafa Shukor, Alasdair Newson, Matthieu Cord,
Abstract summary: We propose a dictionary learning based approach, applied to the representation of tokens. We show that these concepts are well semantically grounded in both vision and text. We show that the extracted multimodal concepts are useful to interpret representations of test samples.
Score: 52.37626977572413
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large multimodal models (LMMs) combine unimodal encoders and large language models (LLMs) to perform multimodal tasks. Despite recent advancements towards the interpretability of these models, understanding internal representations of LMMs remains largely a mystery. In this paper, we present a novel framework for the interpretation of LMMs. We propose a dictionary learning based approach, applied to the representation of tokens. The elements of the learned dictionary correspond to our proposed concepts. We show that these concepts are well semantically grounded in both vision and text. Thus we refer to these as "multi-modal concepts". We qualitatively and quantitatively evaluate the results of the learnt concepts. We show that the extracted multimodal concepts are useful to interpret representations of test samples. Finally, we evaluate the disentanglement between different concepts and the quality of grounding concepts visually and textually. We will publicly release our code.

Related papers

Can Large Vision-Language Models Understand Multimodal Sarcasm? [14.863320201956963]
Sarcasm is a complex linguistic phenomenon that involves a disparity between literal and intended meanings.<n>We evaluate Large Visual Language Models (LVLMs) in Multimodal Sarcasm Analysis (MSA) tasks.<n>We propose a training-free framework that integrates in-depth object extraction and external conceptual knowledge.
arXiv Detail & Related papers (2025-08-05T17:05:11Z)
V2C-CBM: Building Concept Bottlenecks with Vision-to-Concept Tokenizer [19.177297480709512]
Concept Bottleneck Models (CBMs) offer inherent interpretability by translating images into human-comprehensible concepts. Recent approaches have leveraged the knowledge of large language models to construct concept bottlenecks. In this study, we investigate to avoid these issues by constructing CBMs directly from multimodal models.
arXiv Detail & Related papers (2025-01-09T05:12:38Z)
Analyzing Fine-tuning Representation Shift for Multimodal LLMs Steering alignment [53.90425382758605]
We show how fine-tuning alters the internal structure of a model to specialize in new multimodal tasks. Our work sheds light on how multimodal representations evolve through fine-tuning and offers a new perspective for interpreting model adaptation in multimodal tasks.
arXiv Detail & Related papers (2025-01-06T13:37:13Z)
Large Concept Models: Language Modeling in a Sentence Representation Space [62.73366944266477]
We present an attempt at an architecture which operates on an explicit higher-level semantic representation, which we name a concept. Concepts are language- and modality-agnostic and represent a higher level idea or action in a flow. We show that our model exhibits impressive zero-shot generalization performance to many languages.
arXiv Detail & Related papers (2024-12-11T23:36:20Z)
Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models [55.25892137362187]
We introduce a new dataset featuring Multimodal Multi-Grained Concept annotations (MMGiC) for MLLMs. Our analyses reveal that multi-grained concept annotations integrate and complement each other, under our structured template and a general MLLM framework. We further validate our hypothesis by investigating the fair comparison and effective collaboration between MMGiC and image--caption data on 12 multimodal comprehension and generation benchmarks.
arXiv Detail & Related papers (2024-12-08T13:45:44Z)
MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception [24.406224705072763]
Mutually Reinforced Multimodal Large Language Model (MR-MLLM) is a novel framework that enhances visual perception and multimodal comprehension. First, a shared query fusion mechanism is proposed to harmonize detailed visual inputs from vision models with the linguistic depth of language models. Second, we propose the perception-enhanced cross-modal integration method, incorporating novel modalities from vision perception outputs.
arXiv Detail & Related papers (2024-06-22T07:10:36Z)
Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual tokens, which maps the visual features to probability distributions over Large Multi-modal Models' vocabulary. We further explore the distribution of visual features in the semantic space within LMM and the possibility of using text embeddings to represent visual information.
arXiv Detail & Related papers (2024-03-12T14:58:52Z)
M^2ConceptBase: A Fine-Grained Aligned Concept-Centric Multimodal Knowledge Base [61.53959791360333]
We introduce M2ConceptBase, the first concept-centric multimodal knowledge base (MMKB) We propose a context-aware approach to align concept-image and concept-description pairs using context information from image-text datasets. Human studies confirm more than 95% alignment accuracy, underscoring its quality.
arXiv Detail & Related papers (2023-12-16T11:06:11Z)
Interpreting Pretrained Language Models via Concept Bottlenecks [55.47515772358389]
Pretrained language models (PLMs) have made significant strides in various natural language processing tasks. The lack of interpretability due to their black-box'' nature poses challenges for responsible implementation. We propose a novel approach to interpreting PLMs by employing high-level, meaningful concepts that are easily understandable for humans.
arXiv Detail & Related papers (2023-11-08T20:41:18Z)
Towards Concept-Aware Large Language Models [56.48016300758356]
Concepts play a pivotal role in various human cognitive functions, including learning, reasoning and communication. There is very little work on endowing machines with the ability to form and reason with concepts. In this work, we analyze how well contemporary large language models (LLMs) capture human concepts and their structure.
arXiv Detail & Related papers (2023-11-03T12:19:22Z)
The Hidden Language of Diffusion Models [70.03691458189604]
We present Conceptor, a novel method to interpret the internal representation of a textual concept by a diffusion model. We find surprising visual connections between concepts, that transcend their textual semantics. We additionally discover concepts that rely on mixtures of exemplars, biases, renowned artistic styles, or a simultaneous fusion of multiple meanings.
arXiv Detail & Related papers (2023-06-01T17:57:08Z)
GlanceNets: Interpretabile, Leak-proof Concept-based Models [23.7625973884849]
Concept-based models (CBMs) combine high-performance and interpretability by acquiring and reasoning with a vocabulary of high-level concepts. We provide a clear definition of interpretability in terms of alignment between the model's representation and an underlying data generation process. We introduce GlanceNets, a new CBM that exploits techniques from disentangled representation learning and open-set recognition to achieve alignment.
arXiv Detail & Related papers (2022-05-31T08:53:53Z)
Discovering Latent Concepts Learned in BERT [21.760620298330235]
We study what latent concepts exist in the pre-trained BERT model. We also release a novel BERT ConceptNet dataset (BCN) consisting of 174 concept labels and 1M annotated instances.
arXiv Detail & Related papers (2022-05-15T09:45:34Z)
A First Look: Towards Explainable TextVQA Models via Visual and Textual Explanations [3.7638008383533856]
We propose MTXNet, an end-to-end trainable multimodal architecture to generate multimodal explanations. We show that training with multimodal explanations surpasses unimodal baselines by up to 7% in CIDEr scores and 2% in IoU. We also describe a real-world e-commerce application for using the generated multimodal explanations.
arXiv Detail & Related papers (2021-04-29T00:36:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.