DIME: Fine-grained Interpretations of Multimodal Models via Disentangled
Local Explanations
- URL: http://arxiv.org/abs/2203.02013v1
- Date: Thu, 3 Mar 2022 20:52:47 GMT
- Title: DIME: Fine-grained Interpretations of Multimodal Models via Disentangled
Local Explanations
- Authors: Yiwei Lyu, Paul Pu Liang, Zihao Deng, Ruslan Salakhutdinov,
Louis-Philippe Morency
- Abstract summary: We focus on advancing the state-of-the-art in interpreting multimodal models.
Our proposed approach, DIME, enables accurate and fine-grained analysis of multimodal models.
- Score: 119.1953397679783
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ability for a human to understand an Artificial Intelligence (AI) model's
decision-making process is critical in enabling stakeholders to visualize model
behavior, perform model debugging, promote trust in AI models, and assist in
collaborative human-AI decision-making. As a result, the research fields of
interpretable and explainable AI have gained traction within AI communities as
well as interdisciplinary scientists seeking to apply AI in their subject
areas. In this paper, we focus on advancing the state-of-the-art in
interpreting multimodal models - a class of machine learning methods that
tackle core challenges in representing and capturing interactions between
heterogeneous data sources such as images, text, audio, and time-series data.
Multimodal models have proliferated in numerous real-world applications across
healthcare, robotics, multimedia, affective computing, and human-computer
interaction. By performing model disentanglement into unimodal contributions
(UC) and multimodal interactions (MI), our proposed approach, DIME, enables
accurate and fine-grained analysis of multimodal models while maintaining
generality across arbitrary modalities, model architectures, and tasks. Through
a comprehensive suite of experiments on both synthetic and real-world
multimodal tasks, we show that DIME generates accurate disentangled
explanations, helps users of multimodal models gain a deeper understanding of
model behavior, and presents a step towards debugging and improving these
models for real-world deployment. Code for our experiments can be found at
https://github.com/lvyiwei1/DIME.
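The disentanglement described in the abstract can be illustrated with a small numerical sketch. The snippet below is an illustrative approximation of the idea, in the spirit of additive projection methods such as EMAP, and is not the official DIME implementation (see the linked repository). A two-modality model f(x1, x2) is projected onto its empirically-additive part, taken here as the unimodal contribution (UC), and the residual is treated as the multimodal interaction (MI). All function and variable names are assumptions for illustration only.

```python
# Hypothetical sketch (not the official DIME code): split a two-modality
# model's predictions into an additive, unimodal part (UC) and a residual
# interaction part (MI) over a grid of sampled inputs.
import numpy as np

def disentangle(f, X1, X2):
    """Split f into UC and MI parts over samples X1 (modality 1) and X2 (modality 2).

    f  : callable (x1, x2) -> scalar score
    X1 : sequence of modality-1 samples
    X2 : sequence of modality-2 samples
    Returns (uc, mi), each a (len(X1), len(X2)) array.
    """
    # Cross-evaluate the model on every (x1, x2) pair.
    scores = np.array([[f(x1, x2) for x2 in X2] for x1 in X1])

    # Empirically-additive projection:
    # uc[i, j] = mean_j f(x1_i, .) + mean_i f(., x2_j) - overall mean
    row_mean = scores.mean(axis=1, keepdims=True)
    col_mean = scores.mean(axis=0, keepdims=True)
    uc = row_mean + col_mean - scores.mean()

    # Whatever the additive projection cannot capture is attributed to
    # cross-modal interaction.
    mi = scores - uc
    return uc, mi

if __name__ == "__main__":
    # Toy model with a genuine interaction term (x1 * x2).
    model = lambda x1, x2: 2.0 * x1 + 0.5 * x2 + x1 * x2
    x1_samples = np.linspace(-1.0, 1.0, 5)
    x2_samples = np.linspace(-1.0, 1.0, 5)
    uc, mi = disentangle(model, x1_samples, x2_samples)
    print("UC part:\n", uc.round(2))
    print("MI part:\n", mi.round(2))
```

In the paper's pipeline, local perturbation-based explanations are then generated separately for the unimodal and interaction parts, which is what yields the fine-grained, disentangled explanations the abstract describes.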
Related papers
- COLLAGE: Collaborative Human-Agent Interaction Generation using Hierarchical Latent Diffusion and Language Models [14.130327598928778]
The proposed framework combines large language models (LLMs) with hierarchical motion-specific vector-quantized variational autoencoders (VQ-VAEs).
Our framework generates realistic and diverse collaborative human-object-human interactions, outperforming state-of-the-art methods.
Our work opens up new possibilities for modeling complex interactions in various domains, such as robotics, graphics and computer vision.
arXiv Detail & Related papers (2024-09-30T17:02:13Z)
- HEMM: Holistic Evaluation of Multimodal Foundation Models [91.60364024897653]
Multimodal foundation models can holistically process text alongside images, video, audio, and other sensory modalities.
It is challenging to characterize and study progress in multimodal foundation models, given the range of possible modeling decisions, tasks, and domains.
arXiv Detail & Related papers (2024-07-03T18:00:48Z)
- From Efficient Multimodal Models to World Models: A Survey [28.780451336834876]
Multimodal Large Models (MLMs) are becoming a significant research focus combining powerful language models with multimodal learning.
This review explores the latest developments and challenges in MLMs, emphasizing their potential in achieving artificial general intelligence.
arXiv Detail & Related papers (2024-06-27T15:36:43Z)
- Foundations of Multisensory Artificial Intelligence [32.56967614091527]
This thesis aims to advance the machine learning foundations of multisensory AI.
In the first part, we present a theoretical framework formalizing how modalities interact with each other to give rise to new information for a task.
In the second part, we study the design of practical multimodal foundation models that generalize over many modalities and tasks.
arXiv Detail & Related papers (2024-04-29T14:45:28Z)
- LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models [50.259006481656094]
We present a novel interactive application aimed towards understanding the internal mechanisms of large vision-language models.
Our interface is designed to enhance the interpretability of the image patches, which are instrumental in generating an answer.
We present a case study of how our application can aid in understanding failure mechanisms in a popular large multi-modal model: LLaVA.
arXiv Detail & Related papers (2024-04-03T23:57:34Z)
- MONAL: Model Autophagy Analysis for Modeling Human-AI Interactions [11.972017738888825]
We propose Model Autophagy Analysis (MONAL) to explain the self-consumption behavior of large models.
MONAL employs two distinct autophagous loops to elucidate the suppression of human-generated information in the exchange between human and AI systems.
We evaluate the capacities of generated models as both creators and disseminators of information.
arXiv Detail & Related papers (2024-02-17T13:02:54Z)
- An Interactive Agent Foundation Model [49.77861810045509]
We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents.
Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction.
We demonstrate the performance of our framework across three separate domains -- Robotics, Gaming AI, and Healthcare.
arXiv Detail & Related papers (2024-02-08T18:58:02Z)
- Foundation Models for Decision Making: Problems, Methods, and Opportunities [124.79381732197649]
Foundation models pretrained on diverse data at scale have demonstrated extraordinary capabilities in a wide range of vision and language tasks.
New paradigms are emerging for training foundation models to interact with other agents and perform long-term reasoning.
Research at the intersection of foundation models and decision making holds tremendous promise for creating powerful new systems.
arXiv Detail & Related papers (2023-03-07T18:44:07Z)
- MultiViz: An Analysis Benchmark for Visualizing and Understanding Multimodal Models [103.9987158554515]
MultiViz is a method for analyzing the behavior of multimodal models by scaffolding the problem of interpretability into 4 stages.
We show that the complementary stages in MultiViz together enable users to simulate model predictions, assign interpretable concepts to features, perform error analysis on model misclassifications, and use insights from error analysis to debug models.
arXiv Detail & Related papers (2022-06-30T18:42:06Z)