Variational Information Pursuit with Large Language and Multimodal Models for Interpretable Predictions
- URL: http://arxiv.org/abs/2308.12562v1
- Date: Thu, 24 Aug 2023 05:04:10 GMT
- Title: Variational Information Pursuit with Large Language and Multimodal Models for Interpretable Predictions
- Authors: Kwan Ho Ryan Chan, Aditya Chattopadhyay, Benjamin David Haeffele, Rene Vidal
- Abstract summary: Variational Information Pursuit (V-IP) is a framework for making interpretable predictions by design.
Applying V-IP to any task requires data samples with dense concept-labeling by domain experts.
We extend the V-IP framework with Foundational Models (FMs) to address this limitation.
- Score: 9.07837207208113
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Variational Information Pursuit (V-IP) is a framework for making
interpretable predictions by design, by sequentially selecting a short chain of
task-relevant, user-defined and interpretable queries about the data that are
most informative for the task. While this allows for built-in interpretability
in predictive models, applying V-IP to any task requires data samples with
dense concept-labeling by domain experts, limiting the application of V-IP to
small-scale tasks where manual data annotation is feasible. In this work, we
extend the V-IP framework with Foundational Models (FMs) to address this
limitation. More specifically, we use a two-step process: first, we leverage
Large Language Models (LLMs) to generate a sufficiently large candidate set of
task-relevant interpretable concepts; then, we use Large Multimodal Models to
annotate each data sample by its semantic similarity to each concept in the
generated concept set. While other interpretable-by-design frameworks such as
Concept Bottleneck Models (CBMs) require an additional step of removing
repetitive and non-discriminative concepts to have good interpretability and
test performance, we mathematically and empirically justify that, with a
sufficiently informative and task-relevant query (concept) set, the proposed
FM+V-IP method does not require any type of concept filtering. In addition, we
show that FM+V-IP with LLM-generated concepts can achieve better test
performance than V-IP with human-annotated concepts, demonstrating the
effectiveness of LLMs at generating efficient query sets. Finally, when
compared to other interpretable-by-design frameworks such as CBMs, FM+V-IP can
achieve competitive test performance using fewer concepts/queries, with both
filtered and unfiltered concept sets.
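A minimal sketch may help make the two-step pipeline concrete: an LLM proposes candidate concepts, and a Large Multimodal Model (here CLIP via Hugging Face transformers, used as a stand-in) scores each sample against every concept by semantic similarity. The model name, the stubbed concept-generation step, and the example concepts below are illustrative assumptions, not necessarily the authors' exact choices.

```python
# Sketch of the two-step FM+V-IP annotation pipeline (assumptions noted above).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor


def generate_concepts(task_description: str) -> list[str]:
    # Step 1: placeholder for the LLM call that proposes task-relevant,
    # interpretable concepts; hard-coded here for illustration.
    return ["has wings", "has fur", "made of metal", "lives in water"]


@torch.no_grad()
def annotate(image_path: str, concepts: list[str],
             model_name: str = "openai/clip-vit-base-patch32") -> dict[str, float]:
    # Step 2: score the image against every concept with CLIP as the
    # Large Multimodal Model; a higher score means the concept is more
    # likely present in the image.
    model = CLIPModel.from_pretrained(model_name).eval()
    processor = CLIPProcessor.from_pretrained(model_name)
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=concepts, images=image, return_tensors="pt", padding=True)
    scores = model(**inputs).logits_per_image.squeeze(0)  # shape: (num_concepts,)
    return dict(zip(concepts, scores.tolist()))


if __name__ == "__main__":
    concepts = generate_concepts("bird species classification")
    print(annotate("example.jpg", concepts))
```

The per-concept scores then serve as the query answers consumed by V-IP; in practice one would batch the images and cache the concept text embeddings rather than re-encode them for every sample.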
Related papers
- Dual Prototype Evolving for Test-Time Generalization of Vision-Language Models [11.545127156146368]
We introduce Dual Prototype Evolving (DPE), a novel test-time adaptation approach for pre-trained vision-language models (VLMs).
We create and evolve two sets of prototypes--textual and visual--to progressively capture more accurate multi-modal representations for target classes during test time.
Our proposed DPE consistently outperforms previous state-of-the-art methods while also exhibiting competitive computational efficiency.
arXiv Detail & Related papers (2024-10-16T17:59:49Z)
- VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks [60.5257456681402]
We build universal embedding models capable of handling a wide range of downstream tasks.
Our contributions are twofold: (1) MMEB (Massive Multimodal Embedding Benchmark), which covers 4 meta-tasks (i.e. classification, visual question answering, multimodal retrieval, and visual grounding) and 36 datasets, including 20 training and 16 evaluation datasets, and (2) VLM2Vec (Vision-Language Model -> Vector), a contrastive training framework that converts any state-of-the-art vision-language model into an embedding model via training on MMEB.
arXiv Detail & Related papers (2024-10-07T16:14:05Z)
- Exploring the Transferability of Visual Prompting for Multimodal Large Language Models [47.162575147632396]
Transferable Visual Prompting (TVP) is a simple and effective approach to generate visual prompts that can transfer to different models and improve their performance on downstream tasks after being trained on only one model.
We introduce two strategies to address the issue of cross-model feature corruption of existing visual prompting methods and enhance the transferability of the learned prompts.
arXiv Detail & Related papers (2024-04-17T09:39:07Z)
- Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual tokens, which map visual features to probability distributions over the Large Multi-modal Model's vocabulary.
We further explore the distribution of visual features in the semantic space within LMM and the possibility of using text embeddings to represent visual information.
arXiv Detail & Related papers (2024-03-12T14:58:52Z)
- ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models [69.50316788263433]
We propose ProbVLM, a probabilistic adapter that estimates probability distributions for the embeddings of pre-trained vision-language models.
We quantify the calibration of embedding uncertainties in retrieval tasks and show that ProbVLM outperforms other methods.
We present a novel technique for visualizing the embedding distributions using a large-scale pre-trained latent diffusion model.
arXiv Detail & Related papers (2023-07-01T18:16:06Z)
- UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z)
- Variational Information Pursuit for Interpretable Predictions [8.894670614193677]
Variational Information Pursuit (V-IP) is a variational characterization of IP which bypasses the need for learning generative models.
V-IP finds much shorter query chains than reinforcement learning, which is typically used in sequential decision-making problems.
We demonstrate the utility of V-IP on challenging tasks such as medical diagnosis, where its performance is far superior to the generative-modelling approach (see the sketch of the query-selection loop after this list).
arXiv Detail & Related papers (2023-02-06T15:43:48Z)
- SimVLM: Simple Visual Language Model Pretraining with Weak Supervision [48.98275876458666]
We present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM).
SimVLM reduces the training complexity by exploiting large-scale weak supervision.
It achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks.
arXiv Detail & Related papers (2021-08-24T18:14:00Z)
- Multi-Task Learning for Dense Prediction Tasks: A Survey [87.66280582034838]
Multi-task learning (MTL) techniques have shown promising results with respect to performance, computation, and/or memory footprint.
We provide a well-rounded view on state-of-the-art deep learning approaches for MTL in computer vision.
arXiv Detail & Related papers (2020-04-28T09:15:50Z)
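As referenced in the V-IP entry above, the following is a minimal, generic sketch of the sequential query-selection loop at inference time. It assumes a trained querier network and classifier that take a masked history of query answers as input; the interfaces, the confidence-based stopping rule, and all names are illustrative assumptions rather than the paper's exact architecture.

```python
# Generic V-IP-style inference loop: ask the most informative query next,
# record its answer, and stop once the classifier is confident.
import torch


@torch.no_grad()
def predict_with_query_chain(querier, classifier, answer_fn,
                             num_queries, max_steps=10, stop_conf=0.99):
    history = torch.zeros(1, num_queries)   # answers to asked queries, 0 otherwise
    mask = torch.zeros(1, num_queries)      # 1 = query already asked
    chain = []
    for _ in range(max_steps):
        # Score all queries given the current (masked) history and pick the
        # highest-scoring query that has not been asked yet.
        scores = querier(torch.cat([history, mask], dim=-1))
        scores = scores.masked_fill(mask.bool(), float("-inf"))
        q = int(scores.argmax(dim=-1))
        history[0, q] = answer_fn(q)        # e.g. a CLIP concept score for this sample
        mask[0, q] = 1.0
        chain.append(q)
        probs = classifier(torch.cat([history, mask], dim=-1)).softmax(dim=-1)
        if probs.max().item() >= stop_conf:  # confident enough: stop asking
            break
    return chain, probs
```

Interpretability comes from the chain itself: the final prediction is explained by the short list of queried concepts and their answers.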