Variational Information Pursuit with Large Language and Multimodal
Models for Interpretable Predictions
- URL: http://arxiv.org/abs/2308.12562v1
- Date: Thu, 24 Aug 2023 05:04:10 GMT
- Title: Variational Information Pursuit with Large Language and Multimodal
Models for Interpretable Predictions
- Authors: Kwan Ho Ryan Chan, Aditya Chattopadhyay, Benjamin David Haeffele, Rene
Vidal
- Abstract summary: Variational Information Pursuit (V-IP) is a framework for making interpretable predictions by design.
Applying V-IP to any task requires data samples with dense concept-labeling by domain experts.
We extend the V-IP framework with Foundational Models (FMs) to address this limitation.
- Score: 9.07837207208113
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Variational Information Pursuit (V-IP) is a framework for making
interpretable predictions by design: it sequentially selects a short chain of
task-relevant, user-defined, and interpretable queries about the data that are
most informative for the task. While this allows for built-in interpretability
in predictive models, applying V-IP to any task requires data samples with
dense concept-labeling by domain experts, limiting the application of V-IP to
small-scale tasks where manual data annotation is feasible. In this work, we
extend the V-IP framework with Foundational Models (FMs) to address this
limitation. More specifically, we use a two-step process, by first leveraging
Large Language Models (LLMs) to generate a sufficiently large candidate set of
task-relevant interpretable concepts, then using Large Multimodal Models to
annotate each data sample by semantic similarity with each concept in the
generated concept set. While other interpretable-by-design frameworks such as
Concept Bottleneck Models (CBMs) require an additional step of removing
repetitive and non-discriminative concepts to have good interpretability and
test performance, we mathematically and empirically justify that, with a
sufficiently informative and task-relevant query (concept) set, the proposed
FM+V-IP method does not require any type of concept filtering. In addition, we
show that FM+V-IP with LLM generated concepts can achieve better test
performance than V-IP with human annotated concepts, demonstrating the
effectiveness of LLMs at generating efficient query sets. Finally, when
compared to other interpretable-by-design frameworks such as CBMs, FM+V-IP can
achieve competitive test performance using fewer concepts/queries, with both
filtered and unfiltered concept sets.
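The two-step FM pipeline described in the abstract can be sketched in a few lines. The sketch below is a hypothetical illustration, not the paper's implementation: the LLM call is stubbed with a fixed concept list, and a toy word-overlap score stands in for the multimodal (e.g. CLIP-style) semantic similarity used for annotation.

```python
import math

def generate_concepts(task: str) -> list[str]:
    # Step 1 (stub): the paper prompts an LLM to propose a large candidate set
    # of task-relevant concepts; a fixed toy set for bird classification
    # stands in here.
    return ["has red feathers", "has a long beak", "swims on water"]

def similarity(a: str, b: str) -> float:
    # Toy word-overlap cosine standing in for a multimodal model's
    # image-text similarity score.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / math.sqrt(len(wa) * len(wb))

def annotate(sample: str, concepts: list[str], threshold: float = 0.5) -> dict:
    # Step 2: score every concept against the sample and binarize, yielding
    # dense concept labels without manual expert annotation.
    return {c: similarity(sample, c) >= threshold for c in concepts}

concepts = generate_concepts("bird species classification")
labels = annotate("a duck swims on water near the shore", concepts)
# Only "swims on water" crosses the threshold for this caption.
```

The resulting concept labels are what V-IP then queries sequentially; in the actual method the sample is an image and the similarity comes from a pretrained multimodal encoder rather than word overlap.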
Related papers
- Advancing Visual Large Language Model for Multi-granular Versatile Perception [31.78788398688894]
We present MVP-LM, a Multi-granular and Versatile Perception framework incorporating a Visual Large Language Model. Our framework is designed to integrate both word-based and sentence-based perception tasks alongside box and mask predictions. MVP-LM features an innovative multi-granularity decoder in conjunction with a CoT-inspired dataset unification strategy.
arXiv Detail & Related papers (2025-07-22T04:09:14Z) - Reinforcing Multimodal Understanding and Generation with Dual Self-rewards [56.08202047680044]
Large language models (LLMs) unify cross-modal understanding and generation into a single framework. Current solutions require external supervision (e.g., human feedback or reward models) and only address unidirectional tasks. We introduce a self-supervised dual reward mechanism to reinforce the understanding and generation capabilities of LMMs.
arXiv Detail & Related papers (2025-06-09T17:38:45Z) - Rethinking Information Synthesis in Multimodal Question Answering A Multi-Agent Perspective [42.832839189236694]
We propose MAMMQA, a multi-agent QA framework for multimodal inputs spanning text, tables, and images. Our system includes two Visual Language Model (VLM) agents and one text-based Large Language Model (LLM) agent. Experiments on diverse multimodal QA benchmarks demonstrate that our cooperative, multi-agent framework consistently outperforms existing baselines in both accuracy and robustness.
arXiv Detail & Related papers (2025-05-27T07:23:38Z) - Multi2: Multi-Agent Test-Time Scalable Framework for Multi-Document Processing [43.75154489681047]
We propose a novel framework leveraging test-time scaling for Multi-Document Summarization (MDS). Our approach employs prompt ensemble techniques to generate multiple candidate summaries using various prompts, then combines them with an aggregator to produce a refined summary. To evaluate our method effectively, we also introduce two new LLM-based metrics: the Consistency-Aware Preference (CAP) score and the LLM Atom-Content-Unit (LLM-ACU) score.
arXiv Detail & Related papers (2025-02-27T23:34:47Z) - Multi-Faceted Multimodal Monosemanticity [42.64636740703632]
We take a data-driven approach to analyze interpretable, monosemantic features extracted from deep multimodal models. Specifically, we investigate CLIP, a prominent visual-language representation model trained on massive image-text pairs. We develop a set of multi-modal interpretability tools and measures designed to disentangle and analyze features learned from CLIP.
arXiv Detail & Related papers (2025-02-16T14:51:07Z) - Dual Prototype Evolving for Test-Time Generalization of Vision-Language Models [11.545127156146368]
We introduce Dual Prototype Evolving (DPE), a novel test-time adaptation approach for pre-trained vision-language models (VLMs).
We create and evolve two sets of prototypes--textual and visual--to progressively capture more accurate multi-modal representations for target classes during test time.
Our proposed DPE consistently outperforms previous state-of-the-art methods while also exhibiting competitive computational efficiency.
arXiv Detail & Related papers (2024-10-16T17:59:49Z) - VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks [60.5257456681402]
We build universal embedding models capable of handling a wide range of downstream tasks.
Our contributions are twofold: (1) MMEB (Massive Multimodal Embedding Benchmark), which covers 4 meta-tasks (i.e. classification, visual question answering, multimodal retrieval, and visual grounding) and 36 datasets, including 20 training and 16 evaluation datasets, and (2) VLM2Vec (Vision-Language Model -> Vector), a contrastive training framework that converts any state-of-the-art vision-language model into an embedding model via training on MMEB.
arXiv Detail & Related papers (2024-10-07T16:14:05Z) - Exploring the Transferability of Visual Prompting for Multimodal Large Language Models [47.162575147632396]
Transferable Visual Prompting (TVP) is a simple and effective approach to generate visual prompts that can transfer to different models and improve their performance on downstream tasks after being trained on only one model.
We introduce two strategies to address the issue of cross-model feature corruption of existing visual prompting methods and enhance the transferability of the learned prompts.
arXiv Detail & Related papers (2024-04-17T09:39:07Z) - Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual tokens, which maps the visual features to probability distributions over Large Multi-modal Models' vocabulary.
We further explore the distribution of visual features in the semantic space within LMM and the possibility of using text embeddings to represent visual information.
arXiv Detail & Related papers (2024-03-12T14:58:52Z) - ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models [69.50316788263433]
We propose ProbVLM, a probabilistic adapter that estimates probability distributions for the embeddings of pre-trained vision-language models.
We quantify the calibration of embedding uncertainties in retrieval tasks and show that ProbVLM outperforms other methods.
We present a novel technique for visualizing the embedding distributions using a large-scale pre-trained latent diffusion model.
arXiv Detail & Related papers (2023-07-01T18:16:06Z) - UniDiff: Advancing Vision-Language Models with Generative and
Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z) - Variational Information Pursuit for Interpretable Predictions [8.894670614193677]
Variational Information Pursuit (V-IP) is a variational characterization of IP which bypasses the need for learning generative models.
V-IP finds much shorter query chains than reinforcement learning, which is typically used in sequential decision-making problems.
We demonstrate the utility of V-IP on challenging tasks like medical diagnosis where the performance is far superior to the generative modelling approach.
arXiv Detail & Related papers (2023-02-06T15:43:48Z) - SimVLM: Simple Visual Language Model Pretraining with Weak Supervision [48.98275876458666]
We present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM)
SimVLM reduces the training complexity by exploiting large-scale weak supervision.
It achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks.
arXiv Detail & Related papers (2021-08-24T18:14:00Z) - Multi-Task Learning for Dense Prediction Tasks: A Survey [87.66280582034838]
Multi-task learning (MTL) techniques have shown promising results w.r.t. performance, computations and/or memory footprint.
We provide a well-rounded view on state-of-the-art deep learning approaches for MTL in computer vision.
arXiv Detail & Related papers (2020-04-28T09:15:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.