Bridge the Modality and Capability Gaps in Vision-Language Model Selection
- URL: http://arxiv.org/abs/2403.13797v2
- Date: Sat, 02 Nov 2024 03:14:39 GMT
- Title: Bridge the Modality and Capability Gaps in Vision-Language Model Selection
- Authors: Chao Yi, Yu-Hang He, De-Chuan Zhan, Han-Jia Ye
- Abstract summary: Vision Language Models (VLMs) excel in zero-shot image classification by pairing images with textual category names.
To better reuse the VLM resource, a promising strategy is selecting appropriate Pre-Trained VLMs from the VLM Zoo.
We analyze two inherent challenges in assessing the ability of a VLM in this Language-Only VLM selection.
We propose VLM Selection With gAp Bridging (SWAB) to mitigate the negative impact of the two gaps.
- Score: 62.26769826687365
- Abstract: Vision Language Models (VLMs) excel in zero-shot image classification by pairing images with textual category names. The expanding variety of Pre-Trained VLMs enhances the likelihood of identifying a suitable VLM for specific tasks. To better reuse the VLM resource and fully leverage its potential on different zero-shot image classification tasks, a promising strategy is selecting appropriate Pre-Trained VLMs from the VLM Zoo, relying solely on the text data of the target dataset without access to the dataset's images. In this paper, we analyze two inherent challenges in assessing the ability of a VLM in this Language-Only VLM selection: the "Modality Gap" - the disparity in a VLM's embeddings across the two modalities, which makes text a less reliable substitute for images; and the "Capability Gap" - the discrepancy between a VLM's overall ranking and its ranking on the target dataset, which hinders direct prediction of a model's dataset-specific performance from its general performance. We propose VLM Selection With gAp Bridging (SWAB) to mitigate the negative impact of these two gaps. SWAB first adopts optimal transport to capture the relevance between open-source and target datasets with a transportation matrix. It then uses this matrix to transfer useful statistics of VLMs from open-source datasets to the target dataset, bridging the two gaps. By bridging the two gaps to obtain better substitutes for test images, SWAB can accurately predict the performance ranking of different VLMs on the target task without needing the dataset's images. Experiments across various VLMs and image classification datasets validate SWAB's effectiveness.
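The sketch below illustrates the gap-bridging idea in the abstract: compute an optimal-transport plan between class-name text embeddings of open-source and target datasets, then use it to carry per-class statistics (e.g., modality-gap vectors) over to the target classes. It is a minimal illustration under assumed inputs, not SWAB's actual implementation; the function names `sinkhorn` and `transfer_statistics` and the random stand-in embeddings are hypothetical.

```python
# Minimal sketch, assuming precomputed class-name text embeddings for the
# open-source and target datasets plus per-class statistics measured on
# open-source data. Not the paper's code.
import numpy as np

def sinkhorn(a, b, cost, reg=0.05, n_iters=200):
    """Entropy-regularized optimal transport via Sinkhorn iterations."""
    K = np.exp(-cost / reg)                      # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]           # transportation matrix

def transfer_statistics(src_text_emb, tgt_text_emb, src_stats):
    """Carry per-class statistics from open-source classes to target classes."""
    # Cost: cosine distance between class-name embeddings of the two datasets.
    src = src_text_emb / np.linalg.norm(src_text_emb, axis=1, keepdims=True)
    tgt = tgt_text_emb / np.linalg.norm(tgt_text_emb, axis=1, keepdims=True)
    cost = 1.0 - src @ tgt.T
    a = np.full(len(src), 1.0 / len(src))        # uniform marginals
    b = np.full(len(tgt), 1.0 / len(tgt))
    T = sinkhorn(a, b, cost)                     # relevance between classes
    # Barycentric mapping: each target class inherits a transport-weighted mix
    # of source-class statistics, e.g. gap vectors that shift text embeddings
    # toward the image-embedding region (bridging the modality gap).
    weights = T / T.sum(axis=0, keepdims=True)   # column-normalize over sources
    return weights.T @ src_stats                 # (num_tgt_classes, dim)

# Example with random stand-ins for text embeddings and gap vectors.
rng = np.random.default_rng(0)
src_emb, tgt_emb = rng.normal(size=(50, 512)), rng.normal(size=(10, 512))
src_gap_vectors = rng.normal(size=(50, 512))
tgt_gap_vectors = transfer_statistics(src_emb, tgt_emb, src_gap_vectors)
print(tgt_gap_vectors.shape)                     # (10, 512)
```

The same transport matrix could, in principle, weight other per-class quantities such as per-class accuracies when estimating a VLM's dataset-specific ranking.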
Related papers
- Pre-Trained Vision-Language Model Selection and Reuse for Downstream Tasks [48.67303250592189]
This paper proposes a novel paradigm to select and reuse VLMs for downstream tasks, called Model Label Learning (MLL).
The proposal is highly computationally efficient and growable, since the model labeling process is completed independently of the target task.
arXiv Detail & Related papers (2025-01-30T11:10:46Z)
- Learning to Rank Pre-trained Vision-Language Models for Downstream Tasks [41.488394198111976]
Vision language models (VLMs) like CLIP show stellar zero-shot capability on classification benchmarks.
However, selecting the VLM with the highest performance on an unlabeled downstream task is non-trivial.
This paper introduces the problem of unsupervised vision-language model selection, where only unsupervised downstream datasets are available.
arXiv Detail & Related papers (2024-12-30T03:26:53Z)
- Multimodal Fact-Checking with Vision Language Models: A Probing Classifier based Solution with Embedding Strategies [0.9217021281095907]
This study evaluates the effectiveness of Vision Language Models (VLMs) in representing and utilizing multimodal content for fact-checking.
We show that while multimodality can enhance performance, fusing separate embeddings from text and image encoders yields superior results compared to using VLM embeddings.
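A minimal sketch of this fused-embedding probing setup is given below: embeddings from separate image and text encoders are concatenated and a lightweight classifier is trained on top. The random arrays stand in for real encoder outputs (e.g., a CLIP vision tower and text tower); this is an illustrative assumption, not the paper's exact pipeline.

```python
# Minimal sketch, assuming precomputed image and text embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, img_dim, txt_dim = 1000, 512, 512
image_emb = rng.normal(size=(n_samples, img_dim))   # from an image encoder
text_emb = rng.normal(size=(n_samples, txt_dim))    # from a text encoder
labels = rng.integers(0, 2, size=n_samples)         # e.g., supported / refuted

# Fuse by concatenation rather than relying on a single joint VLM embedding.
fused = np.concatenate([image_emb, text_emb], axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    fused, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```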
arXiv Detail & Related papers (2024-12-06T16:13:19Z)
- Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers [79.45405711339322]
Generative Large Multimodal Models (LMMs) excel at a wide variety of vision-language (VL) tasks such as image captioning or visual question answering.
We propose an approach for finding features in the model's latent space to more effectively leverage LMMs for discriminative tasks.
arXiv Detail & Related papers (2024-11-28T18:55:41Z)
- The Solution for CVPR2024 Foundational Few-Shot Object Detection Challenge [14.330962576584446]
This report introduces an enhanced method for the Foundational Few-Shot Object Detection (FSOD) task, leveraging the vision-language model (VLM) for object detection.
We propose the VLM+ framework, which integrates a multimodal large language model (MM-LLM) to generate referential expressions for each category.
We use these referential expressions to generate pseudo-labels for all images in the training set and then combine them with the original labeled data to fine-tune the VLM.
arXiv Detail & Related papers (2024-06-18T03:03:02Z)
- Concept-skill Transferability-based Data Selection for Large Vision-Language Models [56.0725292404808]
We introduce COINCIDE, an effective and scalable data selection technique for training vision-language models.
We cluster the training data using internal activations from a small model, which identifies concept-skill compositions needed by a target LVLM.
Experiments demonstrate that COINCIDE achieves superior performance and data selection efficiency against 8 strong baselines.
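The sketch below is loosely in the spirit of this activation-based selection: cluster training examples by a small model's internal activations, then sample from clusters to cover diverse concept-skill compositions within a budget. The activations, cluster count, and budget are assumed stand-ins, not COINCIDE's actual configuration.

```python
# Minimal sketch, assuming precomputed hidden activations from a small model.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
activations = rng.normal(size=(5000, 256))   # hidden states for training examples
n_clusters, budget = 50, 500                 # assumed cluster count / selection budget

kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(activations)

# Pick roughly equal numbers of examples per cluster so the selected subset
# spans many concept-skill clusters rather than a few dominant ones.
selected = []
per_cluster = budget // n_clusters
for c in range(n_clusters):
    members = np.flatnonzero(kmeans.labels_ == c)
    chosen = rng.choice(members, size=min(per_cluster, len(members)), replace=False)
    selected.extend(chosen.tolist())
print("selected", len(selected), "examples")
```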
arXiv Detail & Related papers (2024-06-16T16:15:20Z)
- Why are Visually-Grounded Language Models Bad at Image Classification? [39.76294811955341]
We revisit the image classification task using visually-grounded language models (VLMs) such as GPT-4V and LLaVA.
We find that existing proprietary and public VLMs significantly underperform CLIP on standard image classification benchmarks like ImageNet.
Our analysis reveals that the primary cause is data-related: critical information for image classification is encoded in the VLM's latent space but can only be effectively decoded with enough training data.
arXiv Detail & Related papers (2024-05-28T17:57:06Z)
- CLAMP: Contrastive LAnguage Model Prompt-tuning [89.96914454453791]
We show that large language models can achieve good image classification performance when adapted via contrastive prompt-tuning.
Our approach beats state-of-the-art mLLMs by 13% and slightly outperforms contrastive learning with a custom text model.
arXiv Detail & Related papers (2023-12-04T05:13:59Z)
- ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models [69.50316788263433]
We propose ProbVLM, a probabilistic adapter that estimates probability distributions for the embeddings of pre-trained vision-language models.
We quantify the calibration of embedding uncertainties in retrieval tasks and show that ProbVLM outperforms other methods.
We present a novel technique for visualizing the embedding distributions using a large-scale pre-trained latent diffusion model.
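Below is a minimal sketch of a probabilistic adapter over frozen embeddings: a small head predicts a mean and variance for each embedding and is trained with a Gaussian negative log-likelihood. ProbVLM itself uses a richer parametric form; this simplified Gaussian variant with random stand-in embeddings only illustrates the general idea.

```python
# Minimal sketch, assuming frozen VLM embeddings are available as tensors.
import torch
import torch.nn as nn

class ProbabilisticAdapter(nn.Module):
    def __init__(self, dim=512, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, dim)        # predicted mean embedding
        self.logvar_head = nn.Linear(hidden, dim)    # predicted log-variance

    def forward(self, emb):
        h = self.backbone(emb)
        return self.mu_head(h), self.logvar_head(h).exp()

adapter = ProbabilisticAdapter()
opt = torch.optim.Adam(adapter.parameters(), lr=1e-3)
nll = nn.GaussianNLLLoss()

frozen_emb = torch.randn(64, 512)                    # stand-in for VLM embeddings
mu, var = adapter(frozen_emb)
loss = nll(mu, frozen_emb, var)                      # uncertainty-aware reconstruction
loss.backward()
opt.step()
print("nll:", loss.item())
```

At test time, the predicted variance can serve as a per-embedding uncertainty estimate, which is the kind of calibration signal evaluated in the retrieval experiments above.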
arXiv Detail & Related papers (2023-07-01T18:16:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.