Bridge the Modality and Capacity Gaps in Vision-Language Model Selection
- URL: http://arxiv.org/abs/2403.13797v1
- Date: Wed, 20 Mar 2024 17:54:58 GMT
- Title: Bridge the Modality and Capacity Gaps in Vision-Language Model Selection
- Authors: Chao Yi, De-Chuan Zhan, Han-Jia Ye
- Abstract summary: Vision Language Models (VLMs) excel in zero-shot image classification by pairing images with textual category names.
A promising zero-shot image classification strategy is selecting the most appropriate Pre-Trained VLM from the VLM Zoo.
We analyze two inherent challenges in assessing the ability of a VLM in this Language-Only VLM selection.
We propose VLM Selection With gAp Bridging (SWAB) to mitigate the negative impact of these two gaps.
- Score: 60.049430086731846
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Language Models (VLMs) excel in zero-shot image classification by pairing images with textual category names. The expanding variety of Pre-Trained VLMs increases the likelihood of finding a suitable VLM for a specific task. Thus, a promising zero-shot image classification strategy is selecting the most appropriate Pre-Trained VLM from the VLM Zoo, relying solely on the text data of the target dataset without access to the dataset's images. In this paper, we analyze two inherent challenges in assessing a VLM's ability in this Language-Only VLM selection: the "Modality Gap" -- the disparity in a VLM's embeddings across the two modalities, which makes text a less reliable substitute for images; and the "Capability Gap" -- the discrepancy between a VLM's overall ranking and its ranking on the target dataset, which hinders directly predicting a model's dataset-specific performance from its general performance. We propose VLM Selection With gAp Bridging (SWAB) to mitigate the negative impact of these two gaps. SWAB first uses optimal transport to capture the relevance between open-source datasets and the target dataset via a transportation matrix. It then uses this matrix to transfer useful statistics of VLMs from the open-source datasets to the target dataset, bridging the two gaps and improving the estimation of each VLM's capacity for VLM selection. Experiments across various VLMs and image classification datasets validate SWAB's effectiveness.
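The abstract describes SWAB's two-step mechanism: estimate a transportation matrix between open-source and target categories with optimal transport, then use that matrix to carry per-category statistics of each VLM over to the target dataset. Below is a minimal, illustrative sketch of the idea under stated assumptions; the toy Sinkhorn solver, the cosine cost on class-name text embeddings, and the choice of per-class modality-gap vectors as the transferred statistic are assumptions for exposition, not the authors' released implementation.

```python
# Illustrative sketch only (not the SWAB code): couple open-source and target
# categories with entropy-regularized optimal transport over text embeddings,
# then transfer per-category statistics (here: modality-gap vectors).
import numpy as np

def sinkhorn(a, b, cost, reg=0.05, n_iters=200):
    """Return a transport matrix with row marginals a and column marginals b."""
    K = np.exp(-cost / reg)          # Gibbs kernel from the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def cosine_cost(X, Y):
    """Cost = 1 - cosine similarity between two sets of embeddings."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return 1.0 - Xn @ Yn.T

rng = np.random.default_rng(0)
src_text = rng.normal(size=(5, 512))   # text embeddings of open-source class names
tgt_text = rng.normal(size=(3, 512))   # text embeddings of target class names
src_gap = rng.normal(size=(5, 512))    # per-class statistic measured on open-source data,
                                       # e.g. mean(image embedding) - text embedding

# Uniform marginals over categories; the cost couples semantically similar classes.
T = sinkhorn(np.full(5, 1 / 5), np.full(3, 1 / 3), cosine_cost(src_text, tgt_text))

# Barycentric transfer: each target class inherits a transport-weighted average of
# the open-source statistics, which is then used to adjust its text embedding so it
# better stands in for the unavailable target images when ranking VLMs.
tgt_gap_est = (T / T.sum(axis=0, keepdims=True)).T @ src_gap
proxy_tgt_image_emb = tgt_text + tgt_gap_est
```

The transport-weighted average is where the bridging happens: target classes borrow statistics from the open-source classes they are most strongly coupled to, which is the intuition the abstract gives for closing the modality and capability gaps before estimating each VLM's capacity.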
Related papers
- The Solution for CVPR2024 Foundational Few-Shot Object Detection Challenge [14.330962576584446]
This report introduces an enhanced method for the Foundational Few-Shot Object Detection (FSOD) task, leveraging the vision-language model (VLM) for object detection.
We propose the VLM+ framework, which integrates a multimodal large language model (MM-LLM) to generate referential expressions for each category.
We use these referential expressions to generate pseudo-labels for all images in the training set and then combine them with the original labeled data to fine-tune the VLM.
arXiv Detail & Related papers (2024-06-18T03:03:02Z) - Concept-skill Transferability-based Data Selection for Large Vision-Language Models [56.0725292404808]
We introduce COINCIDE, an effective and scalable data selection technique for training vision-language models.
We cluster the training data using internal activations from a small model, which identifies concept-skill compositions needed by a target LVLM.
Experiments demonstrate that COINCIDE achieves superior performance and data selection efficiency against 8 strong baselines.
arXiv Detail & Related papers (2024-06-16T16:15:20Z) - OpenDAS: Domain Adaptation for Open-Vocabulary Segmentation [54.98688607911399]
We introduce a new task: domain adaptation for open-vocabulary segmentation.
We propose an approach that combines parameter-efficient prompt tuning with a triplet-loss-based training strategy.
Our results outperform other parameter-efficient adaptation strategies in open-vocabulary segment classification tasks.
arXiv Detail & Related papers (2024-05-30T15:16:06Z) - Why are Visually-Grounded Language Models Bad at Image Classification? [39.76294811955341]
We revisit the image classification task using visually-grounded language models (VLMs) such as GPT-4V and LLaVA.
We find that existing proprietary and public VLMs significantly underperform CLIP on standard image classification benchmarks like ImageNet.
Our analysis reveals that the primary cause is data-related: critical information for image classification is encoded in the VLM's latent space but can only be effectively decoded with enough training data.
arXiv Detail & Related papers (2024-05-28T17:57:06Z) - Your Vision-Language Model Itself Is a Strong Filter: Towards High-Quality Instruction Tuning with Data Selection [59.11430077029321]
We introduce a novel dataset selection method, Self-Filter, for vision-language models (VLMs).
In the first stage, we devise a scoring network to evaluate the difficulty of training instructions, which is co-trained with the VLM.
In the second stage, we use the trained score net to measure the difficulty of each instruction, select the most challenging samples, and penalize similar samples to encourage diversity (a minimal sketch of this selection step appears after the related-papers list).
arXiv Detail & Related papers (2024-02-19T20:08:48Z) - CLAMP: Contrastive LAnguage Model Prompt-tuning [89.96914454453791]
We show that large language models can achieve good image classification performance when adapted with contrastive prompt-tuning (CLAMP).
Our approach beats state-of-the-art mLLMs by 13% and slightly outperforms contrastive learning with a custom text model.
arXiv Detail & Related papers (2023-12-04T05:13:59Z) - Leveraging Vision-Language Models for Improving Domain Generalization in Image Classification [35.277880733198586]
Vision-Language Models (VLMs) are trained on large amounts of image-text pairs, resulting in remarkable generalization across several data distributions.
We propose Vision-Language to Vision - Align, Distill, Predict (VL2V-ADiP), which first aligns the vision and language modalities of the teacher model with the vision modality of a pre-trained student model.
This maximally retains the pre-trained features of the student, while also incorporating the rich representations of the VLM image encoder and the superior generalization of the text embeddings.
arXiv Detail & Related papers (2023-10-12T11:59:54Z) - TAP: Targeted Prompting for Task Adaptive Generation of Textual Training Instances for Visual Classification [28.72126911321771]
Vision and Language Models (VLMs) have enabled visual recognition of a potentially unlimited set of categories described by text prompts.
For the best visual recognition performance, these models still require tuning to better fit the data distributions of the downstream tasks.
arXiv Detail & Related papers (2023-09-13T08:59:54Z) - ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models [69.50316788263433]
We propose ProbVLM, a probabilistic adapter that estimates probability distributions for the embeddings of pre-trained vision-language models.
We quantify the calibration of embedding uncertainties in retrieval tasks and show that ProbVLM outperforms other methods.
We present a novel technique for visualizing the embedding distributions using a large-scale pre-trained latent diffusion model.
arXiv Detail & Related papers (2023-07-01T18:16:06Z)
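As noted in the Self-Filter entry above, the second stage ranks instructions by difficulty while penalizing redundancy. The sketch below illustrates one plausible way to implement such a selection step; the greedy loop, the cosine-similarity penalty, and all names are assumptions for exposition and do not reproduce the paper's scoring network or exact criterion.

```python
# Illustrative sketch of difficulty-based selection with a diversity penalty
# (assumed greedy formulation; not the Self-Filter implementation).
import numpy as np

def select_diverse_hard(scores, embeddings, k, penalty=0.5):
    """Greedily pick k samples with the highest difficulty score, discounting each
    candidate by its maximum cosine similarity to the samples already selected."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    chosen, remaining = [], set(range(len(scores)))
    while remaining and len(chosen) < k:
        best, best_val = None, -np.inf
        for i in remaining:
            redundancy = max((float(emb[i] @ emb[j]) for j in chosen), default=0.0)
            val = scores[i] - penalty * redundancy   # hard but not redundant
            if val > best_val:
                best, best_val = i, val
        chosen.append(best)
        remaining.remove(best)
    return chosen

rng = np.random.default_rng(0)
difficulty = rng.random(100)             # stand-in for the score net's outputs
features = rng.normal(size=(100, 32))    # stand-in for instruction embeddings
print(select_diverse_hard(difficulty, features, k=10))
```

Raising the penalty trades difficulty for diversity: the selected subset becomes less redundant but may contain easier instructions.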
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.