Bridge the Modality and Capability Gaps in Vision-Language Model Selection
- URL: http://arxiv.org/abs/2403.13797v2
- Date: Sat, 02 Nov 2024 03:14:39 GMT
- Title: Bridge the Modality and Capability Gaps in Vision-Language Model Selection
- Authors: Chao Yi, Yu-Hang He, De-Chuan Zhan, Han-Jia Ye
- Abstract summary: Vision Language Models (VLMs) excel in zero-shot image classification by pairing images with textual category names.
To better reuse the VLM resource, a promising strategy is selecting appropriate Pre-Trained VLMs from the VLM Zoo.
We analyze two inherent challenges in assessing a VLM's ability in this Language-Only VLM selection: the "Modality Gap" and the "Capability Gap".
We propose VLM Selection With gAp Bridging (SWAB) to mitigate the negative impact of these two gaps.
- Score: 62.26769826687365
- License:
- Abstract: Vision Language Models (VLMs) excel in zero-shot image classification by pairing images with textual category names. The expanding variety of Pre-Trained VLMs enhances the likelihood of identifying a suitable VLM for specific tasks. To better reuse the VLM resource and fully leverage its potential on different zero-shot image classification tasks, a promising strategy is selecting appropriate Pre-Trained VLMs from the VLM Zoo, relying solely on the text data of the target dataset without access to the dataset's images. In this paper, we analyze two inherent challenges in assessing the ability of a VLM in this Language-Only VLM selection: the "Modality Gap" - the disparity in the VLM's embeddings across the two modalities, which makes text a less reliable substitute for images; and the "Capability Gap" - the discrepancy between the VLM's overall ranking and its ranking on the target dataset, which hinders direct prediction of a model's dataset-specific performance from its general performance. We propose VLM Selection With gAp Bridging (SWAB) to mitigate the negative impact of these two gaps. SWAB first adopts optimal transport to capture the relevance between open-source and target datasets with a transportation matrix. It then uses this matrix to transfer useful statistics of VLMs from the open-source datasets to the target dataset, bridging the two gaps. By bridging the two gaps to obtain better substitutes for test images, SWAB can accurately predict the performance ranking of different VLMs on the target task without needing the dataset's images. Experiments across various VLMs and image classification datasets validate SWAB's effectiveness.
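As a rough sketch of the transfer step described in the abstract, the snippet below computes an entropy-regularized transportation matrix between text embeddings of open-source and target class names and uses it to carry per-class statistics over to the target classes. It is only an illustration of the general idea, not the authors' SWAB code: the random embeddings, the cosine cost, the Sinkhorn regularization strength, and the placeholder "gap vector" statistic are all assumptions made for this example.

```python
import numpy as np

def sinkhorn(a, b, cost, reg=0.1, n_iters=200):
    """Entropy-regularized optimal transport via Sinkhorn iterations."""
    K = np.exp(-cost / reg)                  # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]       # transportation matrix

# Toy stand-ins for class-name text embeddings from a VLM's text encoder.
rng = np.random.default_rng(0)
src_emb = rng.normal(size=(50, 512))         # open-source dataset classes
tgt_emb = rng.normal(size=(10, 512))         # target dataset classes
src_emb /= np.linalg.norm(src_emb, axis=1, keepdims=True)
tgt_emb /= np.linalg.norm(tgt_emb, axis=1, keepdims=True)

# Cost = 1 - cosine similarity between class-name embeddings.
cost = 1.0 - src_emb @ tgt_emb.T

# Uniform marginals over source and target classes.
a = np.full(len(src_emb), 1.0 / len(src_emb))
b = np.full(len(tgt_emb), 1.0 / len(tgt_emb))
T = sinkhorn(a, b, cost)                     # shape (50, 10)

# Transfer a per-class statistic (a placeholder "gap vector" here) to each target
# class as a transport-weighted average over the source classes.
src_stats = rng.normal(size=(50, 512))
tgt_stats = (T / T.sum(axis=0, keepdims=True)).T @ src_stats   # shape (10, 512)
```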
Related papers
- Membership Inference Attacks against Large Vision-Language Models [40.996912464828696]
Large vision-language models (VLLMs) exhibit promising capabilities for processing multi-modal tasks across various application scenarios.
Their emergence also raises significant data security concerns, given the potential inclusion of sensitive information, such as private photos and medical records.
Detecting inappropriately used data in VLLMs remains a critical and unresolved issue.
arXiv Detail & Related papers (2024-11-05T08:35:08Z)
- MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models [85.30735602813093]
Multi-Image Augmented Direct Preference Optimization (MIA-DPO) is a visual preference alignment approach that effectively handles multi-image inputs.
MIA-DPO mitigates the scarcity of diverse multi-image training data by extending single-image data with unrelated images arranged in grid collages or pic-in-pic formats.
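For intuition, the following Pillow sketch shows how single-image samples could be extended into multi-image inputs via grid collages and pic-in-pic overlays, the two formats mentioned above. The tile size, grid shape, and inset placement are arbitrary illustrative choices, not the paper's actual settings.

```python
from PIL import Image

def grid_collage(paths, grid=(2, 2), tile=(224, 224)):
    """Arrange unrelated single images into one grid-collage image."""
    cols, rows = grid
    canvas = Image.new("RGB", (cols * tile[0], rows * tile[1]))
    for i, path in enumerate(paths[: cols * rows]):
        img = Image.open(path).convert("RGB").resize(tile)
        canvas.paste(img, ((i % cols) * tile[0], (i // cols) * tile[1]))
    return canvas

def pic_in_pic(base_path, inset_path, scale=0.3):
    """Overlay a small unrelated image onto a base image (pic-in-pic)."""
    base = Image.open(base_path).convert("RGB")
    w, h = base.size
    inset = Image.open(inset_path).convert("RGB").resize((int(w * scale), int(h * scale)))
    base.paste(inset, (w - inset.width, h - inset.height))  # bottom-right corner
    return base
```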
arXiv Detail & Related papers (2024-10-23T07:56:48Z)
- The Solution for CVPR2024 Foundational Few-Shot Object Detection Challenge [14.330962576584446]
This report introduces an enhanced method for the Foundational Few-Shot Object Detection (FSOD) task, leveraging the vision-language model (VLM) for object detection.
We propose the VLM+ framework, which integrates a multimodal large language model (MM-LLM) to generate referential expressions.
We use these referential expressions to generate pseudo-labels for all images in the training set and then combine them with the original labeled data to fine-tune the VLM.
arXiv Detail & Related papers (2024-06-18T03:03:02Z)
- Concept-skill Transferability-based Data Selection for Large Vision-Language Models [56.0725292404808]
We introduce COINCIDE, an effective and scalable data selection technique for training vision-language models.
We cluster the training data using internal activations from a small model, which identifies concept-skill compositions needed by a target LVLM.
Experiments demonstrate that COINCIDE achieves superior performance and data selection efficiency against 8 strong baselines.
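A minimal sketch of activation-clustering-based data selection in this spirit is shown below; the feature source, cluster count, and per-cluster sampling rule are assumptions for illustration rather than the paper's actual recipe.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_by_clusters(features, budget, n_clusters=50, seed=0):
    """Cluster training examples by activation features and sample evenly
    across clusters to cover diverse concept-skill groups."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(features)
    rng = np.random.default_rng(seed)
    selected = []
    per_cluster = max(1, budget // n_clusters)
    for c in range(n_clusters):
        members = np.flatnonzero(km.labels_ == c)
        take = min(per_cluster, len(members))
        selected.extend(rng.choice(members, size=take, replace=False))
    return np.array(selected[:budget])

# features: (num_examples, dim) activations from a small reference model (placeholder here).
features = np.random.default_rng(0).normal(size=(1000, 256))
subset_idx = select_by_clusters(features, budget=200)
```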
arXiv Detail & Related papers (2024-06-16T16:15:20Z)
- Why are Visually-Grounded Language Models Bad at Image Classification? [39.76294811955341]
We revisit the image classification task using visually-grounded language models (VLMs) such as GPT-4V and LLaVA.
We find that existing proprietary and public VLMs significantly underperform CLIP on standard image classification benchmarks like ImageNet.
Our analysis reveals that the primary cause is data-related: critical information for image classification is encoded in the VLM's latent space but can only be effectively decoded with enough training data.
arXiv Detail & Related papers (2024-05-28T17:57:06Z)
- CLAMP: Contrastive LAnguage Model Prompt-tuning [89.96914454453791]
We show that large language models can achieve good image classification performance when adapted with contrastive prompt-tuning.
Our approach beats state-of-the-art mLLMs by 13% and slightly outperforms contrastive learning with a custom text model.
arXiv Detail & Related papers (2023-12-04T05:13:59Z)
- Leveraging Vision-Language Models for Improving Domain Generalization in Image Classification [35.277880733198586]
Vision-Language Models (VLMs) are trained on large amounts of image-text pairs, resulting in remarkable generalization across several data distributions.
We propose Vision-Language to Vision - Align, Distill, Predict (VL2V-ADiP), which first aligns the vision and language modalities of the teacher model with the vision modality of a pre-trained student model.
This maximally retains the pre-trained features of the student, while also incorporating the rich representations of the VLM image encoder and the superior generalization of the text embeddings.
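One way to picture the alignment step is a projection head that maps student features into the VLM embedding space and pulls them toward both the VLM's image embedding and the class-text embedding. The sketch below is a hypothetical simplification with made-up dimensions and a plain cosine loss, not the paper's training objective.

```python
import torch
import torch.nn.functional as F

def align_distill_loss(student_feat, vlm_img_emb, vlm_txt_emb, labels, proj):
    """Pull projected student features toward the VLM image embedding of the same
    sample and the VLM text embedding of its class (cosine alignment)."""
    z = F.normalize(proj(student_feat), dim=-1)
    img_loss = 1 - (z * F.normalize(vlm_img_emb, dim=-1)).sum(-1).mean()
    txt_loss = 1 - (z * F.normalize(vlm_txt_emb[labels], dim=-1)).sum(-1).mean()
    return img_loss + txt_loss

proj = torch.nn.Linear(2048, 512)        # hypothetical student feature dim -> VLM embedding dim
student_feat = torch.randn(8, 2048)      # features from the pre-trained student backbone
vlm_img_emb = torch.randn(8, 512)        # frozen VLM image embeddings for the same batch
vlm_txt_emb = torch.randn(10, 512)       # frozen VLM text embeddings, one per class
labels = torch.randint(0, 10, (8,))
loss = align_distill_loss(student_feat, vlm_img_emb, vlm_txt_emb, labels, proj)
loss.backward()
```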
arXiv Detail & Related papers (2023-10-12T11:59:54Z)
- TAP: Targeted Prompting for Task Adaptive Generation of Textual Training Instances for Visual Classification [28.72126911321771]
Vision and Language Models (VLMs) have enabled visual recognition of a potentially unlimited set of categories described by text prompts.
For the best visual recognition performance, these models still require tuning to better fit the data distributions of the downstream tasks.
arXiv Detail & Related papers (2023-09-13T08:59:54Z)
- ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models [69.50316788263433]
We propose ProbVLM, a probabilistic adapter that estimates probability distributions for the embeddings of pre-trained vision-language models.
We quantify the calibration of embedding uncertainties in retrieval tasks and show that ProbVLM outperforms other methods.
We present a novel technique for visualizing the embedding distributions using a large-scale pre-trained latent diffusion model.
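As a simplified picture of a probabilistic adapter, the PyTorch sketch below predicts a mean and variance for each frozen embedding and trains with a Gaussian negative log-likelihood against the paired modality's embedding; the actual ProbVLM parameterization and objective differ, so treat the head design, dimensions, and loss as assumptions.

```python
import torch
import torch.nn as nn

class ProbabilisticAdapter(nn.Module):
    """Maps a frozen (deterministic) embedding to a mean and log-variance,
    turning point embeddings into a simple Gaussian distribution."""
    def __init__(self, dim=512, hidden=256):
        super().__init__()
        self.mu_head = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.logvar_head = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, z):
        return self.mu_head(z), self.logvar_head(z)

# Training signal: Gaussian negative log-likelihood of the paired modality's
# embedding under the predicted distribution (a stand-in for the paper's objective).
adapter = ProbabilisticAdapter()
img_emb = torch.randn(8, 512)        # frozen image embeddings (placeholder)
txt_emb = torch.randn(8, 512)        # frozen text embeddings of the paired captions
mu, logvar = adapter(img_emb)
nll = 0.5 * ((txt_emb - mu) ** 2 / logvar.exp() + logvar).sum(dim=-1).mean()
nll.backward()
```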
arXiv Detail & Related papers (2023-07-01T18:16:06Z)
- Adapting Pre-trained Language Models to Vision-Language Tasks via Dynamic Visual Prompting [83.21164539349273]
Pre-trained language models (PLMs) have played an increasing role in multimedia research.
In this paper, we focus on exploring PLMs as a stand-alone model for vision-language reasoning tasks.
We propose a novel transfer learning approach for PLMs, termed Dynamic Visual Prompting (DVP).
arXiv Detail & Related papers (2023-06-01T07:19:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.