LOVM: Language-Only Vision Model Selection
- URL: http://arxiv.org/abs/2306.08893v1
- Date: Thu, 15 Jun 2023 06:53:05 GMT
- Title: LOVM: Language-Only Vision Model Selection
- Authors: Orr Zohar, Shih-Cheng Huang, Kuan-Chieh Wang, Serena Yeung
- Abstract summary: We introduce a new task LOVM: Language-Only Vision Model Selection, where methods are expected to perform both model selection and performance prediction.
We then introduce an extensive LOVM benchmark consisting of ground-truth evaluations of 35 pre-trained VLMs on 23 datasets.
- Score: 13.857583570058392
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained multi-modal vision-language models (VLMs) are becoming
increasingly popular due to their exceptional performance on downstream vision
applications, particularly in the few- and zero-shot settings. However,
selecting the best-performing VLM for some downstream applications is
non-trivial, as it is dataset and task-dependent. Meanwhile, the exhaustive
evaluation of all available VLMs on a novel application is not only time and
computationally demanding but also necessitates the collection of a labeled
dataset for evaluation. As the number of open-source VLM variants increases,
there is a need for an efficient model selection strategy that does not require
access to a curated evaluation dataset. This paper proposes a novel task and
benchmark for efficiently evaluating VLMs' zero-shot performance on downstream
applications without access to the downstream task dataset. Specifically, we
introduce a new task LOVM: Language-Only Vision Model Selection, where methods
are expected to perform both model selection and performance prediction based
solely on a text description of the desired downstream application. We then
introduce an extensive LOVM benchmark consisting of ground-truth evaluations
of 35 pre-trained VLMs on 23 datasets, where methods are expected to rank the
pre-trained VLMs and predict their zero-shot performance.
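Below is a minimal sketch of how a LOVM-style method might be scored on a single downstream task, assuming ranking quality is measured with Kendall's rank correlation and performance prediction with mean absolute error; the benchmark's exact metric definitions may differ, and the model accuracies shown are purely illustrative.

```python
# Minimal sketch of LOVM-style scoring on one downstream task.
# Assumed metrics: Kendall's tau for ranking quality and mean absolute error
# for performance prediction; the benchmark's exact definitions may differ.
import numpy as np
from scipy.stats import kendalltau


def score_lovm_method(pred_acc: np.ndarray, true_acc: np.ndarray) -> dict:
    """`pred_acc` holds one text-only predicted zero-shot accuracy per
    candidate VLM; `true_acc` holds the ground-truth evaluations."""
    tau, _ = kendalltau(pred_acc, true_acc)            # ranking agreement
    mae = float(np.mean(np.abs(pred_acc - true_acc)))  # prediction error
    return {"kendall_tau": float(tau), "mae": mae}


# Toy example with five hypothetical VLMs on one task.
predicted = np.array([0.62, 0.71, 0.55, 0.68, 0.49])
measured = np.array([0.60, 0.74, 0.58, 0.65, 0.50])
print(score_lovm_method(predicted, measured))
```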
Related papers
- Pre-Trained Vision-Language Model Selection and Reuse for Downstream Tasks [48.67303250592189]
This paper proposes a novel paradigm, called Model Label Learning (MLL), for selecting and reusing VLMs for downstream tasks.
The proposal is highly computationally efficient and growable, since the model labeling process is completed independently of the target task.
arXiv Detail & Related papers (2025-01-30T11:10:46Z) - Active Prompt Learning with Vision-Language Model Priors [9.173468790066956]
We introduce a class-guided clustering that leverages the pre-trained image and text encoders of vision-language models.
We propose a budget-saving selective querying based on adaptive class-wise thresholds.
arXiv Detail & Related papers (2024-11-23T02:34:33Z) - Active Learning for Vision-Language Models [29.309503214127016]
We propose a novel active learning (AL) framework that enhances the zero-shot classification performance of vision-language models (VLMs).
Our approach first calibrates the predicted entropy of VLMs and then utilizes a combination of self-uncertainty and neighbor-aware uncertainty to calculate a reliable uncertainty measure for active sample selection.
Our experiments show that the proposed approach outperforms existing AL approaches on several image classification datasets.
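The blurb above describes combining a calibrated self-uncertainty with a neighbor-aware term. The sketch below illustrates one plausible reading of that idea, mixing temperature-scaled softmax entropy with the mean entropy of each sample's k nearest feature-space neighbors; the temperature, k, and mixing weight alpha are illustrative assumptions, not the paper's actual formulation.

```python
# Hedged sketch: combine calibrated self-uncertainty (temperature-scaled
# softmax entropy) with neighbor-aware uncertainty (mean entropy of the k
# nearest neighbors in feature space). Temperature, k, and alpha are
# illustrative choices, not the paper's formulation.
import numpy as np


def calibrated_entropy(logits: np.ndarray, temperature: float = 1.5) -> np.ndarray:
    z = logits / temperature                            # simple calibration assumption
    p = np.exp(z - z.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=1)


def combined_uncertainty(logits, features, k=5, alpha=0.5):
    self_u = calibrated_entropy(logits)                 # per-sample self-uncertainty
    dists = np.linalg.norm(features[:, None] - features[None, :], axis=-1)
    nn = np.argsort(dists, axis=1)[:, 1:k + 1]          # skip the sample itself
    neighbor_u = self_u[nn].mean(axis=1)                # neighbor-aware term
    return alpha * self_u + (1 - alpha) * neighbor_u


# Query the most uncertain unlabeled samples.
rng = np.random.default_rng(0)
logits = rng.normal(size=(100, 10))
features = rng.normal(size=(100, 32))
query_idx = np.argsort(-combined_uncertainty(logits, features))[:8]
```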
arXiv Detail & Related papers (2024-10-29T16:25:50Z) - SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z) - RAVEN: Multitask Retrieval Augmented Vision-Language Learning [5.1583788731239455]
The scaling of large language models to encode all the world's knowledge is unsustainable and has exacerbated resource barriers.
Retrieval-Augmented Generation (RAG) presents a potential solution, yet its application to vision-language models (VLMs) is underexplored.
This paper introduces RAVEN, a retrieval-augmented VLM framework that enhances base VLMs through efficient, task-specific fine-tuning.
arXiv Detail & Related papers (2024-06-27T13:08:35Z) - Bridge the Modality and Capability Gaps in Vision-Language Model Selection [62.26769826687365]
Vision Language Models (VLMs) excel in zero-shot image classification by pairing images with textual category names.
To better reuse the VLM resource, a promising strategy is selecting appropriate Pre-Trained VLMs from the VLM Zoo.
We analyze two inherent challenges in assessing a VLM's ability in this Language-Only VLM selection setting.
We propose VLM Selection With gAp Bridging to mitigate the negative impact of these two gaps.
arXiv Detail & Related papers (2024-03-20T17:54:58Z) - Your Vision-Language Model Itself Is a Strong Filter: Towards High-Quality Instruction Tuning with Data Selection [59.11430077029321]
We introduce a novel dataset selection method, Self-Filter, for vision-language models (VLMs).
In the first stage, we devise a scoring network to evaluate the difficulty of training instructions, which is co-trained with the VLM.
In the second stage, we use the trained score net to measure the difficulty of each instruction, select the most challenging samples, and penalize similar samples to encourage diversity.
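As a rough illustration of the second stage described above, the sketch below greedily selects high-difficulty instructions while down-weighting candidates similar to already-selected ones; the difficulty scores stand in for the output of the co-trained scoring network, and the cosine-similarity penalty and its weight are assumptions rather than the paper's exact procedure.

```python
# Hedged sketch of difficulty-plus-diversity selection: greedily pick the
# highest-scoring (most challenging) instructions and penalize remaining
# candidates by their cosine similarity to each pick. Scores stand in for
# the co-trained scoring network's output; the penalty weight is assumed.
import numpy as np


def select_diverse_hard(scores: np.ndarray, embeddings: np.ndarray,
                        budget: int, penalty: float = 0.5) -> list:
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    adjusted = scores.astype(float).copy()
    selected = []
    for _ in range(budget):
        i = int(np.argmax(adjusted))
        selected.append(i)
        adjusted[i] = -np.inf                           # never re-select
        sim = embeddings @ embeddings[i]                # similarity to the pick
        adjusted -= penalty * np.clip(sim, 0.0, None)   # encourage diversity
    return selected


# Example: keep 3 of 10 candidate instructions.
rng = np.random.default_rng(1)
chosen = select_diverse_hard(rng.random(10), rng.normal(size=(10, 16)), budget=3)
```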
arXiv Detail & Related papers (2024-02-19T20:08:48Z) - MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
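The sketch below gives one hedged reading of the adaptive bad-case sampling idea described above: allocate the next round's data-generation budget across data types in proportion to the model's current per-type error rate. The proportional rule, smoothing term, and data-type names are illustrative assumptions, not the paper's exact module.

```python
# Hedged sketch of adaptive bad-case sampling: give data types with higher
# current error rates a larger share of newly generated data. The
# proportional rule and smoothing term are illustrative assumptions.
import numpy as np


def adaptive_sampling_ratios(per_type_error: dict, smooth: float = 0.05) -> dict:
    types = list(per_type_error)
    errs = np.array([per_type_error[t] for t in types], dtype=float) + smooth
    ratios = errs / errs.sum()                          # normalize to a sampling ratio
    return dict(zip(types, ratios.round(3)))


# Example with hypothetical data types: the model struggles most with counting
# questions, so that type receives the largest share of newly generated data.
print(adaptive_sampling_ratios({"ocr": 0.10, "counting": 0.35, "grounding": 0.20}))
```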
arXiv Detail & Related papers (2023-08-25T01:41:04Z) - SimVLM: Simple Visual Language Model Pretraining with Weak Supervision [48.98275876458666]
We present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM).
SimVLM reduces the training complexity by exploiting large-scale weak supervision.
It achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks.
arXiv Detail & Related papers (2021-08-24T18:14:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality or accuracy of the information presented and is not responsible for any consequences of its use.