Related papers: Will It Zero-Shot?: Predicting Zero-Shot Classification Performance For Arbitrary Queries

Will It Zero-Shot?: Predicting Zero-Shot Classification Performance For Arbitrary Queries

URL: http://arxiv.org/abs/2601.17535v2
Date: Tue, 27 Jan 2026 18:04:35 GMT
Title: Will It Zero-Shot?: Predicting Zero-Shot Classification Performance For Arbitrary Queries
Authors: Kevin Robbins, Xiaotong Liu, Yu Wu, Le Sun, Grady McPeak, Abby Stylianou, Robert Pless,
Abstract summary: We build on prior work using text-only comparisons to evaluate how well a model works for a given natural language task.<n>We explore approaches that also generate synthetic images relevant to that task to evaluate and refine the prediction of zero-shot accuracy.
Score: 19.511404894563455
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Vision-Language Models like CLIP create aligned embedding spaces for text and images, making it possible for anyone to build a visual classifier by simply naming the classes they want to distinguish. However, a model that works well in one domain may fail in another, and non-expert users have no straightforward way to assess whether their chosen VLM will work on their problem. We build on prior work using text-only comparisons to evaluate how well a model works for a given natural language task, and explore approaches that also generate synthetic images relevant to that task to evaluate and refine the prediction of zero-shot accuracy. We show that generated imagery to the baseline text-only scores substantially improves the quality of these predictions. Additionally, it gives a user feedback on the kinds of images that were used to make the assessment. Experiments on standard CLIP benchmark datasets demonstrate that the image-based approach helps users predict, without any labeled examples, whether a VLM will be effective for their application.

Related papers

AlignBench: Benchmarking Fine-Grained Image-Text Alignment with Synthetic Image-Caption Pairs [27.133240420463807]
AlignBench is a benchmark that provides a new indicator of image-text alignment.<n>It evaluates detailed image-caption pairs generated by diverse image-to-text and text-to-image models.<n>Each sentence is annotated for correctness, enabling direct assessment of VLMs as alignment evaluators.
arXiv Detail & Related papers (2025-11-25T17:19:47Z)
SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining. SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation. We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense predictions tasks like detection and segmentation.
arXiv Detail & Related papers (2023-10-20T08:44:47Z)
Revisiting the Role of Language Priors in Vision-Language Models [90.0317841097143]
Vision-language models (VLMs) are applied to a variety of visual understanding tasks in a zero-shot fashion, without any fine-tuning. We study $textitgenerative VLMs$ that are trained for next-word generation given an image. We explore their zero-shot performance on the illustrative task of image-text retrieval across 8 popular vision-language benchmarks.
arXiv Detail & Related papers (2023-06-02T19:19:43Z)
Text-to-Image Diffusion Models are Zero-Shot Classifiers [8.26990105697146]
We investigate text-to-image diffusion models by proposing a method for evaluating them as zero-shot classifiers. We apply our method to Stable Diffusion and Imagen, using it to probe fine-grained aspects of the models' knowledge. They perform competitively with CLIP on a wide range of zero-shot image classification datasets.
arXiv Detail & Related papers (2023-03-27T14:15:17Z)
Exploring CLIP for Assessing the Look and Feel of Images [87.97623543523858]
We introduce Contrastive Language-Image Pre-training (CLIP) models for assessing both the quality perception (look) and abstract perception (feel) of images in a zero-shot manner. Our results show that CLIP captures meaningful priors that generalize well to different perceptual assessments.
arXiv Detail & Related papers (2022-07-25T17:58:16Z)
Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning [88.5382122413913]
We study whether language supervision can result in vision models with more transferable representations than traditional image-only methods. We find that image-only methods do not match CLIP's transfer performance, even when they are trained with more image data. Motivated by our findings, we devise simple prescriptions to enable CLIP to better leverage the language information present in existing pre-training datasets.
arXiv Detail & Related papers (2022-07-15T17:50:51Z)
Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs. Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning. We present in this paper a novel scheme based on prompt to train the UIC model, making best use of the powerful generalization ability.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
No Token Left Behind: Explainability-Aided Image Classification and Generation [79.4957965474334]
We present a novel explainability-based approach, which adds a loss term to ensure that CLIP focuses on all relevant semantic parts of the input. Our method yields an improvement in the recognition rate, without additional training or fine-tuning.
arXiv Detail & Related papers (2022-04-11T07:16:39Z)
Detection and Captioning with Unseen Object Classes [12.894104422808242]
Test images may contain visual objects with no corresponding visual or textual training examples. We propose a detection-driven approach based on a generalized zero-shot detection model and a template-based sentence generation model. Our experiments show that the proposed zero-shot detection model obtains state-of-the-art performance on the MS-COCO dataset.
arXiv Detail & Related papers (2021-08-13T10:43:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.