Underrepresented in Foundation Model Pretraining Data? A One-Shot Probe
- URL: http://arxiv.org/abs/2603.04346v1
- Date: Wed, 04 Mar 2026 18:07:23 GMT
- Title: Underrepresented in Foundation Model Pretraining Data? A One-Shot Probe
- Authors: Chris Vorster, Mayug Maniparambil, Noel E. O'Connor, Noel Murphy, Derek Molloy
- Abstract summary: We propose a method to predict a Vision-Language Foundation Model's zero-shot accuracy on a target domain using only a single labelled image per class. We demonstrate our method's performance across five diverse datasets, including standard benchmark datasets and underrepresented datasets from Africa.
- Score: 8.707753549613766
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large-scale Vision-Language Foundation Models (VLFMs), such as CLIP, now underpin a wide range of computer vision research and applications. VLFMs are often adapted to various domain-specific tasks. However, VLFM performance on novel, specialised, or underrepresented domains remains inconsistent. Evaluating VLFMs typically requires labelled test sets, which are often unavailable for niche domains of interest, particularly those from the Global South. We address this gap by proposing a highly data-efficient method to predict a VLFM's zero-shot accuracy on a target domain using only a single labelled image per class. Our approach uses a Large Language Model to generate plausible counterfactual descriptions of a given image. By measuring the VLFM's ability to distinguish the correct description from these hard negatives, we engineer features that capture the VLFM's discriminative power in its shared embedding space. A linear regressor trained on these similarity scores estimates the VLFM's zero-shot test accuracy across various visual domains with a Pearson-r correlation of 0.96. We demonstrate our method's performance across five diverse datasets, including standard benchmark datasets and underrepresented datasets from Africa. Our work provides a low-cost, reliable tool for probing VLFMs, enabling researchers and practitioners to make informed decisions about data annotation efforts before committing significant resources. The model training code, generated captions and counterfactuals are released here: https://github.com/chris-vorster/PreLabellingProbe.
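The probe is simple enough to sketch. Below is a minimal, illustrative version assuming CLIP loaded through Hugging Face transformers; the LLM that generates the counterfactual captions is elided, and the two features shown (the true caption's softmax probability and its margin over the hardest negative) are plausible stand-ins rather than the exact released feature set, which is available in the linked repository.

```python
# Minimal sketch of the one-shot probe, assuming CLIP via Hugging Face
# transformers. The LLM that writes counterfactual captions is elided, and
# the two features below (true-caption probability and its margin over the
# hardest negative) are illustrative stand-ins for the released feature set.
import numpy as np
import torch
from sklearn.linear_model import LinearRegression
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def probe_features(image, true_caption, counterfactuals):
    """Similarity features for one labelled image against its true caption
    and LLM-generated hard-negative captions."""
    texts = [true_caption] + counterfactuals
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = model(**inputs).logits_per_image[0]   # (1 + num_counterfactuals,)
    p_true = sims.softmax(dim=-1)[0].item()          # prob. mass on true caption
    margin = (sims[0] - sims[1:].max()).item()       # gap to hardest negative
    return [p_true, margin]

# Average probe_features over the one-image-per-class probe set of each
# dataset, then regress onto zero-shot accuracies measured where labelled
# test sets do exist. The numbers below are placeholders, not real results.
X_train = np.array([[0.90, 4.1], [0.55, 1.2], [0.75, 2.7]])
y_train = np.array([0.88, 0.41, 0.70])
regressor = LinearRegression().fit(X_train, y_train)
predicted_zero_shot_accuracy = regressor.predict(np.array([[0.80, 3.0]]))
```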
Related papers
- LandSegmenter: Towards a Flexible Foundation Model for Land Use and Land Cover Mapping [13.59442852640533]
Land Use and Land Cover (LULC) mapping is a fundamental task in Earth Observation. Recent advances in foundation models (FMs) offer promising opportunities for building universal models. We propose LandSegmenter, an LULC FM framework that addresses challenges at three stages: the input, model, and output levels.
arXiv Detail & Related papers (2025-11-11T12:08:26Z)
- Adapting Vision-Language Models Without Labels: A Comprehensive Survey [74.17944178027015]
Vision-Language Models (VLMs) have demonstrated remarkable generalization capabilities across a wide range of tasks. Recent research has increasingly focused on unsupervised adaptation methods that do not rely on labeled data. We propose a taxonomy based on the availability and nature of unlabeled visual data, categorizing existing approaches into four key paradigms.
arXiv Detail & Related papers (2025-08-07T16:27:37Z)
- Prompt Tuning Vision Language Models with Margin Regularizer for Few-Shot Learning under Distribution Shifts [13.21626568246313]
We analyze whether vision-language foundation models can be adapted to target datasets with very different distributions and classes. We propose a novel prompt-tuning method, PromptMargin, for adapting such large-scale VLMs directly on the few target samples. PromptMargin effectively tunes the text as well as the visual prompts for this task, and has two main modules (a rough sketch of the margin idea follows this entry).
arXiv Detail & Related papers (2025-05-21T13:26:56Z)
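As a rough illustration of the margin-regularization idea (not PromptMargin's exact formulation), a generic hinge penalty on pairwise cosine similarities between class prompt embeddings might look like this:

```python
# Generic hinge penalty on pairwise cosine similarity between class prompt
# embeddings -- an illustration of margin regularization, not PromptMargin's
# exact loss.
import torch
import torch.nn.functional as F

def margin_regularizer(text_embeds: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """Penalize class prompt embeddings whose cosine similarity exceeds
    1 - margin, pushing class prototypes apart."""
    z = F.normalize(text_embeds, dim=-1)              # (num_classes, dim)
    cos = z @ z.t()                                   # pairwise similarities
    off_diag = cos[~torch.eye(len(z), dtype=torch.bool, device=z.device)]
    return F.relu(off_diag - (1.0 - margin)).mean()

# Typical use: total loss = cross-entropy on the few shots + lam * regularizer
reg = margin_regularizer(torch.randn(10, 512))        # 10 classes, 512-d prompts
```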
- Benchmarking Feature Upsampling Methods for Vision Foundation Models using Interactive Segmentation [24.531539125814877]
Vision Foundation Models (VFMs) are large-scale, pre-trained models that serve as general-purpose backbones for various computer vision tasks, but they produce features at coarse spatial resolution. One way to tackle this limitation is by employing a task-agnostic feature upsampling module that refines the resolution of VFM features (a bilinear baseline is sketched after this entry). Our benchmarking experiments show that selecting appropriate upsampling strategies significantly improves VFM feature quality.
arXiv Detail & Related papers (2025-05-04T11:59:26Z)
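For context on what such an upsampling module replaces, here is the plain bilinear baseline for densifying ViT patch features; shapes assume a standard ViT backbone:

```python
# Plain bilinear baseline that learned upsamplers in such a benchmark would
# replace; shapes assume a standard ViT backbone.
import torch
import torch.nn.functional as F

def upsample_vit_features(patch_tokens, grid_hw, out_hw):
    """Reshape (B, N, C) ViT patch tokens into a (B, C, H, W) map and
    bilinearly upsample it to the target resolution."""
    b, n, c = patch_tokens.shape
    h, w = grid_hw
    assert n == h * w, "token count must match the patch grid"
    fmap = patch_tokens.transpose(1, 2).reshape(b, c, h, w)
    return F.interpolate(fmap, size=out_hw, mode="bilinear", align_corners=False)

# ViT-B/16 on a 224x224 input yields a 14x14 grid of 768-d patch tokens
dense = upsample_vit_features(torch.randn(1, 196, 768), (14, 14), (224, 224))
```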
- SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models [74.40683913645731]
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications. Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth. Analysis of these prompt scores reveals VLM biases and "AND"/"OR" signal ambiguities, notably that maximum scores are surprisingly suboptimal compared to second-highest scores (a toy version of this comparison follows the entry).
arXiv Detail & Related papers (2025-02-24T07:15:05Z)
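The reported max-versus-second-highest effect is easy to state in code. The toy aggregation below exposes the choice as a flag; SPARC's actual fusion is adaptive, so this is illustrative only:

```python
# Toy aggregation over a per-label prompt-score matrix; SPARC's actual
# fusion is adaptive, this only illustrates the reported effect.
import torch

def aggregate_label_scores(sims: torch.Tensor, use_second_highest: bool = True):
    """sims: (num_labels, num_prompts) black-box VLM scores, num_prompts >= 2.
    Returns one score per label, taking either the top or the second-highest
    prompt score across the prompt variants."""
    top2 = sims.topk(k=2, dim=-1).values              # (num_labels, 2)
    return top2[:, 1] if use_second_highest else top2[:, 0]

scores = aggregate_label_scores(torch.randn(80, 5))   # e.g. 80 labels, 5 prompts
```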
- Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction [80.67150791183126]
Pre-trained vision-language models (VLMs) have demonstrated impressive zero-shot recognition capability, but still underperform in dense prediction tasks. We propose DenseVLM, a framework designed to learn unbiased region-language alignment from powerful pre-trained VLM representations. We show that DenseVLM can directly replace the original VLM in open-vocabulary object detection and image segmentation methods.
arXiv Detail & Related papers (2024-12-09T06:34:23Z)
- Membership Inference Attacks against Large Vision-Language Models [40.996912464828696]
Large vision-language models (VLLMs) exhibit promising capabilities for processing multi-modal tasks across various application scenarios.
Their emergence also raises significant data security concerns, given the potential inclusion of sensitive information, such as private photos and medical records.
Detecting inappropriately used data in VLLMs remains a critical and unresolved issue.
arXiv Detail & Related papers (2024-11-05T08:35:08Z)
- VL4AD: Vision-Language Models Improve Pixel-wise Anomaly Detection [5.66050466694651]
We propose incorporating Vision-Language (VL) encoders into existing anomaly detectors to leverage the semantically broad VL pre-training for improved outlier awareness.
We also propose a new scoring function that enables data- and training-free outlier supervision via textual prompts (a generic sketch follows this entry).
The resulting VL4AD model achieves competitive performance on widely used benchmark datasets.
arXiv Detail & Related papers (2024-09-25T20:12:10Z)
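A data- and training-free outlier score driven by textual prompts can be reconstructed generically: compare each pixel embedding's best match among known-class prompts with its best match among generic outlier prompts. This sketch is an assumption-laden stand-in, not VL4AD's exact scoring function:

```python
# Generic, training-free outlier score from textual prompts: each pixel
# embedding is compared to known-class prompts and to generic outlier
# prompts. An illustrative reconstruction, not VL4AD's exact function.
import torch
import torch.nn.functional as F

def text_prompt_outlier_score(pixel_embeds, inlier_text, outlier_text):
    """pixel_embeds: (N, D); inlier_text: (K, D); outlier_text: (M, D).
    Higher return values indicate more anomalous pixels."""
    p = F.normalize(pixel_embeds, dim=-1)
    s_in = (p @ F.normalize(inlier_text, dim=-1).t()).max(dim=-1).values
    s_out = (p @ F.normalize(outlier_text, dim=-1).t()).max(dim=-1).values
    return s_out - s_in

score = text_prompt_outlier_score(torch.randn(64, 512),
                                  torch.randn(19, 512),   # e.g. known classes
                                  torch.randn(4, 512))    # generic outlier prompts
```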
- Revisiting Few-Shot Object Detection with Vision-Language Models [49.79495118650838]
We revisit the task of few-shot object detection (FSOD) in the context of recent foundational vision-language models (VLMs).
We propose Foundational FSOD, a new benchmark protocol that evaluates detectors pre-trained on any external data.
We discuss our recent CVPR 2024 Foundational FSOD competition and share insights from the community.
arXiv Detail & Related papers (2023-12-22T07:42:00Z)
- Localized Symbolic Knowledge Distillation for Visual Commonsense Models [150.18129140140238]
We build Localized Visual Commonsense models, which allow users to specify (multiple) regions as input.
We train our model by sampling localized commonsense knowledge from a large language model.
We find that training on the localized commonsense corpus can successfully distill existing vision-language models to support a reference-as-input interface.
arXiv Detail & Related papers (2023-12-08T05:23:50Z)
- ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models [69.50316788263433]
We propose ProbVLM, a probabilistic adapter that estimates probability distributions for the embeddings of pre-trained vision-language models.
We quantify the calibration of embedding uncertainties in retrieval tasks and show that ProbVLM outperforms other methods.
We present a novel technique for visualizing the embedding distributions using a large-scale pre-trained latent diffusion model (a toy adapter sketch follows this entry).
arXiv Detail & Related papers (2023-07-01T18:16:06Z)
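The adapter idea admits a toy stand-in: a small head over frozen VLM embeddings that outputs distribution parameters instead of a point estimate. A diagonal Gaussian keeps the sketch short; ProbVLM's actual parameterization is richer:

```python
# Toy stand-in for a probabilistic adapter on frozen VLM embeddings: a small
# head that outputs a diagonal-Gaussian mean and variance per embedding.
# ProbVLM's actual parameterization is richer than this.
import torch
import torch.nn as nn

class ProbabilisticAdapter(nn.Module):
    def __init__(self, dim: int = 512, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, dim)
        self.log_var = nn.Linear(hidden, dim)

    def forward(self, z: torch.Tensor):
        h = self.trunk(z)
        return self.mu(h), self.log_var(h).exp()      # mean, variance

mu, var = ProbabilisticAdapter()(torch.randn(4, 512)) # 4 frozen CLIP embeddings
```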
This list is automatically generated from the titles and abstracts of the papers on this site.