Rethinking Few Shot CLIP Benchmarks: A Critical Analysis in the Inductive Setting
- URL: http://arxiv.org/abs/2507.20834v1
- Date: Mon, 28 Jul 2025 13:41:24 GMT
- Title: Rethinking Few Shot CLIP Benchmarks: A Critical Analysis in the Inductive Setting
- Authors: Alexey Kravets, Da Chen, Vinay P. Namboodiri
- Abstract summary: Several methods have shown improved performance of CLIP using few-shot examples. We argue that this mode of evaluation does not provide a true indication of the inductive generalization ability. We propose a pipeline that uses an unlearning technique to obtain true inductive baselines.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: CLIP is a foundational model with transferable classification performance in the few-shot setting, and several methods have shown improved performance of CLIP using few-shot examples. So far, however, all of these techniques have been benchmarked on standard few-shot datasets. We argue that this mode of evaluation does not provide a true indication of inductive generalization ability from few-shot examples: since most of these datasets have been seen by the CLIP model during pretraining, the resulting setting is better termed partially transductive. To address this, we propose a pipeline that uses an unlearning technique to obtain true inductive baselines. In this new inductive setting, methods show a significant drop in performance (-55% on average across 13 baselines on multiple datasets). We validate the unlearning technique using oracle baselines. We further propose an improved few-shot classification technique that consistently obtains state-of-the-art performance over 13 recent baseline methods in a comprehensive analysis of 5880 experiments, varying the datasets, the number of few-shot examples, the unlearning setting, and the random seeds. In summary, we identify an issue with the evaluation of CLIP-based few-shot classification, provide a solution using unlearning, propose new benchmarks, and provide an improved method.
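A minimal sketch of the evaluation pipeline the abstract describes, in Python. The helpers unlearn_classes, few_shot_adapt, and evaluate are hypothetical stand-ins for the paper's actual unlearning and adaptation components, which the abstract does not specify.

```python
# Hypothetical pipeline sketch; unlearn_classes, few_shot_adapt and evaluate
# stand in for the paper's actual unlearning and adaptation implementations.
import copy

def inductive_benchmark(clip_model, dataset, k_shot, seed):
    """Evaluate a few-shot method on a CLIP model that has unlearned the dataset."""
    # 1. Unlearn the target classes so CLIP can no longer rely on pretraining
    #    exposure to this dataset (the "partially transductive" leakage).
    scrubbed = unlearn_classes(copy.deepcopy(clip_model), dataset.class_names)
    # 2. Adapt with k labeled examples per class, as in standard few-shot CLIP.
    support = dataset.sample_support(k=k_shot, seed=seed)
    adapted = few_shot_adapt(scrubbed, support)
    # 3. Test accuracy now reflects truly inductive generalization.
    return evaluate(adapted, dataset.test_split)
```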
Related papers
- Simulating Training Data Leakage in Multiple-Choice Benchmarks for LLM Evaluation
We compare existing leakage detection techniques, namely permutation and n-gram-based methods. Our analysis shows that the n-gram method consistently achieves the highest F1-score. We create cleaned versions of MMLU and HellaSwag, and re-evaluate several LLMs.
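A self-contained sketch of what an n-gram-based leakage check like the one summarized above might look like; the choice of n and the overlap threshold are illustrative assumptions, not the paper's settings.

```python
# Flag a benchmark item as potentially leaked if enough of its n-grams
# appear in the training corpus. Thresholds here are illustrative.
def ngrams(text, n=5):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_leaked(question, training_corpus, n=5, overlap_threshold=0.5):
    q_grams = ngrams(question, n)
    if not q_grams:
        return False
    corpus_grams = ngrams(training_corpus, n)
    overlap = len(q_grams & corpus_grams) / len(q_grams)
    return overlap >= overlap_threshold
```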
arXiv Detail & Related papers (2025-05-30T06:37:39Z)
- Benchmarking Counterfactual Interpretability in Deep Learning Models for Time Series Classification
Counterfactual (CF) methods are used to identify minimal changes in instances to alter the model predictions.
Despite extensive research, no existing work benchmarks CF methods in the time series domain.
In this work, we redesign quantitative metrics to accurately capture desirable characteristics in CFs.
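As a concrete illustration of the counterfactual idea above, here is a common nearest-unlike-neighbour baseline: return the closest reference series that the classifier labels differently. This is a generic example, not one of the benchmarked methods.

```python
# Nearest-unlike-neighbour counterfactual: the closest candidate series
# receiving a different label from the classifier. `predict` is any
# callable mapping a series to a label.
import numpy as np

def nearest_unlike_neighbour(query, candidates, predict):
    query_label = predict(query)
    unlike = [c for c in candidates if predict(c) != query_label]
    if not unlike:
        return None  # no counterfactual available in the reference set
    distances = [np.linalg.norm(query - c) for c in unlike]
    return unlike[int(np.argmin(distances))]
```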
arXiv Detail & Related papers (2024-08-22T18:17:26Z)
- DeCoOp: Robust Prompt Tuning with Out-of-Distribution Detection
We present a novel prompt tuning approach, namely, Decomposed Context Optimization (DeCoOp), which introduces new-class detectors and sub-classifiers to further enhance the base-class and new-class discriminability.
Experimental results on 11 benchmark datasets validate the effectiveness of DePT and demonstrate that DeCoOp outperforms current state-of-the-art methods, providing a significant 2% average accuracy improvement.
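A rough sketch of the detect-then-classify routing such an approach implies: a new-class detector decides whether an image belongs to the base classes before classification. The max-softmax detector and threshold below are illustrative stand-ins, not DeCoOp's actual detectors.

```python
# Illustrative routing only: confident base-class predictions are kept,
# otherwise the image is deferred to the new-class head.
import numpy as np

def decomposed_predict(base_logits, new_logits, threshold=0.5):
    z = base_logits - base_logits.max()        # numerically stable softmax
    base_probs = np.exp(z) / np.exp(z).sum()
    if base_probs.max() >= threshold:          # confident: treat as base class
        return "base", int(base_probs.argmax())
    return "new", int(np.argmax(new_logits))   # otherwise defer to new classes
```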
arXiv Detail & Related papers (2024-06-01T07:46:42Z)
- AMU-Tuning: Effective Logit Bias for CLIP-based Few-shot Learning
We first introduce a unified formulation to analyze CLIP-based few-shot learning methods from a perspective of logit bias.
Based on analysis of key components, this paper proposes a novel AMU-Tuning method to learn effective logit bias for CLIP-based few-shot classification.
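The logit-bias view can be written down compactly: the final few-shot logits are the frozen CLIP zero-shot logits plus a bias predicted from auxiliary features. The linear map W and scaling alpha below are assumptions for illustration, not AMU-Tuning's exact parameterization.

```python
# Logit-bias formulation, schematically: final = zero-shot + alpha * bias.
import numpy as np

def biased_logits(zero_shot_logits, aux_features, W, alpha=1.0):
    logit_bias = aux_features @ W      # bias predicted from auxiliary features
    return zero_shot_logits + alpha * logit_bias
```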
arXiv Detail & Related papers (2024-04-13T10:46:11Z)
- Transductive Zero-Shot and Few-Shot CLIP
This paper addresses the transductive zero-shot and few-shot CLIP classification challenge.
Inference is performed jointly across a mini-batch of unlabeled query samples, rather than treating each instance independently.
Our approach yields near 20% improvement in ImageNet accuracy over CLIP's zero-shot performance.
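To illustrate the transductive idea of joint inference over a batch, here is a generic Sinkhorn-style balancing step that lets batch-level class statistics inform each prediction; it assumes roughly balanced classes and is not the paper's exact algorithm.

```python
# Generic transductive adjustment: alternately balance class mass over the
# batch and renormalize each image's distribution.
import numpy as np

def transductive_adjust(probs, n_iters=10):
    p = probs.copy()                          # shape: (batch, num_classes)
    for _ in range(n_iters):
        p = p / p.sum(axis=0, keepdims=True)  # balance class mass over the batch
        p = p / p.sum(axis=1, keepdims=True)  # renormalize each image's row
    return p
```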
arXiv Detail & Related papers (2024-04-08T12:44:31Z)
- Rethinking Few-shot 3D Point Cloud Semantic Segmentation
This paper revisits few-shot 3D point cloud semantic segmentation (FS-PCS).
We focus on two significant issues in the state-of-the-art: foreground leakage and sparse point distribution.
To address these issues, we introduce a standardized FS-PCS setting, upon which a new benchmark is built.
arXiv Detail & Related papers (2024-03-01T15:14:47Z)
- A Hard-to-Beat Baseline for Training-free CLIP-based Adaptation
Contrastive Language-Image Pretraining (CLIP) has gained popularity for its remarkable zero-shot capacity.
Recent research has focused on developing efficient fine-tuning methods to enhance CLIP's performance in downstream tasks.
We revisit a classical algorithm, Gaussian Discriminant Analysis (GDA), and apply it to the downstream classification of CLIP.
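A compact version of such a GDA baseline: fit one Gaussian per class with a shared covariance on few-shot features, which yields a linear classifier in closed form. The regularization below is an illustrative choice.

```python
# Gaussian Discriminant Analysis on few-shot features: shared covariance
# across classes gives linear decision boundaries in closed form.
import numpy as np

def fit_gda(features, labels, num_classes, reg=1e-4):
    d = features.shape[1]
    means = np.stack([features[labels == c].mean(axis=0)
                      for c in range(num_classes)])
    centered = features - means[labels]
    cov = centered.T @ centered / len(features) + reg * np.eye(d)
    precision = np.linalg.inv(cov)
    W = means @ precision                                   # class weights
    b = -0.5 * np.einsum("cd,dk,ck->c", means, precision, means)  # class biases
    return W, b  # predict with argmax over features @ W.T + b
```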
arXiv Detail & Related papers (2024-02-06T15:45:27Z)
- Accounting for multiplicity in machine learning benchmark performance
State-of-the-art (SOTA) performance refers to the highest performance achieved by some model on a test sample. We argue that SOTA should instead be estimated by the expected performance of the best classifier.
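A quick simulation makes the multiplicity point concrete: with many equally good models evaluated on one finite test set, the best observed accuracy systematically exceeds every model's true accuracy. All numbers below are illustrative.

```python
# Simulate 50 models with identical true accuracy on a 1000-item test set;
# taking the max over models inflates the reported "SOTA".
import numpy as np

rng = np.random.default_rng(0)
true_acc, n_test, n_models = 0.85, 1000, 50
measured = rng.binomial(n_test, true_acc, size=(10000, n_models)) / n_test
print("reported SOTA (mean of max):", measured.max(axis=1).mean())  # > 0.85
print("true accuracy of every model:", true_acc)
```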
arXiv Detail & Related papers (2023-03-10T10:32:18Z)
- Bias Mimicking: A Simple Sampling Approach for Bias Mitigation
We introduce a new class-conditioned sampling method: Bias Mimicking.
Bias Mimicking improves the accuracy of sampling methods on underrepresented groups by 3% across four benchmarks.
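A schematic reading of class-conditioned subsampling in code: for one class, subsample its examples so the bias-attribute distribution mimics a target distribution. This is a simplified illustration, not the paper's full procedure.

```python
# Simplified class-conditioned subsampling: keep, per bias-attribute value,
# a fraction of this class's examples matching a target distribution.
import numpy as np

def mimic_subsample(indices, bias_attrs, target_attr_dist, rng):
    keep = []
    for attr, target_frac in target_attr_dist.items():
        group = indices[bias_attrs == attr]
        n_keep = int(round(target_frac * len(indices)))
        keep.extend(rng.choice(group, size=min(n_keep, len(group)),
                               replace=False))
    return np.array(keep)
```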
arXiv Detail & Related papers (2022-09-30T17:33:00Z)
- Learning to Select Base Classes for Few-shot Classification
We use the Similarity Ratio as an indicator for the generalization performance of a few-shot model.
We then formulate the base class selection problem as a submodular optimization problem over Similarity Ratio.
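Since the objective is submodular, the standard greedy routine applies; gain below is a hypothetical stand-in for the marginal Similarity Ratio improvement from adding one base class, and the budget is assumed not to exceed the candidate pool.

```python
# Greedy maximization for a monotone submodular objective: repeatedly add
# the candidate base class with the largest marginal gain.
def greedy_select(candidates, budget, gain):
    selected = set()
    for _ in range(budget):
        best = max((c for c in candidates if c not in selected),
                   key=lambda c: gain(selected, c))
        selected.add(best)
    return selected
```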
arXiv Detail & Related papers (2020-04-01T09:55:18Z)
- Frustratingly Simple Few-Shot Object Detection
We find that fine-tuning only the last layer of existing detectors on rare classes is crucial to the few-shot object detection task.
Such a simple approach outperforms the meta-learning methods by roughly 2-20 points on current benchmarks.
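The recipe in code, on a generic torch classifier rather than an actual detector: freeze every pretrained parameter and optimize only the final layer.

```python
# Freeze all pretrained weights; train only the last layer on novel classes.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(512, 256), nn.ReLU(),   # stands in for the pretrained backbone
    nn.Linear(256, 20),               # final layer, re-trained for novel classes
)
for p in model.parameters():
    p.requires_grad = False
for p in model[-1].parameters():      # unfreeze only the last layer
    p.requires_grad = True

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
```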
arXiv Detail & Related papers (2020-03-16T00:29:14Z)