Rethinking Few-Shot Adaptation of Vision-Language Models in Two Stages
- URL: http://arxiv.org/abs/2503.11609v1
- Date: Fri, 14 Mar 2025 17:24:01 GMT
- Title: Rethinking Few-Shot Adaptation of Vision-Language Models in Two Stages
- Authors: Matteo Farina, Massimiliano Mancini, Giovanni Iacca, Elisa Ricci
- Abstract summary: In Few-Shot Adaptation (FSA), data are insufficient to fit a large number of parameters, rendering the above impractical. We call this scheme Two-Stage Few-Shot Adaptation (2SFS). We show that 2SFS matches or surpasses the state-of-the-art, while established methods degrade significantly across settings.
- Score: 28.834044800595716
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: An old-school recipe for training a classifier is to (i) learn a good feature extractor and (ii) optimize a linear layer atop. When only a handful of samples are available per category, as in Few-Shot Adaptation (FSA), data are insufficient to fit a large number of parameters, rendering the above impractical. This is especially true with large pre-trained Vision-Language Models (VLMs), which motivated successful research at the intersection of Parameter-Efficient Fine-tuning (PEFT) and FSA. In this work, we start by analyzing the learning dynamics of PEFT techniques when trained on few-shot data from only a subset of categories, referred to as the ``base'' classes. We show that such dynamics naturally splits into two distinct phases: (i) task-level feature extraction and (ii) specialization to the available concepts. To accommodate this dynamic, we then depart from prompt- or adapter-based methods and tackle FSA differently. Specifically, given a fixed computational budget, we split it to (i) learn a task-specific feature extractor via PEFT and (ii) train a linear classifier on top. We call this scheme Two-Stage Few-Shot Adaptation (2SFS). Differently from established methods, our scheme enables a novel form of selective inference at a category level, i.e., at test time, only novel categories are embedded by the adapted text encoder, while embeddings of base categories are available within the classifier. Results with fixed hyperparameters across two settings, three backbones, and eleven datasets, show that 2SFS matches or surpasses the state-of-the-art, while established methods degrade significantly across settings.
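To make the split concrete, here is a minimal, self-contained sketch of the two-stage idea as the abstract describes it; it is not the authors' implementation. The TinyEncoder stubs stand in for a CLIP-like image/text encoder pair, and the PEFT choice (tuning only LayerNorm parameters), the even split of the iteration budget, the 0.07 temperature, and all shapes are illustrative assumptions.

```python
# Minimal sketch of the two-stage scheme described above -- NOT the authors' code.
# Assumptions: a CLIP-like model exposed as separate image/text encoders (here tiny
# stand-ins), LayerNorm-only tuning as the PEFT choice, a 50/50 split of the
# iteration budget, and a fixed 0.07 temperature. All of these are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Stand-in for a pre-trained image or text encoder."""
    def __init__(self, in_dim, out_dim=64):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
        self.norm = nn.LayerNorm(out_dim)

    def forward(self, x):
        return F.normalize(self.norm(self.proj(x)), dim=-1)

image_enc, text_enc = TinyEncoder(128), TinyEncoder(32)

# Dummy few-shot data: 4 base classes x 4 shots, plus one text feature per class name.
base_images = torch.randn(16, 128)
base_labels = torch.arange(4).repeat_interleave(4)
base_class_text = torch.randn(4, 32)          # stand-in for tokenized base-class prompts

# Stage 1: task-level feature extraction via PEFT (first slice of the budget).
peft_params = [p for n, p in list(image_enc.named_parameters())
               + list(text_enc.named_parameters()) if "norm" in n]
opt = torch.optim.AdamW(peft_params, lr=1e-3)
for _ in range(50):
    logits = image_enc(base_images) @ text_enc(base_class_text).t() / 0.07
    loss = F.cross_entropy(logits, base_labels)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: specialization -- fit a linear classifier atop the frozen, adapted extractor.
classifier = nn.Linear(64, 4, bias=False)
with torch.no_grad():
    feats = image_enc(base_images)                        # cached base-class features
    classifier.weight.copy_(text_enc(base_class_text))    # init from adapted text embeddings
opt = torch.optim.AdamW(classifier.parameters(), lr=1e-3)
for _ in range(50):
    loss = F.cross_entropy(classifier(feats) / 0.07, base_labels)
    opt.zero_grad(); loss.backward(); opt.step()

# Selective inference: base-class scores come from the stored classifier rows, so only
# novel class names need a forward pass through the adapted text encoder.
novel_class_text = torch.randn(2, 32)                     # stand-in for two novel-class prompts
with torch.no_grad():
    all_weights = torch.cat([F.normalize(classifier.weight, dim=-1),
                             text_enc(novel_class_text)])
    preds = (image_enc(torch.randn(8, 128)) @ all_weights.t()).argmax(dim=-1)
    print(preds)                                          # indices over 4 base + 2 novel classes
```

The final block mirrors the selective-inference property: embeddings of base categories live inside the trained classifier, so at test time only the novel class names are pushed through the adapted text encoder.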
Related papers
- Label Anything: An Interpretable, High-Fidelity and Prompt-Free Annotator [29.2532061585323]
Traditional manual labeling incurs a high cost to annotate the vast amounts of data required for training robust models. We propose a Label Anything Model (LAM) serving as an interpretable, high-fidelity, and prompt-free data annotator. LAM can generate high-fidelity annotations (almost 100% in mIoU) for multiple real-world datasets.
arXiv Detail & Related papers (2025-02-05T08:14:52Z) - UIFormer: A Unified Transformer-based Framework for Incremental Few-Shot Object Detection and Instance Segmentation [38.331860053615955]
This paper introduces a novel framework for unified incremental few-shot object detection (iFSOD) and instance segmentation (iFSIS) using the Transformer architecture.
Our goal is to create an optimal solution for situations where only a few examples of novel object classes are available.
arXiv Detail & Related papers (2024-11-13T12:29:44Z) - Language-aware Multiple Datasets Detection Pretraining for DETRs [4.939595148195813]
We propose a framework for utilizing Multiple datasets to pretrain DETR-like detectors, termed METR.
It converts the typical multi-classification in object detection into binary classification by introducing a pre-trained language model.
We show that METR achieves extraordinary results under both multi-task joint training and the pretrain & finetune paradigm.
arXiv Detail & Related papers (2023-04-07T10:34:04Z) - Strong Baselines for Parameter Efficient Few-Shot Fine-tuning [50.83426196335385]
Few-shot classification (FSC) entails learning novel classes given only a few examples per class after a pre-training (or meta-training) phase.
Recent works have shown that simply fine-tuning a pre-trained Vision Transformer (ViT) on new test classes is a strong approach for FSC.
Fine-tuning ViTs, however, is expensive in time, compute and storage.
This has motivated the design of parameter efficient fine-tuning (PEFT) methods which fine-tune only a fraction of the Transformer's parameters.
arXiv Detail & Related papers (2023-04-04T16:14:39Z) - First Session Adaptation: A Strong Replay-Free Baseline for Class-Incremental Learning [26.88977803220915]
First Session Adaptation (FSA) adapts a pre-trained neural network body only on the first learning session and fixes it thereafter.
FSA significantly improves over the state-of-the-art in 15 of the 16 settings considered.
We propose a measure that can be applied to a set of unlabelled inputs which is predictive of the benefits of body adaptation.
arXiv Detail & Related papers (2023-03-23T11:54:41Z) - CAFA: Class-Aware Feature Alignment for Test-Time Adaptation [50.26963784271912]
Test-time adaptation (TTA) addresses distribution shift by adapting a model to unlabeled data at test time.
We propose a simple yet effective feature alignment loss, termed Class-Aware Feature Alignment (CAFA), which encourages a model to learn target representations in a class-discriminative manner.
arXiv Detail & Related papers (2022-06-01T03:02:07Z) - Sylph: A Hypernetwork Framework for Incremental Few-shot Object Detection [8.492340530784697]
We show that finetune-free iFSD can be highly effective when a large number of base categories with abundant data are available for meta-training.
We benchmark our model on both COCO and LVIS, reporting as high as 17% AP on the long-tail rare classes on LVIS.
arXiv Detail & Related papers (2022-03-25T20:39:00Z) - CIM: Class-Irrelevant Mapping for Few-Shot Classification [58.02773394658623]
Few-shot classification (FSC) has attracted significant attention in recent years.
How to appraise the pre-trained FEM is a crucial open question in the FSC community.
We propose a simple, flexible method dubbed Class-Irrelevant Mapping (CIM).
arXiv Detail & Related papers (2021-09-07T03:26:24Z) - No Fear of Heterogeneity: Classifier Calibration for Federated Learning with Non-IID Data [78.69828864672978]
A central challenge in training classification models in the real-world federated system is learning with non-IID data.
We propose a novel and simple algorithm called Classifier Calibration with Virtual Representations (CCVR), which adjusts the classifier using virtual representations sampled from an approximated Gaussian mixture model.
Experimental results demonstrate that CCVR achieves state-of-the-art performance on popular federated learning benchmarks including CIFAR-10, CIFAR-100, and CINIC-10.
arXiv Detail & Related papers (2021-06-09T12:02:29Z) - Improving Calibration for Long-Tailed Recognition [68.32848696795519]
We propose two methods to improve calibration and performance in long-tailed recognition scenarios.
For dataset bias due to different samplers, we propose shifted batch normalization.
Our proposed methods set new records on multiple popular long-tailed recognition benchmark datasets.
arXiv Detail & Related papers (2021-04-01T13:55:21Z) - Contrastive Prototype Learning with Augmented Embeddings for Few-Shot Learning [58.2091760793799]
We propose a novel contrastive prototype learning with augmented embeddings (CPLAE) model.
With a class prototype as an anchor, CPL aims to pull the query samples of the same class closer and those of different classes further away.
Extensive experiments on several benchmarks demonstrate that our proposed CPLAE achieves new state-of-the-art.
arXiv Detail & Related papers (2021-01-23T13:22:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.