Black Box Few-Shot Adaptation for Vision-Language models
- URL: http://arxiv.org/abs/2304.01752v3
- Date: Thu, 17 Aug 2023 17:22:41 GMT
- Title: Black Box Few-Shot Adaptation for Vision-Language models
- Authors: Yassine Ouali, Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos
- Abstract summary: Vision-Language (V-L) models trained with contrastive learning to align the visual and language modalities have been shown to be strong few-shot learners.
We describe a black-box method for V-L few-shot adaptation that operates on pre-computed image and text features.
We propose Linear Feature Alignment (LFA), a simple linear approach for V-L re-alignment in the target domain.
- Score: 41.49584259596654
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language (V-L) models trained with contrastive learning to align the
visual and language modalities have been shown to be strong few-shot learners.
Soft prompt learning is the method of choice for few-shot downstream adaptation
aiming to bridge the modality gap caused by the distribution shift induced by
the new domain. While parameter-efficient, prompt learning still requires
access to the model weights and can be computationally infeasible for large
models with billions of parameters. To address these shortcomings, in this
work, we describe a black-box method for V-L few-shot adaptation that (a)
operates on pre-computed image and text features and hence works without access
to the model's weights, (b) is orders of magnitude faster at training time,
(c) is amenable to both supervised and unsupervised training, and (d) can
even be used to align image and text features computed from uni-modal models.
To achieve this, we propose Linear Feature Alignment (LFA), a simple linear
approach for V-L re-alignment in the target domain. LFA is initialized from a
closed-form solution to a least-squares problem and is then iteratively
updated by minimizing a re-ranking loss. Despite its simplicity, our approach
can even surpass soft-prompt learning methods as shown by extensive experiments
on 11 image and 2 video datasets.
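
The closed-form initialization mentioned in the abstract can be illustrated with a minimal sketch. It assumes plain ridge-regularized least squares on pre-computed features; the abstract does not state the exact objective or constraints the authors use, and the re-ranking refinement step is not shown. The function names `fit_linear_alignment` and `predict`, the `ridge` parameter, and the synthetic features are all hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_linear_alignment(image_feats, labels, text_feats, ridge=1e-3):
    """Closed-form ridge least squares (an assumption, not the paper's exact recipe).

    Finds W minimizing ||X W - Y||^2 + ridge * ||W||^2, where X holds the
    pre-computed image features and Y holds the text feature of each
    image's class.
    """
    X = image_feats              # shape (n, d)
    Y = text_feats[labels]       # shape (n, d): one class text feature per image
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ Y)


def predict(image_feats, text_feats, W):
    """Classify by cosine similarity between mapped image features and text features."""
    mapped = image_feats @ W
    mapped = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    text = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    return (mapped @ text.T).argmax(axis=1)


# Synthetic stand-in for features from a frozen (black-box) model:
# image features are a linearly mixed, noisy copy of their class text feature.
rng = np.random.default_rng(0)
num_classes, dim, shots = 5, 16, 16
text_feats = rng.normal(size=(num_classes, dim))
mixing = rng.normal(size=(dim, dim))  # simulates misalignment between modalities


def sample_images(labels):
    noise = 0.05 * rng.normal(size=(len(labels), dim))
    return text_feats[labels] @ mixing + noise


train_labels = np.repeat(np.arange(num_classes), shots)
test_labels = np.repeat(np.arange(num_classes), 20)
train_feats = sample_images(train_labels)
test_feats = sample_images(test_labels)

W = fit_linear_alignment(train_feats, train_labels, text_feats)
acc = float((predict(test_feats, text_feats, W) == test_labels).mean())
print(f"held-out accuracy after alignment: {acc:.2f}")
```

Because the sketch only consumes feature arrays, it never touches model weights, which is the sense in which the adaptation is black-box. In this sketch the mapping is a single d-by-d matrix obtained from one linear solve.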
Related papers
- When Does Visual Prompting Outperform Linear Probing for Vision-Language Models? A Likelihood Perspective [57.05315507519704]
We propose a log-likelihood ratio (LLR) approach to analyze the comparative benefits of visual prompting and linear probing.
Our measure attains up to a 100-fold reduction in run time compared to full training, while achieving prediction accuracies up to 91%.
arXiv Detail & Related papers (2024-09-03T12:03:45Z)
- Efficient and Long-Tailed Generalization for Pre-trained Vision-Language Model [43.738677778740325]
We propose a novel framework, termed Candle, to achieve efficient and long-tailed generalization.
Candle achieves state-of-the-art performance over extensive experiments on 11 diverse datasets.
arXiv Detail & Related papers (2024-06-18T14:07:13Z)
- Make Prompts Adaptable: Bayesian Modeling for Vision-Language Prompt Learning with Data-Dependent Prior [14.232144691524528]
Recent Vision-Language Pretrained models have become the backbone for many downstream tasks.
MLE training can lead the context vector to over-fit dominant image features in the training data.
This paper presents a Bayesian framework for prompt learning that can alleviate overfitting in few-shot learning applications.
arXiv Detail & Related papers (2024-01-09T10:15:59Z)
- VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort to efficient adaptations of existing models, and propose to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- Multi-Modal Few-Shot Temporal Action Detection [157.96194484236483]
Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection to new classes.
We introduce a new multi-modality few-shot (MMFS) TAD problem, which can be considered as a marriage of FS-TAD and ZS-TAD.
arXiv Detail & Related papers (2022-11-27T18:13:05Z)
- SVL-Adapter: Self-Supervised Adapter for Vision-Language Pretrained Models [9.017387427570538]
Vision-language models such as CLIP are pretrained on large volumes of internet sourced image and text pairs.
Due to their size, fine-tuning these models on new datasets can be prohibitively expensive, both in terms of the supervision and compute required.
We present a new approach called SVL-Adapter that combines the complementary strengths of both vision-language pretraining and self-supervised representation learning.
arXiv Detail & Related papers (2022-10-07T19:35:08Z)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.