Related papers: Black Box Few-Shot Adaptation for Vision-Language models

Black Box Few-Shot Adaptation for Vision-Language models

URL: http://arxiv.org/abs/2304.01752v3
Date: Thu, 17 Aug 2023 17:22:41 GMT
Title: Black Box Few-Shot Adaptation for Vision-Language models
Authors: Yassine Ouali, Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos
Abstract summary: Vision-Language (V-L) models trained with contrastive learning to align the visual and language modalities have been shown to be strong few-shot learners. We describe a black-box method for V-L few-shot adaptation that operates on pre-computed image and text features. We propose Linear Feature Alignment (LFA), a simple linear approach for V-L re-alignment in the target domain.
Score: 41.49584259596654
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language (V-L) models trained with contrastive learning to align the visual and language modalities have been shown to be strong few-shot learners. Soft prompt learning is the method of choice for few-shot downstream adaptation aiming to bridge the modality gap caused by the distribution shift induced by the new domain. While parameter-efficient, prompt learning still requires access to the model weights and can be computationally infeasible for large models with billions of parameters. To address these shortcomings, in this work, we describe a black-box method for V-L few-shot adaptation that (a) operates on pre-computed image and text features and hence works without access to the model's weights, (b) it is orders of magnitude faster at training time, (c) it is amenable to both supervised and unsupervised training, and (d) it can be even used to align image and text features computed from uni-modal models. To achieve this, we propose Linear Feature Alignment (LFA), a simple linear approach for V-L re-alignment in the target domain. LFA is initialized from a closed-form solution to a least-squares problem and then it is iteratively updated by minimizing a re-ranking loss. Despite its simplicity, our approach can even surpass soft-prompt learning methods as shown by extensive experiments on 11 image and 2 video datasets.

Related papers

Blind Inverse Problem Solving Made Easy by Text-to-Image Latent Diffusion [95.91087143020644]
We present LADiBI, a training-free framework that uses large-scale text-to-image diffusion models to solve blind inverse problems. By leveraging natural language prompts, LADiBI jointly models priors for both the target image and operator, allowing for flexible adaptation across a variety of tasks. Our experiments show that LADiBI is capable of solving a broad range of image restoration tasks, including both linear and nonlinear problems.
arXiv Detail & Related papers (2024-11-30T18:55:01Z)
Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance [67.26434607115392]
Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks. LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension. We propose LACING to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG)
arXiv Detail & Related papers (2024-11-21T16:33:30Z)
Efficient and Long-Tailed Generalization for Pre-trained Vision-Language Model [43.738677778740325]
We propose a novel framework to achieve efficient and long-tailed generalization, which can be termed as Candle. Candle achieves state-of-the-art performance over extensive experiments on 11 diverse datasets.
arXiv Detail & Related papers (2024-06-18T14:07:13Z)
Make Prompts Adaptable: Bayesian Modeling for Vision-Language Prompt Learning with Data-Dependent Prior [14.232144691524528]
Recent Vision-Language Pretrained models have become the backbone for many downstream tasks. MLE training can lead the context vector to over-fit dominant image features in the training data. This paper presents a Bayesian-based framework of prompt learning, which could alleviate the overfitting issues on few-shot learning application.
arXiv Detail & Related papers (2024-01-09T10:15:59Z)
VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons. We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z)
eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort to efficient adaptations of existing models, and propose to augment Language Models with perception. Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency. We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
Multi-Modal Few-Shot Temporal Action Detection [157.96194484236483]
Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection to new classes. We introduce a new multi-modality few-shot (MMFS) TAD problem, which can be considered as a marriage of FS-TAD and ZS-TAD.
arXiv Detail & Related papers (2022-11-27T18:13:05Z)
SVL-Adapter: Self-Supervised Adapter for Vision-Language Pretrained Models [9.017387427570538]
Vision-language models such as CLIP are pretrained on large volumes of internet sourced image and text pairs. Due to their size, fine-tuning these models on new datasets can be prohibitively expensive, both in terms of the supervision and compute required. We present a new approach called SVL-Adapter that combines the complementary strengths of both vision-language pretraining and self-supervised representation learning.
arXiv Detail & Related papers (2022-10-07T19:35:08Z)
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation. It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.