ECO: Ensembling Context Optimization for Vision-Language Models
- URL: http://arxiv.org/abs/2307.14063v1
- Date: Wed, 26 Jul 2023 09:31:06 GMT
- Title: ECO: Ensembling Context Optimization for Vision-Language Models
- Authors: Lorenzo Agnolucci, Alberto Baldrati, Francesco Todino, Federico
Becattini, Marco Bertini, Alberto Del Bimbo
- Abstract summary: We show that learning diverse and possibly shorter contexts considerably and consistently improves results.
We report better few-shot capabilities with no additional cost at inference time.
- Score: 22.32996522125523
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image recognition has recently witnessed a paradigm shift, where
vision-language models are now used to perform few-shot classification based on
textual prompts. Among these, the CLIP model has shown remarkable capabilities
for zero-shot transfer by matching an image and a custom textual prompt in its
latent space. This has paved the way for several works that focus on
engineering or learning textual contexts for maximizing CLIP's classification
capabilities. In this paper, we follow this trend by learning an ensemble of
prompts for image classification. We show that learning diverse and possibly
shorter contexts considerably and consistently improves results compared to
relying on a single trainable prompt. In particular, we report better few-shot
capabilities with no additional cost at inference time. We demonstrate the
capabilities of our approach on 11 different benchmarks.
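
As a rough illustration of the ensemble-of-contexts idea described in the abstract, the sketch below keeps a small set of learnable prompt contexts, prepends each to the class-name token embeddings, and averages the resulting class embeddings. The `text_encoder`, `class_name_embeds`, and the embedding dimension are hypothetical stand-ins for a frozen CLIP-like text tower, not code from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptEnsemble(nn.Module):
    """Sketch of an ensemble of learnable prompt contexts for a CLIP-like classifier."""

    def __init__(self, text_encoder, class_name_embeds, n_prompts=4, n_ctx=4, dim=512):
        super().__init__()
        self.text_encoder = text_encoder                              # hypothetical frozen text tower
        self.register_buffer("class_name_embeds", class_name_embeds)  # (n_classes, n_name_tokens, dim)
        # One short learnable context per ensemble member; only these parameters train.
        self.contexts = nn.Parameter(0.02 * torch.randn(n_prompts, n_ctx, dim))

    def class_prototypes(self):
        protos = []
        n_cls = self.class_name_embeds.shape[0]
        for ctx in self.contexts:                      # each ensemble member
            ctx_rep = ctx.unsqueeze(0).expand(n_cls, -1, -1)
            tokens = torch.cat([ctx_rep, self.class_name_embeds], dim=1)
            protos.append(F.normalize(self.text_encoder(tokens), dim=-1))
        # Averaging the members' class embeddings collapses the ensemble into one
        # prototype per class.
        return F.normalize(torch.stack(protos).mean(dim=0), dim=-1)

    def forward(self, image_features, logit_scale=100.0):
        img = F.normalize(image_features, dim=-1)
        return logit_scale * img @ self.class_prototypes().t()
```

Because the ensemble members are averaged into a single prototype per class, the prototypes can be precomputed once, which is consistent with the claim of no additional cost at inference time.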
Related papers
- Ranking-aware adapter for text-driven image ordering with CLIP [76.80965830448781]
We propose an effective yet efficient approach that reframes the CLIP model into a learning-to-rank task.
Our approach incorporates learnable prompts to adapt to new instructions for ranking purposes.
Our ranking-aware adapter consistently outperforms fine-tuned CLIPs on various tasks.
arXiv Detail & Related papers (2024-12-09T18:51:05Z) - Grounding Descriptions in Images informs Zero-Shot Visual Recognition [47.66166611138081]
- Grounding Descriptions in Images informs Zero-Shot Visual Recognition [47.66166611138081]
We propose GRAIN, a new pretraining strategy aimed at aligning representations at both fine and coarse levels simultaneously.
We demonstrate the enhanced zero-shot performance of our model compared to current state-of-the-art methods.
arXiv Detail & Related papers (2024-12-05T18:52:00Z) - Vision-Language Consistency Guided Multi-modal Prompt Learning for Blind AI Generated Image Quality Assessment [57.07360640784803]
We propose vision-language consistency guided multi-modal prompt learning for blind AI-generated image quality assessment (AGIQA).
Specifically, we introduce learnable textual and visual prompts in language and vision branches of Contrastive Language-Image Pre-training (CLIP) models.
We design a text-to-image alignment quality prediction task, whose learned vision-language consistency knowledge is used to guide the optimization of the above multi-modal prompts.
arXiv Detail & Related papers (2024-06-24T13:45:31Z) - SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
- SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks like detection and segmentation.
arXiv Detail & Related papers (2023-10-20T08:44:47Z) - Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP [84.90129481336659]
- Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP [84.90129481336659]
We study transferable representation learning underlying CLIP and demonstrate how features from different modalities are aligned.
Inspired by our analysis, we propose a new CLIP-type approach, which achieves better performance than CLIP and other state-of-the-art methods on benchmark datasets.
arXiv Detail & Related papers (2023-10-02T06:41:30Z) - CoPL: Contextual Prompt Learning for Vision-Language Understanding [21.709017504227823]
We propose a Contextual Prompt Learning (CoPL) framework, capable of aligning the prompts to the localized features of the image.
Our key innovations over earlier works include using local image features as part of the prompt learning process, and more crucially, learning to weight these prompts based on local features that are appropriate for the task at hand.
Our method produces substantially improved performance compared to current state-of-the-art methods.
arXiv Detail & Related papers (2023-07-03T10:14:33Z) - Task-Oriented Multi-Modal Mutual Leaning for Vision-Language Models [52.3032592038514]
- Task-Oriented Multi-Modal Mutual Leaning for Vision-Language Models [52.3032592038514]
We propose a class-aware text prompt to enrich generated prompts with label-related image information.
We achieve an average improvement of 4.03% on new classes and 3.19% on the harmonic mean over eleven classification benchmarks.
arXiv Detail & Related papers (2023-03-30T06:02:40Z) - Zero-shot Image Captioning by Anchor-augmented Vision-Language Space
Alignment [23.072180427273544]
We observe that directly employing CLIP for zero-shot image captioning relies more on the textual modality in context and largely ignores the visual information.
To address this, we propose Cross-modal Language Models (CLMs) to facilitate unsupervised cross-modal learning.
Experiments on MS COCO and Flickr30K validate the promising performance of the proposed approach in terms of both captioning quality and computational efficiency.
arXiv Detail & Related papers (2022-11-14T11:12:19Z) - Distinctive Image Captioning via CLIP Guided Group Optimization [13.102953452346297]
In this paper, we focus on generating distinctive captions that can distinguish the target image from other similar images.
We introduce a series of metrics that use the large-scale vision-language pre-training model CLIP to quantify distinctiveness.
We propose a simple and effective training strategy that trains the model by comparing the target image with a group of similar images and optimizing the group embedding gap.
arXiv Detail & Related papers (2022-08-08T16:37:01Z)