AAPL: Adding Attributes to Prompt Learning for Vision-Language Models
- URL: http://arxiv.org/abs/2404.16804v1
- Date: Thu, 25 Apr 2024 17:51:10 GMT
- Title: AAPL: Adding Attributes to Prompt Learning for Vision-Language Models
- Authors: Gahyeon Kim, Sohee Kim, Seokju Lee
- Abstract summary: We propose adversarial token embedding to disentangle low-level visual augmentation features from high-level class information when inducing bias in learnable prompts.
We have conducted experiments across 11 datasets, and overall, AAPL shows favorable performance compared to existing methods in few-shot learning, zero-shot learning, cross-dataset, and domain generalization tasks.
- Score: 6.32186874112557
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in large pre-trained vision-language models have demonstrated remarkable performance on zero-shot downstream tasks. Building upon this, recent studies, such as CoOp and CoCoOp, have proposed the use of prompt learning, where context within a prompt is replaced with learnable vectors, leading to significant improvements over manually crafted prompts. However, the performance improvement for unseen classes is still marginal, and to tackle this problem, data augmentation has been frequently used in traditional zero-shot learning techniques. Through our experiments, we have identified important issues in CoOp and CoCoOp: the context learned through traditional image augmentation is biased toward seen classes, negatively impacting generalization to unseen classes. To address this problem, we propose adversarial token embedding to disentangle low-level visual augmentation features from high-level class information when inducing bias in learnable prompts. Through our novel mechanism called "Adding Attributes to Prompt Learning", AAPL, we guide the learnable context to effectively extract text features by focusing on high-level features for unseen classes. We have conducted experiments across 11 datasets, and overall, AAPL shows favorable performances compared to the existing methods in few-shot learning, zero-shot learning, cross-dataset, and domain generalization tasks.
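The abstract points to two ingredients: CoOp-style learnable context vectors prepended to class-name embeddings, and an "adversarial token embedding" that keeps the learned context from absorbing low-level augmentation information. A common way to realize such an adversarial objective is a gradient-reversal branch; whether AAPL uses this exact mechanism is not stated in the abstract, so the sketch below is a minimal illustration under that assumption, with dimensions, the frozen-encoder stand-ins, and the augmentation-type head chosen purely for demonstration.

```python
# Illustrative sketch only -- not the authors' implementation.
# It combines (a) CoOp-style learnable context vectors and (b) a
# gradient-reversal branch that discourages the context from encoding
# augmentation-specific (low-level) information.
import torch
import torch.nn as nn


class GradientReversal(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the shared context.
        return -ctx.lambd * grad_output, None


class PromptLearner(nn.Module):
    def __init__(self, n_ctx=16, dim=512, n_aug_types=4):
        super().__init__()
        # Learnable context vectors shared across classes (as in CoOp).
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        # Adversarial head: tries to predict which augmentation was applied.
        self.aug_head = nn.Linear(dim, n_aug_types)

    def forward(self, class_token_embeds, lambd=1.0):
        # class_token_embeds: (n_classes, n_name_tokens, dim), frozen embeddings
        n_classes = class_token_embeds.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n_classes, -1, -1)
        # Full prompts: learnable context followed by class-name tokens.
        prompts = torch.cat([ctx, class_token_embeds], dim=1)
        # Gradient reversal pushes the shared context away from features that
        # help the augmentation classifier, i.e. away from low-level cues.
        pooled_ctx = GradientReversal.apply(self.ctx.mean(dim=0, keepdim=True), lambd)
        aug_logits = self.aug_head(pooled_ctx)
        return prompts, aug_logits


# Toy usage with random stand-ins for frozen text-encoder token embeddings.
learner = PromptLearner()
class_embeds = torch.randn(10, 4, 512)      # 10 classes, 4 name tokens each
prompts, aug_logits = learner(class_embeds)
print(prompts.shape, aug_logits.shape)      # (10, 20, 512) and (1, 4)
```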
Related papers
- Active Prompt Learning with Vision-Language Model Priors [9.173468790066956]
We introduce a class-guided clustering that leverages the pre-trained image and text encoders of vision-language models.
We propose a budget-saving selective querying strategy based on adaptive class-wise thresholds.
arXiv Detail & Related papers (2024-11-23T02:34:33Z)
- Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation [14.225723195634941]
We propose a novel approach to prompt learning based on unsupervised knowledge distillation from more powerful models.
Our approach, which we call Knowledge Distillation Prompt Learning (KDPL), can be integrated into existing prompt learning techniques.
arXiv Detail & Related papers (2024-07-03T12:24:40Z)
- IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning [94.52149969720712]
IntCoOp learns to jointly align attribute-level inductive biases and class embeddings during prompt-tuning.
IntCoOp improves CoOp by 7.35% in average performance across 10 diverse datasets.
arXiv Detail & Related papers (2024-06-19T16:37:31Z)
- Can Better Text Semantics in Prompt Tuning Improve VLM Generalization? [28.041879000565874]
We introduce a prompt-tuning method that leverages class descriptions obtained from Large Language Models.
Our approach constructs part-level description-guided image and text features, which are subsequently aligned to learn more generalizable prompts.
Our comprehensive experiments conducted across 11 benchmark datasets show that our method outperforms established methods.
arXiv Detail & Related papers (2024-05-13T16:52:17Z)
- Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models [64.24227572048075]
We propose a Knowledge-Aware Prompt Tuning (KAPT) framework for vision-language models.
Our approach takes inspiration from human intelligence, which typically draws on external knowledge when recognizing novel categories of objects.
arXiv Detail & Related papers (2023-08-22T04:24:45Z)
- DPL: Decoupled Prompt Learning for Vision-Language Models [41.90997623029582]
We propose a new method, Decoupled Prompt Learning, which reformulates the attention in prompt learning to alleviate this problem.
Our approach is flexible for both visual and textual modalities, making it easily extendable to multi-modal prompt learning.
arXiv Detail & Related papers (2023-08-19T15:48:38Z)
- Prompting classes: Exploring the Power of Prompt Class Learning in Weakly Supervised Semantic Segmentation [15.467510304266883]
We study the impact of prompt tuning on weakly supervised semantic segmentation.
We introduce a novel approach based on a PrOmpt cLass lEarning (POLE) strategy.
We demonstrate that our simple yet efficient approach achieves SOTA performance on a well-known WSSS benchmark.
arXiv Detail & Related papers (2023-06-30T19:25:18Z)
- Fairness-guided Few-shot Prompting for Large Language Models [93.05624064699965]
In-context learning can suffer from high instability due to variations in training examples, example order, and prompt formats.
We introduce a metric to evaluate the predictive bias of a fixed prompt against labels or given attributes.
We propose a novel search strategy based on greedy search to identify a near-optimal prompt for improving the performance of in-context learning.
arXiv Detail & Related papers (2023-03-23T12:28:25Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative, task-specific visual features by combining a vision-specific contrastive loss, a cross-modal contrastive loss, and implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features complement the cross-modal features well, improving few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
- Understanding and Mitigating Overfitting in Prompt Tuning for Vision-Language Models [108.13378788663196]
We propose Subspace Prompt Tuning (SubPT) to project the back-propagated gradients onto the low-rank subspace spanned by the eigenvectors of the early-stage gradient flow throughout training (a minimal projection sketch appears after this list).
We equip CoOp with Novel Learner Feature (NFL) to enhance the generalization ability of the learned prompts onto novel categories beyond the training set.
arXiv Detail & Related papers (2022-11-04T02:06:22Z)
- Conditional Prompt Learning for Vision-Language Models [107.06776396086471]
A recently proposed method named Context Optimization (CoOp) turns context words in a prompt into a set of learnable vectors.
We propose Conditional Context Optimization (CoCoOp), which extends CoOp by further learning a lightweight neural network to generate an input-conditional token for each image.
Our experiments show that CoCoOp generalizes much better than CoOp to unseen classes, even showing promising transferability beyond a single dataset.
arXiv Detail & Related papers (2022-03-10T18:59:41Z)
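The SubPT entry above describes projecting gradients onto a low-rank subspace derived from early-stage training. Below is a minimal sketch of that idea under stated assumptions: gradients are flattened into vectors, the subspace basis comes from an SVD of a handful of early-stage gradients, and the rank k is an illustrative choice rather than the paper's exact procedure.

```python
# Minimal sketch of gradient projection onto a low-rank subspace,
# in the spirit of the SubPT entry above; not the paper's exact recipe.
import numpy as np


def subspace_basis(early_grads, k=4):
    """Top-k left singular vectors of stacked early-stage gradients (d x k)."""
    G = np.stack(early_grads, axis=1)              # shape (d, n_early)
    U, _, _ = np.linalg.svd(G, full_matrices=False)
    return U[:, :k]                                # orthonormal basis columns


def project(grad, basis):
    """Project a gradient onto span(basis): g -> B (B^T g)."""
    return basis @ (basis.T @ grad)


# Toy usage: 128-dim parameters, 10 early gradients define the subspace.
rng = np.random.default_rng(0)
early = [rng.standard_normal(128) for _ in range(10)]
B = subspace_basis(early, k=4)
g = rng.standard_normal(128)
g_proj = project(g, B)      # later-stage gradient restricted to the subspace
print(g_proj.shape)         # (128,)
```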