Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models
- URL: http://arxiv.org/abs/2308.11186v1
- Date: Tue, 22 Aug 2023 04:24:45 GMT
- Title: Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models
- Authors: Baoshuo Kan, Teng Wang, Wenpeng Lu, Xiantong Zhen, Weili Guan, Feng
Zheng
- Abstract summary: We propose a Knowledge-Aware Prompt Tuning (KAPT) framework for vision-language models.
Our approach takes inspiration from human intelligence in which external knowledge is usually incorporated into recognizing novel categories of objects.
- Score: 64.24227572048075
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained vision-language models, e.g., CLIP, working with manually
designed prompts have demonstrated a great capacity for transfer learning.
Recently, learnable prompts have achieved state-of-the-art performance; however,
they are prone to overfitting to seen classes and fail to generalize to unseen classes.
In this paper, we propose a Knowledge-Aware Prompt Tuning (KAPT) framework for
vision-language models. Our approach takes inspiration from human intelligence
in which external knowledge is usually incorporated into recognizing novel
categories of objects. Specifically, we design two complementary types of
knowledge-aware prompts for the text encoder to leverage the distinctive
characteristics of category-related external knowledge. The discrete prompt
extracts the key information from descriptions of an object category, and the
learned continuous prompt captures overall contexts. We further design an
adaptation head for the visual encoder to aggregate salient attentive visual
cues, which establishes discriminative and task-aware visual representations.
We conduct extensive experiments on 11 widely used benchmark datasets, and the
results verify the effectiveness of KAPT in few-shot image classification,
especially in generalizing to unseen categories. Compared with the
state-of-the-art CoCoOp method, KAPT exhibits favorable performance and achieves
an absolute gain of 3.22% on new classes and 2.57% in terms of harmonic mean.
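The abstract pairs a discrete prompt distilled from external category descriptions with a learned continuous prompt on the text side, and reports base/new generalization via the harmonic mean. Below is a minimal, self-contained sketch of that combination; the toy vocabulary, the keyword-extraction heuristic, and the placeholder encoder are illustrative assumptions and are not taken from the KAPT implementation.
```python
import torch
import torch.nn as nn

EMBED_DIM = 512   # CLIP-like text feature width
N_CTX = 4         # number of learned continuous context vectors

# Hypothetical external knowledge: one short description per category
# (stand-in for the category-related knowledge KAPT draws on).
EXTERNAL_KNOWLEDGE = {
    "snow leopard": "large cat with thick smoky grey fur and dark rosettes",
    "otterhound": "large rough coated hound with webbed feet bred to hunt otters",
}

def discrete_prompt(category: str, max_keywords: int = 5) -> list[str]:
    """Crude stand-in for extracting key information from a description:
    keep the longest distinct words as the discrete knowledge prompt."""
    words = EXTERNAL_KNOWLEDGE[category].split()
    return sorted(set(words), key=lambda w: (-len(w), w))[:max_keywords]

class ToyTextEncoder(nn.Module):
    """Placeholder for CLIP's text encoder: embeds tokens, prepends learned
    continuous context vectors, and mean-pools into a single text feature."""
    def __init__(self, vocab):
        super().__init__()
        self.stoi = {w: i for i, w in enumerate(vocab)}
        self.embed = nn.Embedding(len(vocab), EMBED_DIM)
        # Learned continuous prompt, optimized during few-shot tuning.
        self.ctx = nn.Parameter(0.02 * torch.randn(N_CTX, EMBED_DIM))

    def forward(self, tokens):
        ids = torch.tensor([self.stoi[t] for t in tokens])
        # [continuous context; discrete knowledge prompt; class-name tokens]
        seq = torch.cat([self.ctx, self.embed(ids)], dim=0)
        return seq.mean(dim=0)  # pooled feature (real CLIP pools at the EOT token)

def harmonic_mean(base_acc: float, new_acc: float) -> float:
    """Base/new generalization metric used in the abstract's comparison."""
    return 2 * base_acc * new_acc / (base_acc + new_acc)

if __name__ == "__main__":
    vocab = sorted({w for d in EXTERNAL_KNOWLEDGE.values() for w in d.split()}
                   | {w for c in EXTERNAL_KNOWLEDGE for w in c.split()})
    encoder = ToyTextEncoder(vocab)
    tokens = discrete_prompt("snow leopard") + "snow leopard".split()
    text_feature = encoder(tokens)
    print(text_feature.shape)                    # torch.Size([512])
    print(round(harmonic_mean(0.80, 0.70), 3))   # 0.747
```
In prompt-tuning methods of this family, typically only the context vectors (plus a light adaptation head on the visual side) are updated during few-shot tuning, while the pre-trained encoders stay frozen.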
Related papers
- IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning [94.52149969720712]
IntCoOp learns to jointly align attribute-level inductive biases and class embeddings during prompt-tuning.
IntCoOp improves CoOp by 7.35% in average performance across 10 diverse datasets.
arXiv Detail & Related papers (2024-06-19T16:37:31Z)
- AAPL: Adding Attributes to Prompt Learning for Vision-Language Models [6.32186874112557]
We propose adversarial token embedding to disentangle low-level visual augmentation features from high-level class information when inducing bias in learnable prompts.
We have conducted experiments across 11 datasets, and overall, AAPL shows favorable performances compared to the existing methods in few-shot learning, zero-shot learning, cross-dataset, and domain generalization tasks.
arXiv Detail & Related papers (2024-04-25T17:51:10Z)
- Boosting Audio-visual Zero-shot Learning with Large Language Models [32.533844163120875]
We introduce a framework called KnowleDge-Augmented audio-visual learning (KDA).
Our proposed KDA can outperform state-of-the-art methods on three popular audio-visual zero-shot learning datasets.
arXiv Detail & Related papers (2023-11-21T01:18:23Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
- K-LITE: Learning Transferable Visual Models with External Knowledge [242.3887854728843]
K-LITE (Knowledge-augmented Language-Image Training and Evaluation) is a strategy to leverage external knowledge to build transferable visual systems.
In training, it enriches entities in natural language with WordNet and Wiktionary knowledge.
In evaluation, the natural language is also augmented with external knowledge and
then used to reference learned visual concepts; a short WordNet-gloss sketch of
this kind of enrichment appears after this list.
arXiv Detail & Related papers (2022-04-20T04:47:01Z)
- VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning [113.50220968583353]
We propose to discover semantic embeddings containing discriminative visual properties for zero-shot learning.
Our model visually divides a set of images from seen classes into clusters of local image regions according to their visual similarity.
We demonstrate that our visually-grounded semantic embeddings further improve performance over word embeddings across various ZSL models by a large margin.
arXiv Detail & Related papers (2022-03-20T03:49:02Z)
- SEGA: Semantic Guided Attention on Visual Prototype for Few-Shot Learning [85.2093650907943]
We propose SEmantic Guided Attention (SEGA) to teach machines to recognize a new category.
SEGA uses semantic knowledge to guide the visual perception in a top-down manner about what visual features should be paid attention to.
We show that our semantic guided attention realizes the anticipated function and outperforms state-of-the-art methods.
arXiv Detail & Related papers (2021-11-08T08:03:44Z)
- Class Knowledge Overlay to Visual Feature Learning for Zero-Shot Image Classification [18.299463254965264]
We propose GAN-CST, a novel zero-shot learning approach based on overlaying class knowledge onto visual feature learning.
The proposed model delivers superior performance over state-of-the-art approaches.
arXiv Detail & Related papers (2021-02-26T06:34:35Z)
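As referenced in the K-LITE entry above, the sketch below illustrates the general flavor of enriching class-name prompts with dictionary knowledge, here using WordNet glosses via NLTK; the prompt template and the fallback behavior are assumptions for illustration, not K-LITE's actual pipeline.
```python
# Illustrative sketch: augmenting class-name prompts with WordNet glosses,
# in the spirit of knowledge-enriched prompts (not K-LITE's actual pipeline).
# Requires: pip install nltk, then a one-time nltk.download("wordnet").
from nltk.corpus import wordnet as wn

def enrich_prompt(class_name: str) -> str:
    """Append the first WordNet gloss of the class name to a CLIP-style prompt;
    fall back to the plain prompt when no synset is found."""
    synsets = wn.synsets(class_name.replace(" ", "_"))
    if not synsets:
        return f"a photo of a {class_name}"
    return f"a photo of a {class_name}, {synsets[0].definition()}"

if __name__ == "__main__":
    for name in ["otter", "kayak", "espresso"]:
        print(enrich_prompt(name))  # each prompt extended with its first gloss
```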