Concept-Guided Prompt Learning for Generalization in Vision-Language Models
- URL: http://arxiv.org/abs/2401.07457v1
- Date: Mon, 15 Jan 2024 04:04:47 GMT
- Title: Concept-Guided Prompt Learning for Generalization in Vision-Language Models
- Authors: Yi Zhang, Ce Zhang, Ke Yu, Yushun Tang, Zhihai He
- Abstract summary: We propose Concept-Guided Prompt Learning for vision-language models.
We leverage the well-learned knowledge of Contrastive Language-Image Pretraining to create a visual concept cache.
In order to refine the text features, we develop a projector that transforms multi-level visual features into text features.
- Score: 33.361744437967126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Language-Image Pretraining (CLIP) model has exhibited remarkable
efficacy in establishing cross-modal connections between texts and images,
yielding impressive performance across a broad spectrum of downstream
applications through fine-tuning. However, for generalization tasks, the
current fine-tuning methods for CLIP, such as CoOp and CoCoOp, demonstrate
relatively low performance on some fine-grained datasets. We recognize the
underlying reason is that these previous methods only projected global features
into the prompt, neglecting the various visual concepts, such as colors,
shapes, and sizes, which are naturally transferable across domains and play a
crucial role in generalization tasks. To address this issue, in this work, we
propose Concept-Guided Prompt Learning (CPL) for vision-language models.
Specifically, we leverage the well-learned knowledge of CLIP to create a visual
concept cache to enable concept-guided prompting. In order to refine the text
features, we further develop a projector that transforms multi-level visual
features into text features. We observe that this concept-guided prompt
learning approach is able to achieve enhanced consistency between visual and
linguistic modalities. Extensive experimental results demonstrate that our CPL
method significantly improves generalization capabilities compared to the
current state-of-the-art methods.
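To make the described pipeline concrete, below is a minimal, hypothetical PyTorch sketch of the two components the abstract names: a visual concept cache built from frozen CLIP image features, and a projector that maps multi-level visual features into the text-feature space. All class names, dimensions, and the cache-construction strategy are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a visual concept cache and a visual-to-text projector.
# Names, shapes, and the cache-building strategy are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConceptCache:
    """Stores L2-normalized CLIP image features as concept keys with learnable prompt values."""

    def __init__(self, num_concepts: int, feat_dim: int, prompt_dim: int):
        self.num_concepts = num_concepts
        self.keys = torch.empty(0, feat_dim)  # concept keys, filled from frozen CLIP features
        self.values = nn.Parameter(torch.randn(num_concepts, prompt_dim) * 0.02)  # concept-guided prompts

    @torch.no_grad()
    def build(self, clip_image_features: torch.Tensor):
        # A real implementation would cluster the frozen CLIP features into concept centroids;
        # here we simply keep the first `num_concepts` normalized features as keys.
        feats = F.normalize(clip_image_features, dim=-1)
        self.keys = feats[: self.num_concepts]

    def lookup(self, image_features: torch.Tensor, top_k: int = 3) -> torch.Tensor:
        # Retrieve the prompts whose concept keys are most similar to the query image features.
        sims = F.normalize(image_features, dim=-1) @ self.keys.t()   # (B, num_concepts)
        weights, idx = sims.topk(top_k, dim=-1)                      # (B, top_k)
        weights = weights.softmax(dim=-1).unsqueeze(-1)              # (B, top_k, 1)
        return (weights * self.values[idx]).sum(dim=1)               # (B, prompt_dim)


class MultiLevelProjector(nn.Module):
    """Maps concatenated multi-level visual features into the CLIP text-feature space."""

    def __init__(self, visual_dims: list[int], text_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(sum(visual_dims), text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, multi_level_feats: list[torch.Tensor]) -> torch.Tensor:
        return self.proj(torch.cat(multi_level_feats, dim=-1))


if __name__ == "__main__":
    cache = ConceptCache(num_concepts=16, feat_dim=512, prompt_dim=512)
    cache.build(torch.randn(100, 512))             # placeholder for frozen CLIP image features
    prompts = cache.lookup(torch.randn(4, 512))    # concept-guided prompts for a batch of 4 images
    projector = MultiLevelProjector([256, 512, 1024], text_dim=512)
    refined = projector([torch.randn(4, 256), torch.randn(4, 512), torch.randn(4, 1024)])
    print(prompts.shape, refined.shape)            # torch.Size([4, 512]) torch.Size([4, 512])
```

In a full system of this kind, the retrieved concept prompts would condition the text encoder and the projected multi-level features would refine the text features before the usual CLIP image-text similarity scoring.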
Related papers
- Conceptual Codebook Learning for Vision-Language Models [27.68834532978939]
We propose Conceptual Codebook Learning (CoCoLe) to address the challenge of improving the generalization capability of vision-language models (VLMs).
We learn a conceptual codebook consisting of visual concepts as keys and conceptual prompts as values.
We observe that this conceptual codebook learning method is able to achieve enhanced alignment between visual and linguistic modalities.
arXiv Detail & Related papers (2024-07-02T15:16:06Z)
- UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding [90.74967596080982]
This paper extends Contrastive Language-Image Pre-training (CLIP) with multi-granularity alignment.
We develop a Unified Multi-Granularity learning framework, termed UMG-CLIP, which simultaneously empowers the model with versatile perception abilities.
With parameter efficient tuning, UMG-CLIP surpasses current widely used CLIP variants and achieves state-of-the-art performance on diverse image understanding benchmarks.
arXiv Detail & Related papers (2024-01-12T06:35:09Z)
- Towards More Unified In-context Visual Understanding [74.55332581979292]
We present a new ICL framework for visual understanding that enables multi-modal output.
First, we quantize and embed both text and visual prompts into a unified representational space.
Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them.
arXiv Detail & Related papers (2023-12-05T06:02:21Z)
- Improving In-Context Learning in Diffusion Models with Visual Context-Modulated Prompts [83.03471704115786]
We introduce improved Prompt Diffusion (iPromptDiff) in this study.
iPromptDiff integrates an end-to-end trained vision encoder that converts visual context into an embedding vector.
We show that a diffusion-based vision foundation model, when equipped with this visual context-modulated text guidance and a standard ControlNet structure, exhibits versatility and robustness across a variety of training tasks.
arXiv Detail & Related papers (2023-12-03T14:15:52Z)
- CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts [11.752632557524969]
We propose contrastive learning with data augmentation to disentangle content features from the original representations.
Our experiments across diverse datasets demonstrate significant improvements in zero-shot and few-shot classification tasks.
arXiv Detail & Related papers (2023-11-28T03:00:59Z)
- DPL: Decoupled Prompt Learning for Vision-Language Models [41.90997623029582]
We propose a new method, Decoupled Prompt Learning, which reformulates the attention in prompt learning to alleviate this problem.
Our approach is flexible for both visual and textual modalities, making it easily extendable to multi-modal prompt learning.
arXiv Detail & Related papers (2023-08-19T15:48:38Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
- CPL: Counterfactual Prompt Learning for Vision and Language Models [76.18024920393245]
This paper presents a novel Counterfactual Prompt Learning (CPL) method for vision and language models.
CPL simultaneously employs counterfactual generation and contrastive learning in a joint optimization framework.
Experiments demonstrate that CPL can obtain superior few-shot performance on different vision and language tasks.
arXiv Detail & Related papers (2022-10-19T08:06:39Z)
- Contrastive Language-Image Pre-Training with Knowledge Graphs [33.211811772961234]
We propose a knowledge-based pre-training framework, dubbed Knowledge-CLIP, which injects semantic information into the widely used CLIP model.
Our model can semantically align the representations in vision and language with higher quality, and enhance the reasoning ability across scenarios and modalities.
arXiv Detail & Related papers (2022-10-17T09:49:22Z)
- MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)