Context-Aware Prompt Tuning for Vision-Language Model with
Dual-Alignment
- URL: http://arxiv.org/abs/2309.04158v1
- Date: Fri, 8 Sep 2023 06:51:15 GMT
- Title: Context-Aware Prompt Tuning for Vision-Language Model with
Dual-Alignment
- Authors: Hongyu Hu, Tiancheng Lin, Jie Wang, Zhenbang Sun, Yi Xu
- Abstract summary: We introduce a novel method to improve the prompt learning of vision-language models by incorporating pre-trained large language models (LLMs)
With DuAl-PT, we propose to learn more context-aware prompts, benefiting from both explicit and implicit context modeling.
Empirically, DuAl-PT achieves superior performance on 11 downstream datasets for few-shot recognition and base-to-new generalization.
- Score: 15.180715595425864
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale vision-language models (VLMs), e.g., CLIP, learn broad visual
concepts from massive training data and show strong generalization ability.
A number of prompt learning methods have been proposed to efficiently adapt
VLMs to downstream tasks with only a few training samples. We introduce a novel
method to improve the prompt learning of vision-language models by
incorporating pre-trained large language models (LLMs), called Dual-Aligned
Prompt Tuning (DuAl-PT). Learnable prompts, as in CoOp, model context implicitly
through end-to-end training and are therefore difficult to control and interpret.
Explicit context descriptions generated by LLMs such as GPT-3 can be used directly
for zero-shot classification, but such prompts rely heavily on the LLM and remain
underexplored in few-shot settings. With DuAl-PT, we
propose to learn more context-aware prompts, benefiting from both explicit and
implicit context modeling. To achieve this, we introduce a pre-trained LLM to
generate context descriptions, and we encourage the prompts to learn from the
LLM's knowledge through an alignment objective, together with an alignment between
the prompts and local image features. Empirically, DuAl-PT achieves superior
performance on 11 downstream datasets for few-shot recognition and base-to-new
generalization. We hope DuAl-PT can serve as a strong baseline. Code will be
available.
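The abstract names two alignment signals but gives no implementation details, so the snippet below is only a hedged PyTorch sketch of how the dual alignment could be wired up, not the authors' code. The class name `DualAlignedPrompts`, the placeholder projection standing in for a frozen CLIP text encoder, and the loss weights `w_llm`, `w_local` and temperature `tau` are assumptions for illustration.

```python
# Hedged sketch only (not the released DuAl-PT code): learnable prompt tokens are
# aligned both to LLM-generated class descriptions (explicit context) and to
# local image features (implicit context), on top of a CLIP-style classification loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAlignedPrompts(nn.Module):
    def __init__(self, num_classes: int, prompt_len: int = 4, dim: int = 512):
        super().__init__()
        # Learnable context vectors, CoOp-style (one shared prompt for simplicity).
        self.prompt = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)
        # Placeholder "text encoder": in practice this would be CLIP's frozen text
        # encoder consuming [prompt tokens; class-name tokens].
        self.text_proj = nn.Linear(dim, dim, bias=False)
        self.class_embed = nn.Parameter(torch.randn(num_classes, dim) * 0.02)

    def class_features(self) -> torch.Tensor:
        # Pool prompt tokens, add the class embedding, and project: a stand-in for
        # encoding "[V1]...[Vm] <class>" with a frozen text encoder.
        ctx = self.prompt.mean(dim=0, keepdim=True)                          # (1, dim)
        return F.normalize(self.text_proj(ctx + self.class_embed), dim=-1)  # (C, dim)

def dual_alignment_loss(model, image_feats, local_feats, llm_desc_feats, labels,
                        tau: float = 0.07, w_llm: float = 1.0, w_local: float = 1.0):
    """image_feats:    (B, dim)    global image features (frozen image encoder)
       local_feats:    (B, P, dim) patch/local features from the same encoder
       llm_desc_feats: (C, dim)    encoded LLM-generated class descriptions
       labels:         (B,)        ground-truth class indices"""
    text_feats = model.class_features()                     # (C, dim)
    image_feats = F.normalize(image_feats, dim=-1)
    llm_desc_feats = F.normalize(llm_desc_feats, dim=-1)

    # (1) CLIP-style classification loss on global image features.
    logits = image_feats @ text_feats.t() / tau
    cls_loss = F.cross_entropy(logits, labels)

    # (2) Explicit-context alignment: pull each prompt-conditioned class feature
    #     toward the LLM description feature of the same class (cosine distance).
    llm_loss = (1.0 - (text_feats * llm_desc_feats).sum(dim=-1)).mean()

    # (3) Implicit-context alignment: align the true class's feature with the
    #     image's local (patch) features via mean cosine similarity.
    local_feats = F.normalize(local_feats, dim=-1)           # (B, P, dim)
    picked = text_feats[labels]                               # (B, dim)
    local_loss = 1.0 - torch.einsum("bpd,bd->bp", local_feats, picked).mean()

    return cls_loss + w_llm * llm_loss + w_local * local_loss

if __name__ == "__main__":
    B, P, C, dim = 8, 49, 10, 512
    model = DualAlignedPrompts(num_classes=C, dim=dim)
    loss = dual_alignment_loss(
        model,
        image_feats=torch.randn(B, dim),
        local_feats=torch.randn(B, P, dim),
        llm_desc_feats=torch.randn(C, dim),
        labels=torch.randint(0, C, (B,)),
    )
    loss.backward()  # in practice only prompt-related parameters would be trainable
    print(float(loss))
```

In the actual method, the image features, local features, and LLM-description embeddings would come from frozen encoders, with class descriptions generated once per class by a model such as GPT-3; only the prompt-related parameters would be updated.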
Related papers
- Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance [67.26434607115392]
Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks.
LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension.
We propose LACING to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG)
arXiv Detail & Related papers (2024-11-21T16:33:30Z)
- Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP [24.22470408549266]
We dub the resulting prompt embedding the Aggregate-and-Adapted Prompt Embedding (AAPE).
AAPE is shown to be able to generalize to different downstream data distributions and tasks, including vision-language understanding tasks.
We also show AAPE is particularly helpful to handle non-canonical and OOD examples.
arXiv Detail & Related papers (2024-10-31T07:41:13Z)
- IPO: Interpretable Prompt Optimization for Vision-Language Models [40.83071220530289]
This paper introduces a simple but interpretable prompt optimizer (IPO).
IPO utilizes large language models (LLMs) to generate textual prompts dynamically.
We incorporate a large multimodal model (LMM) to condition on visual content by generating image descriptions.
arXiv Detail & Related papers (2024-10-20T14:10:22Z)
- Optimization of Prompt Learning via Multi-Knowledge Representation for Vision-Language Models [26.964848679914354]
CoKnow is a framework that enhances Prompt Learning for Vision-Language Models with rich contextual knowledge.
We conducted extensive experiments on 11 publicly available datasets, demonstrating that CoKnow outperforms a series of previous methods.
arXiv Detail & Related papers (2024-04-16T07:44:52Z)
- Vision-Language Models Provide Promptable Representations for Reinforcement Learning [67.40524195671479]
We propose a novel approach that uses the vast amounts of general and indexable world knowledge encoded in vision-language models (VLMs) pre-trained on Internet-scale data for embodied reinforcement learning (RL)
We show that our approach can use chain-of-thought prompting to produce representations of common-sense semantic reasoning, improving policy performance in novel scenes by 1.5 times.
arXiv Detail & Related papers (2024-02-05T00:48:56Z)
- Learning to Prompt with Text Only Supervision for Vision-Language Models [107.282881515667]
One branch of methods adapts CLIP by learning prompts using visual information.
An alternative approach resorts to training-free methods that generate class descriptions with large language models (a generic sketch of this description-based zero-shot idea follows the related-papers list below).
We propose to combine the strengths of both streams by learning prompts using only text data.
arXiv Detail & Related papers (2024-01-04T18:59:49Z)
- Generative Context-aware Fine-tuning of Self-supervised Speech Models [54.389711404209415]
We study the use of context information generated by generative large language models (LLMs).
We propose an approach to distill the generated information during fine-tuning of self-supervised speech models.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks: automatic speech recognition, named entity recognition, and sentiment analysis.
arXiv Detail & Related papers (2023-12-15T15:46:02Z)
- VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z)
- Large Language Models are Good Prompt Learners for Low-Shot Image Classification [12.053713356249695]
We propose LLaMP (Large Language Models as Prompt learners), which produces adaptive prompts for the CLIP text encoder.
Experiments show that, compared with other state-of-the-art prompt learning methods, LLaMP yields better performance on both zero-shot generalization and few-shot image classification.
arXiv Detail & Related papers (2023-12-07T06:43:34Z)
- MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
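As flagged in the "Learning to Prompt with Text Only Supervision" entry above, several of these works (and the DuAl-PT abstract itself) mention using LLM-generated class descriptions directly for zero-shot classification. The snippet below is a generic, hedged sketch of that description-ensemble idea, assuming per-class description embeddings have already been produced offline by a frozen text encoder; it is not any specific paper's released method, and the function name is invented.

```python
# Minimal, training-free sketch (generic description-ensemble baseline, not any
# paper's released code): classify an image by comparing its embedding against
# the averaged embeddings of LLM-generated descriptions for each class.
import torch
import torch.nn.functional as F

def zero_shot_from_descriptions(image_feat: torch.Tensor,
                                desc_feats_per_class: list) -> int:
    """image_feat:           (dim,) image embedding from a frozen vision encoder
       desc_feats_per_class: list of (n_i, dim) tensors, one per class, holding the
                             embeddings of that class's LLM-generated descriptions
       returns the index of the best-matching class."""
    image_feat = F.normalize(image_feat, dim=-1)
    scores = []
    for desc_feats in desc_feats_per_class:
        class_proto = F.normalize(desc_feats, dim=-1).mean(dim=0)  # average the descriptions
        class_proto = F.normalize(class_proto, dim=-1)             # re-normalize the prototype
        scores.append(torch.dot(image_feat, class_proto))
    return int(torch.stack(scores).argmax())

# Toy usage with random features (in practice these would come from CLIP's encoders
# and descriptions such as "a photo of a golden retriever, a dog with long golden fur").
if __name__ == "__main__":
    dim = 512
    image_feat = torch.randn(dim)
    desc_feats_per_class = [torch.randn(5, dim) for _ in range(10)]  # 10 classes, 5 descriptions each
    print(zero_shot_from_descriptions(image_feat, desc_feats_per_class))
```

Averaging the description embeddings into one prototype per class is the simplest ensemble choice; scoring each description separately and averaging the similarities is an equally common variant.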
This list is automatically generated from the titles and abstracts of the papers in this site.