LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of
Vision & Language Models
- URL: http://arxiv.org/abs/2210.01115v2
- Date: Sun, 2 Apr 2023 18:03:06 GMT
- Title: LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of
Vision & Language Models
- Authors: Adrian Bulat and Georgios Tzimiropoulos
- Abstract summary: We propose a novel Language-Aware Soft Prompting (LASP) learning method to alleviate base class overfitting.
LASP is inherently amenable to including, during training, virtual classes, i.e. class names for which no visual samples are available.
LASP matches and surpasses, for the first time, the accuracy on novel classes obtained by hand-crafted prompts and CLIP for 8 out of 11 test datasets.
- Score: 67.19124099815645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Soft prompt learning has recently emerged as one of the methods of choice for
adapting V&L models to a downstream task using a few training examples.
However, current methods significantly overfit the training data, suffering
from large accuracy degradation when tested on unseen classes from the same
domain. To this end, in this paper, we make the following 4 contributions: (1)
To alleviate base class overfitting, we propose a novel Language-Aware Soft
Prompting (LASP) learning method by means of a text-to-text cross-entropy loss
that maximizes the probability of the learned prompts to be correctly
classified with respect to pre-defined hand-crafted textual prompts. (2) To
increase the representation capacity of the prompts, we propose grouped LASP
where each group of prompts is optimized with respect to a separate subset of
textual prompts. (3) We identify a visual-language misalignment introduced by
prompt learning and LASP, and more importantly, propose a re-calibration
mechanism to address it. (4) We show that LASP is inherently amenable to
including, during training, virtual classes, i.e. class names for which no
visual samples are available, further increasing the robustness of the learned
prompts. Through evaluations on 11 datasets, we show that our approach (a)
significantly outperforms all prior works on soft prompting, and (b) matches
and surpasses, for the first time, the accuracy on novel classes obtained by
hand-crafted prompts and CLIP for 8 out of 11 test datasets. Code will be made
available at https://www.adrianbulat.com/lasp
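
To make contribution (1) above more concrete, the following is a minimal PyTorch sketch of a text-to-text cross-entropy of the kind the abstract describes: the text feature produced by the learned soft prompt for each class is classified against the hand-crafted text features of all classes. The function names, loss weight `alpha`, and temperature are illustrative assumptions rather than the authors' implementation; virtual classes (contribution 4) fit in simply by adding rows for class names that have no training images.

```python
# Minimal sketch of a LASP-style text-to-text loss, assuming a CLIP-style
# setup. Names (learned_txt, handcrafted_txt, alpha) are placeholders and
# not the authors' API.
import torch
import torch.nn.functional as F

def lasp_text_to_text_loss(learned_txt, handcrafted_txt, temperature=0.07):
    """learned_txt:     [C, D] text features from the learnable soft prompts,
                        one row per class (incl. optional virtual classes).
       handcrafted_txt: [C, D] text features from fixed hand-crafted templates
                        (e.g. "a photo of a {class}"), same class order."""
    learned_txt = F.normalize(learned_txt, dim=-1)
    handcrafted_txt = F.normalize(handcrafted_txt, dim=-1)
    # Classify each learned-prompt embedding against all hand-crafted class
    # embeddings; the correct "label" for class c is class c itself.
    logits = learned_txt @ handcrafted_txt.t() / temperature   # [C, C]
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

def lasp_total_loss(image_feats, learned_txt, handcrafted_txt, targets,
                    alpha=1.0, temperature=0.07):
    """Combines the usual image-to-text cross-entropy used in soft prompting
       with the text-to-text regularizer above; `alpha` weights the latter."""
    image_feats = F.normalize(image_feats, dim=-1)
    txt = F.normalize(learned_txt, dim=-1)
    vl_logits = image_feats @ txt.t() / temperature            # [B, C]
    loss_vl = F.cross_entropy(vl_logits, targets)
    loss_tt = lasp_text_to_text_loss(learned_txt, handcrafted_txt, temperature)
    return loss_vl + alpha * loss_tt
```

Note that the text-to-text term needs no images at all, which is why class names without visual samples (virtual classes) can participate in training.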
Related papers
- Mixture of Prompt Learning for Vision Language Models [12.828490399811376]
We propose a mixture-of-soft-prompts learning method that incorporates a routing module.
This module captures a dataset's varied styles and dynamically selects the most suitable prompts for each instance.
We also implement semantically grouped text-level supervision, initializing each soft prompt with the token embeddings of manually designed templates from its group.
arXiv Detail & Related papers (2024-09-18T14:25:02Z)
- Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning [50.26965628047682]
Adapting pre-trained models to open classes is a challenging problem in machine learning.
In this paper, we consider combining the advantages of both and propose a test-time prompt tuning approach.
Our proposed method outperforms all comparison methods on average considering both base and new classes.
arXiv Detail & Related papers (2024-08-29T12:34:01Z)
- Can Better Text Semantics in Prompt Tuning Improve VLM Generalization? [28.041879000565874]
We introduce a prompt-tuning method that leverages class descriptions obtained from Large Language Models.
Our approach constructs part-level description-guided image and text features, which are subsequently aligned to learn more generalizable prompts.
Our comprehensive experiments conducted across 11 benchmark datasets show that our method outperforms established methods.
arXiv Detail & Related papers (2024-05-13T16:52:17Z)
- Learning to Prompt with Text Only Supervision for Vision-Language Models [107.282881515667]
One branch of methods adapts CLIP by learning prompts using visual information.
An alternative approach resorts to training-free methods by generating class descriptions from large language models.
We propose to combine the strengths of both streams by learning prompts using only text data.
arXiv Detail & Related papers (2024-01-04T18:59:49Z)
- PRE: Vision-Language Prompt Learning with Reparameterization Encoder [24.855142164168605]
Large pre-trained vision-language models such as CLIP have demonstrated great potential in zero-shot transferability to downstream tasks.
To attain optimal performance, the manual selection of prompts is necessary to improve alignment between the downstream image distribution and the textual class descriptions.
To avoid non-trivial prompt engineering, recent work Context Optimization (CoOp) introduced the concept of prompt learning to the vision domain using learnable textual tokens.
arXiv Detail & Related papers (2023-09-14T14:48:01Z)
- CPL: Counterfactual Prompt Learning for Vision and Language Models [76.18024920393245]
This paper presents a novel Counterfactual Prompt Learning (CPL) method for vision and language models.
CPL simultaneously employs counterfactual generation and contrastive learning in a joint optimization framework.
Experiments demonstrate that CPL can obtain superior few-shot performance on different vision and language tasks.
arXiv Detail & Related papers (2022-10-19T08:06:39Z)
- MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic, which can be applied to arbitrary dense prediction systems and various pre-trained visual backbones.
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
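
The pixel-text matching described in the DenseCLIP entry above boils down to scoring every spatial location of the image features against every class text embedding. Below is a minimal sketch under that reading; the shapes, names, and temperature are illustrative assumptions, and DenseCLIP's context-aware prompting of the text encoder is omitted.

```python
# Minimal sketch of pixel-text score maps in the spirit of DenseCLIP.
# Shapes and names are illustrative, not taken from the DenseCLIP codebase.
import torch
import torch.nn.functional as F

def pixel_text_score_maps(dense_visual, text_embeds, temperature=0.07):
    """dense_visual: [B, D, H, W] per-pixel features from a CLIP-style
                     image encoder (before global pooling).
       text_embeds:  [C, D] class text embeddings from the text encoder.
       returns:      [B, C, H, W] score maps, usable as auxiliary pixel-level
                     supervision or as extra inputs to a dense prediction head."""
    B, D, H, W = dense_visual.shape
    v = F.normalize(dense_visual.flatten(2), dim=1)   # [B, D, H*W]
    t = F.normalize(text_embeds, dim=-1)              # [C, D]
    scores = torch.einsum('bdn,cd->bcn', v, t) / temperature
    return scores.view(B, -1, H, W)
```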