Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels?
- URL: http://arxiv.org/abs/2307.11978v1
- Date: Sat, 22 Jul 2023 04:20:30 GMT
- Title: Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels?
- Authors: Cheng-En Wu, Yu Tian, Haichao Yu, Heng Wang, Pedro Morgado, Yu Hen Hu,
Linjie Yang
- Abstract summary: A vision-language model can be adapted to a new classification task through few-shot prompt tuning.
We study the key reasons contributing to the robustness of the prompt tuning paradigm.
We demonstrate that noisy zero-shot predictions from CLIP can be used to tune its own prompt.
- Score: 41.56585313495218
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language models such as CLIP learn a generic text-image embedding from
large-scale training data. A vision-language model can be adapted to a new
classification task through few-shot prompt tuning. We find that such a prompt
tuning process is highly robust to label noise. This motivates us to study the
key reasons contributing to the robustness of the prompt tuning paradigm. We
conducted extensive experiments to explore this property and found the key
factors to be: 1) the fixed classname tokens provide strong regularization to
the optimization of the model, reducing the gradients induced by noisy samples;
2) the powerful pre-trained image-text embedding that is learned from diverse
and generic web data provides strong prior knowledge for image classification.
Further, we demonstrate that noisy zero-shot predictions from CLIP can be used
to tune its own prompt, significantly enhancing prediction accuracy in the
unsupervised setting. The code is available at https://github.com/CEWu/PTNL.
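To make the two factors above concrete, the following is a minimal, self-contained PyTorch sketch of CoOp-style prompt tuning in which only the shared context vectors are learned, while the classname token embeddings and both pre-trained encoders stay frozen. The tiny linear "encoders", toy dimensions, and variable names are illustrative stand-ins, not the authors' implementation from the linked repository; in the unsupervised setting described above, the labels could instead be CLIP's own zero-shot pseudo-labels.

```python
# Minimal sketch (not the authors' PTNL code): CoOp-style prompt tuning where
# only shared context vectors are trained and classname token embeddings stay
# fixed. The tiny linear "encoders" are frozen stand-ins for CLIP's towers,
# included only so the snippet runs end to end; all sizes are toy values.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n_cls, n_ctx, n_name_tok, dim = 5, 4, 1, 32          # toy sizes (real CLIP uses 512+ dims)

# Frozen stand-ins for the pre-trained image and text towers (assumption).
image_encoder = nn.Linear(64, dim).requires_grad_(False)
text_encoder = nn.Linear((n_ctx + n_name_tok) * dim, dim).requires_grad_(False)

# Fixed classname token embeddings; in CLIP these come from its token-embedding table.
classname_embeds = torch.randn(n_cls, n_name_tok, dim)

# The only learnable parameters: context vectors shared across all classes.
ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
optimizer = torch.optim.SGD([ctx], lr=0.1)

def class_text_features():
    # Prepend the learnable context to the frozen classname embedding of every class.
    prompts = torch.cat([ctx.expand(n_cls, -1, -1), classname_embeds], dim=1)
    return F.normalize(text_encoder(prompts.flatten(1)), dim=-1)   # [n_cls, dim]

# One training step on a (possibly noisily labeled) few-shot batch; in the
# unsupervised setting, `labels` could instead be CLIP zero-shot pseudo-labels.
images = torch.randn(8, 64)
labels = torch.randint(0, n_cls, (8,))
img_feat = F.normalize(image_encoder(images), dim=-1)
logits = 100.0 * img_feat @ class_text_features().t()              # scaled cosine similarity
loss = F.cross_entropy(logits, labels)
optimizer.zero_grad()
loss.backward()   # gradients reach only `ctx`; fixed classnames anchor each class prototype
optimizer.step()
```

Because the classname tokens are frozen, a mislabeled example can only nudge the shared context vectors, which limits how far noisy gradients can pull any single class prototype away from its pre-trained semantics.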
Related papers
- Vision-Language Models are Strong Noisy Label Detectors [76.07846780815794]
This paper presents a Denoising Fine-Tuning framework, called DeFT, for adapting vision-language models.
DeFT utilizes the robust alignment of textual and visual features pre-trained on millions of auxiliary image-text pairs to sieve out noisy labels (a simplified similarity-based sketch of this idea appears after this list).
Experimental results on seven synthetic and real-world noisy datasets validate the effectiveness of DeFT in both noisy label detection and image classification.
arXiv Detail & Related papers (2024-09-29T12:55:17Z)
- SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks like detection and segmentation.
arXiv Detail & Related papers (2023-10-20T08:44:47Z)
- Patch-Prompt Aligned Bayesian Prompt Tuning for Vision-Language Models [48.77653835765705]
We introduce a probabilistic approach to prompt tuning, where the label-specific prompts are generated hierarchically by first sampling a latent vector from an underlying distribution and then employing a lightweight generative model.
We evaluate the effectiveness of our approach on four tasks: few-shot image recognition, base-to-new generalization, dataset transfer learning, and domain shifts.
arXiv Detail & Related papers (2023-03-16T06:09:15Z)
- M-Tuning: Prompt Tuning with Mitigated Label Bias in Open-Set Scenarios [103.6153593636399]
We propose a vision-language prompt tuning method with mitigated label bias (M-Tuning).
It introduces open words from WordNet to extend the prompt texts beyond closed-set label words, so that prompts are tuned in a simulated open-set scenario.
Our method achieves the best performance on datasets with various scales, and extensive ablation studies also validate its effectiveness.
arXiv Detail & Related papers (2023-03-09T09:05:47Z)
- Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model [39.722927180264584]
We propose a novel Dual-modality Prompt Tuning (DPT) paradigm through learning text and visual prompts simultaneously.
To make the final image feature concentrate more on the target visual concept, a Class-Aware Visual Prompt Tuning scheme is proposed.
arXiv Detail & Related papers (2022-08-17T15:06:36Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
We present a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic, which can be applied to arbitrary dense prediction systems and various pre-trained visual backbones.
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
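Several of the entries above, DeFT in particular, lean on the same signal as the main paper: CLIP's pre-trained image-text alignment. The snippet below is a minimal sketch, using the openai `clip` package, of how zero-shot similarity can flag likely-mislabeled samples; the prompt template, class list, and 0.3 threshold are illustrative choices, not DeFT's actual detection rule.

```python
# Minimal sketch of similarity-based noisy-label detection with CLIP's zero-shot
# alignment (illustrative only; not the DeFT algorithm). Assumes the openai
# `clip` package; the prompt template and the 0.3 threshold are arbitrary choices.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

classnames = ["cat", "dog", "car"]                     # hypothetical label set
text = clip.tokenize([f"a photo of a {c}" for c in classnames]).to(device)

with torch.no_grad():
    text_feat = model.encode_text(text)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

def looks_noisy(image_path, given_label, threshold=0.3):
    """Flag a sample whose annotated label receives low zero-shot probability."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        probs = (100.0 * img_feat @ text_feat.T).softmax(dim=-1).squeeze(0)
    return probs[given_label].item() < threshold

# Example: flag the sample if CLIP assigns the annotated class "dog" (index 1)
# less than 30% zero-shot probability.
# suspicious = looks_noisy("example.jpg", given_label=1)
```

The same zero-shot probabilities can also serve as pseudo-labels for the unsupervised prompt tuning described in the main abstract, where CLIP's own noisy predictions are used to tune its prompt.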