Learning to Decompose Visual Features with Latent Textual Prompts
- URL: http://arxiv.org/abs/2210.04287v1
- Date: Sun, 9 Oct 2022 15:40:13 GMT
- Title: Learning to Decompose Visual Features with Latent Textual Prompts
- Authors: Feng Wang, Manling Li, Xudong Lin, Hairong Lv, Alexander G. Schwing, and Heng Ji
- Abstract summary: We propose Decomposed Feature Prompting (DeFo) to improve vision-language models.
Our empirical study shows that DeFo significantly improves vision-language models.
- Score: 140.2117637223449
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in pre-training vision-language models like CLIP have shown
great potential in learning transferable visual representations. Nonetheless,
for downstream inference, CLIP-like models suffer from either 1) degraded
accuracy and robustness when the text descriptions used for retrieval-based
inference are inaccurate (the challenge for the zero-shot protocol), or 2) a
break in the well-established vision-language alignment (the challenge for
linear probing). To address these issues, we propose Decomposed Feature
Prompting (DeFo). DeFo leverages a flexible number of learnable embeddings as
textual input while maintaining the vision-language dual-model architecture,
which enables the model to learn decomposed visual features with the help of
feature-level textual prompts. We further use an additional linear layer to
perform classification, allowing the number of language inputs to scale. Our
empirical study shows that DeFo significantly improves vision-language models.
For example, DeFo obtains 73.2% test accuracy on ImageNet with a ResNet-50
backbone without tuning any pretrained weights of either the vision or the
language encoder, outperforming zero-shot CLIP by a large margin of 15.0% and
outperforming the state-of-the-art vision-language prompt tuning method by 7.6%.
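The abstract's description maps naturally onto a small amount of code: frozen CLIP-style image and text towers, a set of learnable latent textual prompts, an image-to-prompt similarity vector as the decomposed feature, and a linear head over that vector. The PyTorch sketch below is only an illustration under these assumptions; the `image_encoder`/`text_encoder` interfaces (in particular, a text tower that accepts continuous token embeddings and returns pooled features) are hypothetical stand-ins, not the authors' implementation.

```python
# A minimal sketch of the DeFo idea described in the abstract: frozen
# CLIP-style dual encoders, learnable latent textual prompts, and a linear
# classification head over the image-to-prompt similarity scores.
# The encoder interfaces are assumptions for illustration only.

import torch
import torch.nn as nn
import torch.nn.functional as F


class DeFoSketch(nn.Module):
    def __init__(self, image_encoder, text_encoder,
                 num_prompts=64, prompt_len=16, embed_dim=512, num_classes=1000):
        super().__init__()
        self.image_encoder = image_encoder  # frozen image tower (assumed API)
        self.text_encoder = text_encoder    # frozen text tower assumed to accept
                                            # continuous token embeddings
        # Learnable latent textual prompts: num_prompts "sentences" of
        # prompt_len continuous token embeddings each.
        self.prompts = nn.Parameter(torch.randn(num_prompts, prompt_len, embed_dim) * 0.02)
        # Additional linear layer mapping the decomposed feature to class logits.
        self.classifier = nn.Linear(num_prompts, num_classes)
        # Both pretrained towers stay frozen; only prompts + classifier train.
        for p in self.image_encoder.parameters():
            p.requires_grad_(False)
        for p in self.text_encoder.parameters():
            p.requires_grad_(False)

    def forward(self, images):
        img_feat = F.normalize(self.image_encoder(images), dim=-1)       # (B, D)
        txt_feat = F.normalize(self.text_encoder(self.prompts), dim=-1)  # (N, D)
        # Decomposed visual feature: similarity of the image to each latent prompt.
        decomposed = img_feat @ txt_feat.t()                             # (B, N)
        return self.classifier(decomposed)                               # (B, C)
```

In this sketch only `self.prompts` and `self.classifier` receive gradients, consistent with the abstract's claim that no pretrained weights of either encoder are tuned, and the linear head classifies the `num_prompts`-dimensional similarity vector, so the number of textual queries can be chosen independently of the number of classes.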
Related papers
- SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks like detection and segmentation (a sketch of the general EMA self-distillation pattern appears after this list).
arXiv Detail & Related papers (2023-10-20T08:44:47Z)
- Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction [61.16125290912494]
$\text{EVL}_\text{Gen}$ is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
- CPL: Counterfactual Prompt Learning for Vision and Language Models [76.18024920393245]
This paper presents a novel Counterfactual Prompt Learning (CPL) method for vision and language models.
CPL simultaneously employs counterfactual generation and contrastive learning in a joint optimization framework.
Experiments demonstrate that CPL can obtain superior few-shot performance on different vision and language tasks.
arXiv Detail & Related papers (2022-10-19T08:06:39Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
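The SILC entry above mentions distilling local image features from an EMA teacher. The following is a minimal, generic sketch of that EMA self-distillation pattern, not SILC's exact objective; the `student`/`teacher` modules and the crop tensors are illustrative assumptions.

```python
# Generic EMA-teacher self-distillation sketch (local crops distilled toward a
# global view). The teacher is typically initialized as a copy of the student
# and is never updated by gradients, only by the EMA rule below.

import torch
import torch.nn.functional as F


@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    # Teacher weights follow an exponential moving average of the student's.
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param.detach(), alpha=1.0 - momentum)


def distillation_loss(student, teacher, local_crops, global_crop, temp=0.1):
    # Student embeds small local crops; the EMA teacher embeds the full view.
    with torch.no_grad():
        target = F.softmax(teacher(global_crop) / temp, dim=-1)
    loss = 0.0
    for crop in local_crops:
        log_pred = F.log_softmax(student(crop) / temp, dim=-1)
        loss = loss + F.kl_div(log_pred, target, reduction="batchmean")
    return loss / len(local_crops)
```

In practice `ema_update` is called after every optimizer step, so the teacher lags the student and provides stable targets for the local-to-global correspondence objective.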