TAP: Targeted Prompting for Task Adaptive Generation of Textual Training
Instances for Visual Classification
- URL: http://arxiv.org/abs/2309.06809v1
- Date: Wed, 13 Sep 2023 08:59:54 GMT
- Authors: M. Jehanzeb Mirza, Leonid Karlinsky, Wei Lin, Horst Possegger, Rogerio
Feris, Horst Bischof
- Abstract summary: Vision and Language Models (VLMs) have enabled visual recognition of a potentially unlimited set of categories described by text prompts.
For the best visual recognition performance, these models still require tuning to better fit the data distributions of the downstream tasks.
- Score: 28.72126911321771
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision and Language Models (VLMs), such as CLIP, have enabled visual
recognition of a potentially unlimited set of categories described by text
prompts. However, for the best visual recognition performance, these models
still require tuning to better fit the data distributions of the downstream
tasks, in order to overcome the domain shift from the web-based pre-training
data. Recently, it has been shown that it is possible to effectively tune VLMs
without any paired data, and in particular to effectively improve VLMs' visual
recognition performance using text-only training data generated by Large
Language Models (LLMs). In this paper, we dive deeper into this exciting
text-only VLM training approach and explore how it can be improved
significantly further by taking the specifics of the downstream task into account when
sampling text data from LLMs. In particular, compared to the SOTA text-only VLM
training approach, we demonstrate up to 8.4% performance improvement in (cross)
domain-specific adaptation, up to 8.7% improvement in fine-grained recognition,
and 3.1% overall average improvement in zero-shot classification compared to
strong baselines.
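The text-only training idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a CLIP-like shared embedding space in which a linear classifier fit on text embeddings can be applied directly to image embeddings, and it replaces real LLM-generated descriptions and real images with random stand-in vectors clustered around hypothetical class prototypes. The names `train_text_classifier` and `classify` are illustrative only, and the task-targeted prompting that TAP contributes is not modeled here.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length, as is typical for CLIP-style embeddings."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def train_text_classifier(text_emb, labels, num_classes, lr=0.5, epochs=200):
    """Fit a softmax linear classifier with gradient descent on text embeddings only."""
    n, d = text_emb.shape
    W = np.zeros((d, num_classes))
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = text_emb @ W
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # Gradient of the mean cross-entropy loss with respect to W.
        W -= lr * text_emb.T @ (probs - onehot) / n
    return W


def classify(image_emb, W):
    """Apply the text-trained classifier to image embeddings."""
    return (image_emb @ W).argmax(axis=1)


# Stand-in data (assumption): both modalities scatter around shared class prototypes.
rng = np.random.default_rng(0)
num_classes, dim = 3, 64
prototypes = l2_normalize(rng.normal(size=(num_classes, dim)))

# 20 "LLM-generated description" embeddings per class (random stand-ins).
text_labels = np.repeat(np.arange(num_classes), 20)
text_emb = l2_normalize(
    prototypes[text_labels] + 0.05 * rng.normal(size=(text_labels.size, dim))
)

# 10 "image" embeddings per class (random stand-ins), used only at test time.
image_labels = np.repeat(np.arange(num_classes), 10)
image_emb = l2_normalize(
    prototypes[image_labels] + 0.05 * rng.normal(size=(image_labels.size, dim))
)

W = train_text_classifier(text_emb, text_labels, num_classes)
preds = classify(image_emb, W)
accuracy = float((preds == image_labels).mean())
print(f"accuracy on stand-in image embeddings: {accuracy:.2f}")
```

In a real setting the text embeddings would come from a VLM text encoder applied to LLM-generated sentences, and the quality of those sentences for the downstream task is exactly what the abstract says the paper targets.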
Related papers
- Bridge the Modality and Capability Gaps in Vision-Language Model Selection [62.26769826687365]
Vision Language Models (VLMs) excel in zero-shot image classification by pairing images with textual category names.
To better reuse the VLM resource, a promising strategy is selecting appropriate Pre-Trained VLMs from the VLM Zoo.
We analyze two inherent challenges in assessing the ability of a VLM in this Language-Only VLM selection.
We propose VLM Selection With gAp Bridging to mitigate the negative impact of two gaps.
arXiv Detail & Related papers (2024-03-20T17:54:58Z)
- Learning Semantic Proxies from Visual Prompts for Parameter-Efficient Fine-Tuning in Deep Metric Learning [13.964106147449051]
Existing solutions concentrate on fine-tuning the pre-trained models on conventional image datasets.
We propose a novel and effective framework based on learning Visual Prompts (VPT) in the pre-trained Vision Transformers (ViT).
We demonstrate that our new approximations with semantic information are superior to representative capabilities.
arXiv Detail & Related papers (2024-02-04T04:42:05Z)
- Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions [24.596929878045568]
We develop methods to train vision-language models (VLMs) with "bag-level" image-text supervision.
We use descriptions of categories generated by large language models (LLMs) and abundant, fine-grained image classification datasets.
Our findings suggest that geographic priors can be just as effective and are complementary to visual appearance.
arXiv Detail & Related papers (2024-01-04T08:39:13Z)
- VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z)
- Meta-Adapter: An Online Few-shot Learner for Vision-Language Model [64.21017759533474]
Contrastive vision-language pre-training, known as CLIP, demonstrates remarkable potential in perceiving open-world visual concepts.
Few-shot learning methods based on CLIP typically require offline fine-tuning of the parameters on few-shot samples.
We propose the Meta-Adapter, a lightweight residual-style adapter, to refine the CLIP features guided by the few-shot samples in an online manner.
arXiv Detail & Related papers (2023-11-07T07:27:16Z)
- Distribution-Aware Prompt Tuning for Vision-Language Models [20.02599087680773]
A key to prompt tuning is the feature space alignment between two modalities via learnable vectors with model parameters fixed.
Inspired by this observation, we propose distribution-aware prompt tuning (DAPT) for vision-language models.
Our experiments on 11 benchmark datasets demonstrate that our method significantly improves generalizability.
arXiv Detail & Related papers (2023-09-06T23:49:11Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for
Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features complement the cross-modal features well and improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
- Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary
Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD).
During the adapting stage, we enable the VLM to obtain fine-grained alignment by using learnable text prompts to resolve an auxiliary dense pixel-wise prediction task.
Experiments show that our method achieves state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
arXiv Detail & Related papers (2022-11-02T03:38:02Z)
- SimVLM: Simple Visual Language Model Pretraining with Weak Supervision [48.98275876458666]
We present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM).
SimVLM reduces the training complexity by exploiting large-scale weak supervision.
It achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks.
arXiv Detail & Related papers (2021-08-24T18:14:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.