SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for
Few-shot Image Classification
- URL: http://arxiv.org/abs/2211.16191v1
- Date: Mon, 28 Nov 2022 14:58:15 GMT
- Title: SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for
Few-shot Image Classification
- Authors: Fang Peng, Xiaoshan Yang, Changsheng Xu
- Abstract summary: We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features complement the cross-modal features well to improve few-shot image classification.
- Score: 84.05253637260743
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although significant progress has been made in few-shot learning, most
existing few-shot learning methods require supervised pre-training on a large
number of samples from base classes, which limits their generalization ability in
real-world applications. Recently, large-scale self-supervised vision-language
models (e.g., CLIP) have provided a new paradigm for transferable visual
representation learning. However, pre-trained vision-language models may neglect
detailed visual information that is difficult to describe in language yet
important for learning an effective classifier in few-shot classification. To
address this problem, we propose a new framework, named Semantic-guided
Visual Adapting (SgVA), which effectively extends vision-language
pre-trained models to produce discriminative task-specific visual features by
comprehensively using a vision-specific contrastive loss, a cross-modal
contrastive loss, and an implicit knowledge distillation. The implicit
knowledge distillation is designed to transfer fine-grained cross-modal
knowledge to guide the updating of the vision adapter. State-of-the-art results
on 13 datasets demonstrate that the adapted visual features complement
the cross-modal features well to improve few-shot image classification.
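The abstract names the three training signals but not their concrete form. The snippet below is a minimal, hypothetical sketch of how such an objective could be wired up on top of frozen CLIP features, assuming InfoNCE-style contrastive terms and a KL-divergence stand-in for the implicit distillation; all names (VisualAdapter, info_nce, sgva_loss) and hyperparameters are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualAdapter(nn.Module):
    """Lightweight residual adapter applied to frozen CLIP image features."""

    def __init__(self, dim, hidden=256, alpha=0.2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.alpha = alpha  # blend between adapted and original features

    def forward(self, x):
        return self.alpha * self.net(x) + (1 - self.alpha) * x


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between row-aligned feature matrices a and b."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


def sgva_loss(img_feat, img_feat_aug, txt_feat, adapter, temperature=0.07):
    """Three terms mirroring the abstract: a vision-specific contrastive loss,
    a cross-modal contrastive loss, and an implicit distillation that lets the
    frozen cross-modal similarities guide the adapter's updates."""
    v1 = adapter(img_feat)       # adapted features of one view
    v2 = adapter(img_feat_aug)   # adapted features of an augmented view

    loss_vis = info_nce(v1, v2, temperature)         # vision-specific contrastive loss
    loss_xmod = info_nce(v1, txt_feat, temperature)  # cross-modal contrastive loss

    # Implicit knowledge distillation (one plausible form): match the adapted
    # image-to-text similarity distribution to the frozen CLIP one.
    with torch.no_grad():
        t = F.normalize(img_feat, dim=-1) @ F.normalize(txt_feat, dim=-1).t() / temperature
        teacher = F.softmax(t, dim=-1)
    s = F.normalize(v1, dim=-1) @ F.normalize(txt_feat, dim=-1).t() / temperature
    student = F.log_softmax(s, dim=-1)
    loss_kd = F.kl_div(student, teacher, reduction="batchmean")

    return loss_vis + loss_xmod + loss_kd
```

In a few-shot setting, img_feat_aug would come from a second augmented view of each support image and txt_feat from the CLIP text embedding of that image's class prompt; both CLIP encoders stay frozen and only the adapter is updated.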
Related papers
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension [99.9389737339175]
We introduce Self-Training on Image Comprehension (STIC), a self-training approach tailored specifically to image comprehension.
First, the model self-constructs preference data for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
- LaViP: Language-Grounded Visual Prompts [27.57227844809257]
We introduce a language-grounded visual prompting method to adapt the visual encoder of vision-language models for downstream tasks.
By capitalizing on language integration, we devise a parameter-efficient strategy to adjust the input of the visual encoder.
Our algorithm can operate even in black-box scenarios, showcasing adaptability in situations where access to the model's parameters is constrained.
arXiv Detail & Related papers (2023-12-18T05:50:10Z)
- What Makes for Good Visual Tokenizers for Large Language Models? [26.488269091290597]
We investigate proper pre-training methods to build good visual tokenizers, making Large Language Models (LLMs) into powerful Multimodal Large Language Models (MLLMs).
We discuss different visual tokenizers pre-trained with dominant methods (i.e., DeiT, CLIP, MAE, DINO).
We obtain a new MLLM equipped with a tailored Good Visual Tokenizer (GVT), which exhibits strong visual comprehension capability at multiple scales.
arXiv Detail & Related papers (2023-05-20T16:11:26Z)
- K-LITE: Learning Transferable Visual Models with External Knowledge [242.3887854728843]
K-LITE (Knowledge-augmented Language-Image Training and Evaluation) is a strategy to leverage external knowledge to build transferable visual systems.
In training, it enriches entities in natural language with WordNet and Wiktionary knowledge.
In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts.
arXiv Detail & Related papers (2022-04-20T04:47:01Z)
- A Simple Long-Tailed Recognition Baseline via Vision-Language Model [92.2866546058082]
The visual world naturally exhibits a long-tailed distribution of open classes, which poses great challenges to modern visual systems.
Recent advances in contrastive visual-language pretraining shed light on a new pathway for visual recognition.
We propose BALLAD to leverage contrastive vision-language models for long-tailed recognition.
arXiv Detail & Related papers (2021-11-29T17:49:24Z)
- CLIP-Adapter: Better Vision-Language Models with Feature Adapters [79.52844563138493]
We show that there is an alternative path to achieve better vision-language models other than prompt tuning.
In this paper, we propose CLIP-Adapter to conduct fine-tuning with feature adapters on either the visual or the language branch (a feature-adapter sketch appears after this list).
Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2021-10-09T11:39:30Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
- Learning Visual Representations with Caption Annotations [19.24013129952071]
We propose a proxy task to learn visual representations over image-caption pairs.
ICMLM (image-conditioned masked language modeling) predicts masked words in captions by relying on visual cues.
Our experiments confirm that image captions can be leveraged to inject global and localized semantic information into visual representations.
arXiv Detail & Related papers (2020-08-04T08:04:16Z)
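For readers unfamiliar with the feature-adapter idea referenced in the CLIP-Adapter entry above, here is a hypothetical sketch of a residual bottleneck adapter and a cosine-similarity classification head; the layer sizes, blending ratio, and function names are assumptions for illustration, not the published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureAdapter(nn.Module):
    """Trainable bottleneck inserted after a frozen encoder; its output is
    residually blended with the original feature."""

    def __init__(self, dim, reduction=4, ratio=0.2):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(),
            nn.Linear(dim // reduction, dim), nn.ReLU(),
        )
        self.ratio = ratio  # how much adapted signal to mix back in

    def forward(self, x):
        return self.ratio * self.fc(x) + (1 - self.ratio) * x


def classify(image_feat, class_text_feat, adapter, scale=100.0):
    """Cosine-similarity logits between adapted image features and the
    CLIP text embeddings of the class prompts: shape [batch, num_classes]."""
    img = F.normalize(adapter(image_feat), dim=-1)
    txt = F.normalize(class_text_feat, dim=-1)
    return scale * img @ txt.t()
```

Only the adapter parameters are trained on the few-shot data (e.g., with cross-entropy over these logits), while the pre-trained image and text encoders remain frozen; the same recipe can be applied symmetrically to the language branch.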