Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong
Few-shot Learners
- URL: http://arxiv.org/abs/2303.02151v1
- Date: Fri, 3 Mar 2023 18:58:16 GMT
- Title: Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong
Few-shot Learners
- Authors: Renrui Zhang, Xiangfei Hu, Bohao Li, Siyuan Huang, Hanqiu Deng,
Hongsheng Li, Yu Qiao, Peng Gao
- Abstract summary: We propose CaFo, a Cascade of Foundation models that incorporates diverse prior knowledge of various pre-training paradigms for better few-shot learning.
Our CaFo incorporates CLIP's language-contrastive knowledge, DINO's vision-contrastive knowledge, DALL-E's vision-generative knowledge, and GPT-3's language-generative knowledge.
- Score: 55.119101947682715
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual recognition in low-data regimes requires deep neural networks
to learn generalized representations from limited training samples. Recently,
CLIP-based methods have shown promising few-shot performance, benefiting from
contrastive language-image pre-training. We then ask whether more diverse
pre-training knowledge can be cascaded to further assist few-shot
representation learning. In this paper, we propose CaFo, a Cascade of
Foundation models that incorporates diverse prior knowledge of various
pre-training paradigms for better few-shot learning. Our CaFo incorporates
CLIP's language-contrastive knowledge, DINO's vision-contrastive knowledge,
DALL-E's vision-generative knowledge, and GPT-3's language-generative
knowledge. Specifically, CaFo works by 'Prompt, Generate, then Cache'. First,
we leverage GPT-3 to produce textual inputs for prompting CLIP with rich
downstream linguistic semantics. Then, we generate synthetic images via DALL-E
to expand the few-shot training data without any manual effort. Finally, we
introduce a learnable cache model to adaptively blend the predictions from CLIP
and DINO. Through this collaboration, CaFo fully unleashes the potential of the
different pre-training methods and unifies them to achieve state-of-the-art
performance for few-shot classification. Code is available at
https://github.com/ZrrSkywalker/CaFo.
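To make the 'Cache' step concrete, the sketch below shows a Tip-Adapter-style
cache model that blends prompted zero-shot CLIP logits with cache predictions
built from CLIP and DINO features of the (few-shot plus DALL-E-generated)
training images. The random features, the equal blending weights, and the
hyper-parameters (beta, the 100x logit scale) are illustrative stand-ins, not
the authors' released implementation.

```python
import numpy as np

def l2norm(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
num_classes, shots, dim = 10, 16, 512  # "shots" counts few-shot + DALL-E-generated images per class

# Stand-ins for real encoder outputs (CLIP/DINO image features, CLIP text features of GPT-3 prompts).
clip_keys = l2norm(rng.normal(size=(num_classes * shots, dim)))   # CLIP features of training images
dino_keys = l2norm(rng.normal(size=(num_classes * shots, dim)))   # DINO features of the same images
values = np.eye(num_classes)[np.repeat(np.arange(num_classes), shots)]  # one-hot labels of cached images
text_weights = l2norm(rng.normal(size=(num_classes, dim)))        # CLIP text features from GPT-3 prompts

def cache_logits(query, keys, beta=5.5):
    # Tip-Adapter-style affinity between the test feature and the cached keys.
    affinity = np.exp(-beta * (1.0 - query @ keys.T))
    return affinity @ values

clip_feat = l2norm(rng.normal(size=(1, dim)))   # CLIP feature of a test image
dino_feat = l2norm(rng.normal(size=(1, dim)))   # DINO feature of the same test image

zero_shot = 100.0 * clip_feat @ text_weights.T  # prompted zero-shot CLIP prediction
# Illustrative equal blending of the two cache predictions with the zero-shot logits.
logits = zero_shot + cache_logits(clip_feat, clip_keys) + cache_logits(dino_feat, dino_keys)
print(logits.argmax())
```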
Related papers
- Knowledge Adaptation Network for Few-Shot Class-Incremental Learning [23.90555521006653]
Few-shot class-incremental learning aims to incrementally recognize new classes using a few samples.
One effective approach to this challenge is to construct prototypical evolution classifiers.
We argue that such a strategy is suboptimal because the representations of new classes are weak and biased.
arXiv Detail & Related papers (2024-09-18T07:51:38Z)
- Learning Prompt with Distribution-Based Feature Replay for Few-Shot Class-Incremental Learning [56.29097276129473]
We propose a simple yet effective framework, named Learning Prompt with Distribution-based Feature Replay (LP-DiF).
To prevent the learnable prompt from forgetting old knowledge in the new session, we propose a pseudo-feature replay approach.
When progressing to a new session, pseudo-features sampled from old-class distributions are combined with training images of the current session to optimize the prompt.
arXiv Detail & Related papers (2024-01-03T07:59:17Z)
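As a rough illustration of the pseudo-feature replay described in the LP-DiF
summary above, the sketch below samples pseudo-features of old classes from
stored per-class Gaussian statistics and mixes them with current-session
features to form the batch that would optimize the learnable prompt. The
diagonal-Gaussian assumption, the shapes, and the names are illustrative
assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, old_classes, new_classes, shots = 512, 5, 3, 5

# Stored per-class statistics (feature mean and std) of the old classes.
old_stats = {c: (rng.normal(size=dim), np.abs(rng.normal(size=dim))) for c in range(old_classes)}

def sample_pseudo_features(stats, n_per_class):
    # Replay old classes by sampling pseudo-features from their stored Gaussians.
    feats, labels = [], []
    for c, (mu, sigma) in stats.items():
        feats.append(rng.normal(mu, sigma, size=(n_per_class, dim)))
        labels.append(np.full(n_per_class, c))
    return np.concatenate(feats), np.concatenate(labels)

# Features of the current session's few-shot images (random stand-ins for encoder outputs).
new_feats = rng.normal(size=(new_classes * shots, dim))
new_labels = np.repeat(np.arange(old_classes, old_classes + new_classes), shots)

pseudo_feats, pseudo_labels = sample_pseudo_features(old_stats, n_per_class=shots)
train_feats = np.concatenate([pseudo_feats, new_feats])
train_labels = np.concatenate([pseudo_labels, new_labels])
# train_feats / train_labels would then drive the prompt update, e.g. by minimizing a
# cross-entropy loss between these features and classifier weights built from the prompt.
```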
- Collaboration of Pre-trained Models Makes Better Few-shot Learner [49.89134194181042]
Few-shot classification requires deep neural networks to learn generalized representations from only limited training images.
Recently, CLIP-based methods have shown promising few-shot performance, benefiting from contrastive language-image pre-training.
We propose CoMo, a Collaboration of pre-trained Models that incorporates diverse prior knowledge from various pre-training paradigms for better few-shot learning.
arXiv Detail & Related papers (2022-09-25T16:23:12Z)
- Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification [58.06983806317233]
Contrastive Vision-Language Pre-training, known as CLIP, has provided a new paradigm for learning visual representations using large-scale image-text pairs.
To enhance CLIP's adaptation capability, existing methods propose to fine-tune additional learnable modules.
We propose a training-free adaptation method for CLIP to conduct few-shot classification, termed Tip-Adapter.
arXiv Detail & Related papers (2022-07-19T19:12:11Z)
- OrdinalCLIP: Learning Rank Prompts for Language-Guided Ordinal Regression [94.28253749970534]
We propose to learn the rank concepts from the rich semantic CLIP latent space.
OrdinalCLIP consists of learnable context tokens and learnable rank embeddings.
Experimental results show that our paradigm achieves competitive performance in general ordinal regression tasks.
arXiv Detail & Related papers (2022-06-06T03:54:53Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic and can be applied to arbitrary dense prediction systems and various pre-trained visual backbones.
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
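The pixel-text matching idea from the DenseCLIP summary above can be sketched
as follows: per-pixel image embeddings are compared with class text embeddings
to produce per-class score maps that can guide a dense-prediction head. The
shapes and the downstream use of the score maps are illustrative assumptions,
not the paper's exact pipeline.

```python
import numpy as np

def l2norm(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
H, W, C, K = 32, 32, 512, 20  # feature-map height/width, channels, number of classes

pixel_feats = l2norm(rng.normal(size=(H, W, C)))  # stand-in for the CLIP image-encoder feature map
text_feats = l2norm(rng.normal(size=(K, C)))      # stand-in for CLIP text embeddings of class prompts

# Pixel-text score maps: cosine similarity between every pixel embedding and every class embedding.
score_maps = np.einsum('hwc,kc->hwk', pixel_feats, text_feats)  # shape (H, W, K)
# The score maps can supervise segmentation directly or be concatenated to the backbone
# features as extra input for the dense-prediction head.
```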
- REALM: Retrieval-Augmented Language Model Pre-Training [37.3178586179607]
We augment language model pre-training with a latent knowledge retriever, which allows the model to retrieve and attend over documents from a large corpus such as Wikipedia.
For the first time, we show how to pre-train such a knowledge retriever in an unsupervised manner.
We demonstrate the effectiveness of Retrieval-Augmented Language Model pre-training (REALM) by fine-tuning on the challenging task of Open-domain Question Answering (Open-QA).
arXiv Detail & Related papers (2020-02-10T18:40:59Z)
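A toy sketch of the retrieve-then-predict idea from the REALM summary above:
documents are retrieved with probability proportional to the inner product of
query and document embeddings, and the prediction marginalizes over the
retrieved documents, roughly p(y|x) = sum_z p(y|z,x) p(z|x). The embeddings
and the toy predictor are placeholders, not REALM's actual encoders.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, num_docs, vocab, k = 128, 1000, 50, 5

query = rng.normal(size=dim)                 # embedding of the (masked) input x
doc_embs = rng.normal(size=(num_docs, dim))  # embeddings of the retrieval corpus

scores = doc_embs @ query                    # relevance scores: inner products of embeddings
topk = np.argsort(scores)[-k:]               # retrieve the top-k documents
p_z = np.exp(scores[topk] - scores[topk].max())
p_z /= p_z.sum()                             # retrieval distribution p(z|x) over the top-k documents

def predictor(query_emb, doc_emb):
    # Placeholder for the knowledge-augmented encoder p(y | z, x).
    logits = rng.normal(size=vocab)
    return np.exp(logits) / np.exp(logits).sum()

# Marginalize over retrieved documents: p(y|x) = sum_z p(y|z,x) p(z|x).
p_y = sum(p * predictor(query, doc_embs[z]) for p, z in zip(p_z, topk))
print(p_y.argmax())
```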
This list is automatically generated from the titles and abstracts of the papers on this site.