Collaboration of Pre-trained Models Makes Better Few-shot Learner
- URL: http://arxiv.org/abs/2209.12255v1
- Date: Sun, 25 Sep 2022 16:23:12 GMT
- Title: Collaboration of Pre-trained Models Makes Better Few-shot Learner
- Authors: Renrui Zhang, Hanqiu Deng, Bohao Li, Wei Zhang, Hao Dong, Hongsheng
Li, Peng Gao, Yu Qiao
- Abstract summary: Few-shot classification requires deep neural networks to learn generalized representations only from limited training images.
Recently, CLIP-based methods have shown promising few-shot performance, benefiting from contrastive language-image pre-training.
We propose CoMo, a Collaboration of pre-trained Models that incorporates diverse prior knowledge from various pre-training paradigms for better few-shot learning.
- Score: 49.89134194181042
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Few-shot classification requires deep neural networks to learn generalized
representations from only a few training images, which is challenging but
significant in low-data regimes. Recently, CLIP-based methods have shown
promising few-shot performance, benefiting from contrastive language-image
pre-training. Motivated by this, we ask whether large-scale pre-training can
alleviate the few-shot data deficiency and also assist representation learning
through pre-learned knowledge. In this paper, we propose CoMo, a Collaboration
of pre-trained Models that incorporates diverse prior knowledge from various
pre-training paradigms for better few-shot learning. Our CoMo includes CLIP's
language-contrastive knowledge, DINO's vision-contrastive knowledge, and
DALL-E's language-generative knowledge. Specifically, CoMo works in two
aspects: few-shot data expansion and diverse knowledge ensemble. First, we
generate synthetic images via zero-shot DALL-E to enrich the few-shot training
data without any manual effort. Second, we introduce a learnable
Multi-Knowledge Adapter (MK-Adapter) to adaptively blend the predictions from
CLIP and DINO. Through such collaboration, CoMo fully unleashes the potential
of different pre-training methods and unifies them to achieve state-of-the-art
performance for few-shot classification. We conduct extensive experiments on
11 datasets to demonstrate the superiority and generalization ability of our
approach.
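To make the knowledge-ensemble aspect more concrete, below is a minimal PyTorch sketch of the blending step: a small learnable adapter that combines frozen CLIP's zero-shot logits with a cache-style classifier built from frozen DINO features of the (DALL-E-expanded) support set. The class name, per-class blending weights, temperature, and tensor shapes are illustrative assumptions, not the authors' released MK-Adapter implementation.

```python
# A sketch of the "diverse knowledge ensemble" idea, assuming CLIP and DINO
# features have already been extracted by frozen backbones. Architecture
# details here are hypothetical and simplified.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiKnowledgeAdapter(nn.Module):
    """Blends CLIP's language-contrastive prediction with a cache-style
    prediction built from DINO's vision-contrastive features."""

    def __init__(self, num_classes: int):
        super().__init__()
        self.num_classes = num_classes
        # Learnable per-class weights and a temperature (assumed design).
        self.clip_weight = nn.Parameter(torch.ones(num_classes))
        self.dino_weight = nn.Parameter(torch.ones(num_classes))
        self.temperature = nn.Parameter(torch.tensor(10.0))

    def forward(
        self,
        clip_logits: torch.Tensor,        # (B, C) image-text logits from frozen CLIP
        query_dino_feat: torch.Tensor,    # (B, D) DINO features of query images
        support_dino_feat: torch.Tensor,  # (N, D) DINO features of the support set
        support_labels: torch.Tensor,     # (N,)  class indices of support images
    ) -> torch.Tensor:
        # Cache-style DINO prediction: similarity of each query to every
        # support image, aggregated into per-class scores.
        q = F.normalize(query_dino_feat, dim=-1)
        s = F.normalize(support_dino_feat, dim=-1)
        sim = (self.temperature * q @ s.t()).softmax(dim=-1)           # (B, N)
        one_hot = F.one_hot(support_labels, self.num_classes).float()  # (N, C)
        dino_logits = sim @ one_hot                                     # (B, C)

        # Adaptive blend of language-contrastive and vision-contrastive knowledge.
        return self.clip_weight * clip_logits + self.dino_weight * dino_logits


if __name__ == "__main__":
    # In CoMo, the support rows would also include DALL-E-generated synthetic
    # images; random placeholders are used here for a shape check only.
    B, N, C, D = 4, 16, 10, 768
    adapter = MultiKnowledgeAdapter(num_classes=C)
    blended = adapter(
        clip_logits=torch.randn(B, C),
        query_dino_feat=torch.randn(B, D),
        support_dino_feat=torch.randn(N, D),
        support_labels=torch.randint(0, C, (N,)),
    )
    print(blended.shape)  # torch.Size([4, 10])
```

In such a setup, only the adapter parameters would be fitted on the few-shot (and synthetically expanded) training set while both pre-trained backbones stay frozen.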
Related papers
- Efficient and Long-Tailed Generalization for Pre-trained Vision-Language Model [43.738677778740325]
We propose a novel framework, termed Candle, to achieve efficient and long-tailed generalization.
Candle achieves state-of-the-art performance in extensive experiments on 11 diverse datasets.
(arXiv 2024-06-18)
- Less is More: A Closer Look at Semantic-based Few-Shot Learning [11.724194320966959]
Few-shot Learning aims to learn and distinguish new categories with a very limited number of available images.
We propose a simple but effective framework for few-shot learning tasks, specifically designed to exploit textual information and a pre-trained language model.
Our experiments conducted across four widely used few-shot datasets demonstrate that our simple framework achieves impressive results.
(arXiv 2024-01-10)
- Retrieval-Enhanced Contrastive Vision-Text Models [61.783728119255365]
We propose to equip vision-text models with the ability to refine their embedding with cross-modal retrieved information from a memory at inference time.
Remarkably, we show that this can be done with a lightweight, single-layer fusion transformer on top of a frozen CLIP.
Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks.
(arXiv 2023-06-12)
- Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners [55.119101947682715]
We propose CaFo, a Cascade of Foundation models that incorporates diverse prior knowledge of various pre-training paradigms for better few-shot learning.
Our CaFo incorporates CLIP's language-contrastive knowledge, DINO's vision-contrastive knowledge, DALL-E's vision-generative knowledge, and GPT-3's language-generative knowledge.
(arXiv 2023-03-03)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
(arXiv 2022-11-28)
- Contrastive Language-Image Pre-Training with Knowledge Graphs [33.211811772961234]
We propose a knowledge-based pre-training framework, dubbed Knowledge-CLIP, which injects semantic information into the widely used CLIP model.
Our model can semantically align the representations in vision and language with higher quality, and enhance the reasoning ability across scenarios and modalities.
(arXiv 2022-10-17)
- Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
(arXiv 2022-05-25)
- Curriculum Meta-Learning for Few-shot Classification [1.5039745292757671]
We propose an adaptation of the curriculum training framework that is applicable to state-of-the-art meta-learning techniques for few-shot classification.
Our experiments with the MAML algorithm on two few-shot image classification tasks show significant gains with the curriculum training framework.
(arXiv 2021-12-06)
- A Competence-aware Curriculum for Visual Concepts Learning via Question Answering [95.35905804211698]
We propose a competence-aware curriculum for visual concept learning in a question-answering manner.
We design a neural-symbolic concept learner for learning the visual concepts and a multi-dimensional Item Response Theory (mIRT) model for guiding the learning process.
Experimental results on CLEVR show that, with a competence-aware curriculum, the proposed method achieves state-of-the-art performance.
(arXiv 2020-07-03)