Exploiting Category Names for Few-Shot Classification with
Vision-Language Models
- URL: http://arxiv.org/abs/2211.16594v3
- Date: Tue, 18 Apr 2023 22:56:39 GMT
- Title: Exploiting Category Names for Few-Shot Classification with
Vision-Language Models
- Authors: Taihong Xiao, Zirui Wang, Liangliang Cao, Jiahui Yu, Shengyang Dai,
Ming-Hsuan Yang
- Abstract summary: Vision-language foundation models pretrained on large-scale data provide a powerful tool for many visual understanding tasks.
This paper shows that we can significantly improve the performance of few-shot classification by using the category names to initialize the classification head.
- Score: 78.51975804319149
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Vision-language foundation models pretrained on large-scale data provide a
powerful tool for many visual understanding tasks. Notably, many
vision-language models build two encoders (visual and textual) that can map two
modalities into the same embedding space. As a result, the learned
representations achieve good zero-shot performance on tasks like image
classification. However, when there are only a few examples per category, the
potential of large vision-language models often goes unrealized, mainly due to
the gap between the large number of parameters and the relatively small amount
of training data. This paper shows that we can significantly improve the
performance of few-shot classification by using the category names to
initialize the classification head. With the proposed category name
initialization method, our model achieves state-of-the-art performance on a
number of few-shot image classification benchmarks (e.g., 87.37% on ImageNet
and 96.08% on Stanford Cars, both using five-shot learning).
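The paper's central idea, initializing the classification head from text embeddings of the category names, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the name embeddings are assumed to come from a vision-language model's text encoder (e.g., encoding prompts like "a photo of a {class}"), and a NumPy cosine-similarity head stands in for the real model.

```python
import numpy as np

def init_head_from_names(name_embeddings):
    """Initialize classifier weights from text embeddings of category names.

    name_embeddings: (num_classes, dim) array, one row per class name,
    assumed to come from the text encoder of a vision-language model.
    Rows are L2-normalized so that the logits are cosine similarities.
    """
    norms = np.linalg.norm(name_embeddings, axis=1, keepdims=True)
    return name_embeddings / norms

def predict(image_embedding, head_weights):
    """Classify one image embedding with the name-initialized head."""
    z = image_embedding / np.linalg.norm(image_embedding)
    logits = head_weights @ z  # cosine similarity to each class-name embedding
    return int(np.argmax(logits))
```

In the few-shot setting, these name-initialized weights would then be fine-tuned on the handful of labeled examples rather than kept fixed, which is where the improvement over a randomly initialized head comes from.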
Related papers
- Circles: Inter-Model Comparison of Multi-Classification Problems with
High Number of Classes [0.24554686192257422]
We present our interactive visual analytics tool, called Circles, that allows a visual inter-model comparison of numerous classification models with 1K classes in one view.
Our prototype shows the results of 9 models with 1K classes.
arXiv Detail & Related papers (2023-09-08T19:39:46Z) - Vocabulary-free Image Classification [75.38039557783414]
We formalize a novel task, termed Vocabulary-free Image Classification (VIC).
VIC aims to assign to an input image a class that resides in an unconstrained language-induced semantic space, without the prerequisite of a known vocabulary.
CaSED is a method that exploits a pre-trained vision-language model and an external vision-language database to address VIC in a training-free manner.
arXiv Detail & Related papers (2023-06-01T17:19:43Z) - What's in a Name? Beyond Class Indices for Image Recognition [28.02490526407716]
We propose a vision-language model that assigns class names to images given only a large (essentially unconstrained) vocabulary of categories as prior information.
We leverage non-parametric methods to establish meaningful relationships between images, allowing the model to automatically narrow down the pool of candidate names.
Our method leads to a roughly 50% improvement over the baseline on ImageNet in the unsupervised setting.
arXiv Detail & Related papers (2023-04-05T11:01:23Z) - Learning to Name Classes for Vision and Language Models [57.0059455405424]
Large-scale vision and language models can achieve impressive zero-shot recognition performance by mapping class-specific text queries to image content.
We propose to leverage available data to learn, for each class, an optimal word embedding as a function of the visual content.
By learning new word embeddings on an otherwise frozen model, we are able to retain zero-shot capabilities for new classes, easily adapt models to new datasets, and adjust potentially erroneous, non-descriptive or ambiguous class names.
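The idea of learning class embeddings on top of an otherwise frozen model can be sketched with a plain gradient loop. This is a hypothetical NumPy illustration, not the paper's actual method: softmax cross-entropy is minimized with respect to the class embeddings only, while the image features (standing in here for a frozen encoder's output) are left untouched.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D logit vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def learn_class_embeddings(W, image_features, labels, lr=0.1, steps=100):
    """Refine class embeddings W (num_classes x dim) on frozen image features.

    Only W is updated; the image features never change, mimicking a
    frozen backbone. Uses full-batch gradient descent on softmax
    cross-entropy, with logits given by a linear head W @ z.
    """
    for _ in range(steps):
        grad = np.zeros_like(W)
        for z, y in zip(image_features, labels):
            z = z / np.linalg.norm(z)
            p = softmax(W @ z)   # predicted class probabilities
            p[y] -= 1.0          # d(cross-entropy)/d(logits)
            grad += np.outer(p, z)
        W = W - lr * grad / len(image_features)
    return W
```

Because the backbone stays frozen, zero-shot behavior on untouched classes is preserved: only the rows of W corresponding to adapted (or renamed) classes move.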
arXiv Detail & Related papers (2023-04-04T14:34:44Z) - Semantic Representation and Dependency Learning for Multi-Label Image
Recognition [76.52120002993728]
We propose a novel and effective semantic representation and dependency learning (SRDL) framework to learn category-specific semantic representation for each category.
Specifically, we design a category-specific attentional regions (CAR) module to generate channel/spatial-wise attention matrices that guide the model.
We also design an object erasing (OE) module to implicitly learn semantic dependency among categories by erasing semantic-aware regions.
arXiv Detail & Related papers (2022-04-08T00:55:15Z) - Multi-Label Image Classification with Contrastive Learning [57.47567461616912]
We show that a direct application of contrastive learning brings little improvement in multi-label cases.
We propose a novel framework for multi-label classification with contrastive learning in a fully supervised setting.
arXiv Detail & Related papers (2021-07-24T15:00:47Z) - Improving Few-shot Learning with Weakly-supervised Object Localization [24.3569501375842]
We propose a novel framework that generates class representations by extracting features from class-relevant regions of the images.
Our method outperforms the baseline few-shot model on the miniImageNet and tieredImageNet benchmarks.
arXiv Detail & Related papers (2021-05-25T07:39:32Z) - Aligning Visual Prototypes with BERT Embeddings for Few-Shot Learning [48.583388368897126]
Few-shot learning is the task of learning to recognize previously unseen categories of images.
We propose a method that takes into account the names of the image classes.
arXiv Detail & Related papers (2021-05-21T08:08:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.