What's in a Name? Beyond Class Indices for Image Recognition
- URL: http://arxiv.org/abs/2304.02364v1
- Date: Wed, 5 Apr 2023 11:01:23 GMT
- Title: What's in a Name? Beyond Class Indices for Image Recognition
- Authors: Kai Han and Yandong Li and Sagar Vaze and Jie Li and Xuhui Jia
- Abstract summary: We propose a vision-language model to assign class names to images given only a large and essentially unconstrained vocabulary of categories as prior information.
Specifically, we propose iteratively clustering the data and voting on class names within each cluster, showing that this enables a roughly 50% improvement over the baseline on ImageNet.
- Score: 31.68225941659493
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing machine learning models demonstrate excellent performance in image
object recognition after training on a large-scale dataset under full
supervision. However, these models only learn to map an image to a predefined
class index, without revealing the actual semantic meaning of the object in the
image. In contrast, vision-language models like CLIP are able to assign
semantic class names to unseen objects in a 'zero-shot' manner, although they
still rely on a predefined set of candidate names at test time. In this paper,
we reconsider the recognition problem and task a vision-language model to
assign class names to images given only a large and essentially unconstrained
vocabulary of categories as prior information. We use non-parametric methods to
establish relationships between images which allow the model to automatically
narrow down the set of possible candidate names. Specifically, we propose
iteratively clustering the data and voting on class names within each
cluster, showing that this enables a roughly 50% improvement over the
baseline on ImageNet.
Furthermore, we tackle this problem both in unsupervised and partially
supervised settings, as well as with a coarse-grained and fine-grained search
space as the unconstrained dictionary.
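The cluster-then-vote idea from the abstract can be sketched in a few lines. Everything below is an illustrative stand-in, not the paper's implementation: random vectors play the role of CLIP image/text embeddings, the four-name vocabulary replaces the large unconstrained dictionary, and a plain k-means replaces whatever non-parametric grouping the authors use.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: in practice these would be CLIP text embeddings
# of names drawn from a very large vocabulary.
vocab = ["tabby cat", "golden retriever", "sports car", "oak tree"]
text_emb = rng.normal(size=(len(vocab), 64))
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)

# Synthesize "image" embeddings near the text embeddings of their true class.
true_ids = rng.integers(0, len(vocab), size=60)
img_emb = text_emb[true_ids] + 0.05 * rng.normal(size=(60, 64))
img_emb /= np.linalg.norm(img_emb, axis=1, keepdims=True)

def kmeans(x, k, iters=20):
    """Plain cosine k-means; returns a cluster index per row of x."""
    centers = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(x @ centers.T, axis=1)
        for j in range(k):
            if np.any(assign == j):
                c = x[assign == j].mean(axis=0)
                centers[j] = c / np.linalg.norm(c)
    return assign

assign = kmeans(img_emb, k=4)

# Vote: each image nominates its nearest vocabulary name, and every
# cluster adopts the majority name among its members.
nearest = np.argmax(img_emb @ text_emb.T, axis=1)
cluster_name = {}
for j in set(assign.tolist()):
    votes = nearest[assign == j]
    cluster_name[j] = vocab[np.bincount(votes).argmax()]

labels = [cluster_name[j] for j in assign]
```

Voting over a cluster, rather than naming each image independently, is what lets neighbouring images correct one another's noisy nearest-name guesses.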
Related papers
- Vocabulary-free Image Classification and Semantic Segmentation [71.78089106671581]
We introduce the Vocabulary-free Image Classification (VIC) task, which aims to assign a class from an unconstrained language-induced semantic space to an input image without needing a known vocabulary.
VIC is challenging due to the vastness of the semantic space, which contains millions of concepts, including fine-grained categories.
We propose Category Search from External Databases (CaSED), a training-free method that leverages a pre-trained vision-language model and an external database.
arXiv Detail & Related papers (2024-04-16T19:27:21Z)
- Vocabulary-free Image Classification [75.38039557783414]
We formalize a novel task, termed as Vocabulary-free Image Classification (VIC)
VIC aims to assign each input image a class drawn from an unconstrained language-induced semantic space, without requiring a known vocabulary.
CaSED is a method that exploits a pre-trained vision-language model and an external vision-language database to address VIC in a training-free manner.
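The training-free retrieve-then-score pattern that CaSED describes can be sketched as below. All specifics are invented stand-ins: the three captions replace CaSED's large external database, random vectors replace real VLM embeddings, and a naive word split replaces proper noun extraction.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical external database of captions with pre-computed embeddings.
captions = ["a photo of a tabby cat", "a red sports car", "a dog in a park"]
cap_emb = rng.normal(size=(len(captions), 32))
cap_emb /= np.linalg.norm(cap_emb, axis=1, keepdims=True)

# Stand-in image embedding, placed near the first caption.
query = cap_emb[0] + 0.1 * rng.normal(size=32)
query /= np.linalg.norm(query)

# Step 1: retrieve the captions most similar to the image.
top_k = np.argsort(cap_emb @ query)[::-1][:2]

# Step 2: extract candidate category names from the retrieved text
# (a crude word-length filter here; CaSED filters actual nouns).
candidates = {w for i in top_k for w in captions[i].split() if len(w) > 3}

# Step 3: score each candidate against the image via text embeddings
# (random stand-ins here; in practice the VLM's text encoder).
cand = sorted(candidates)
cand_emb = rng.normal(size=(len(cand), 32))
cand_emb /= np.linalg.norm(cand_emb, axis=1, keepdims=True)
best = cand[int(np.argmax(cand_emb @ query))]
```

The retrieval step is what shrinks the million-concept semantic space to a handful of plausible names, so no training is needed.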
arXiv Detail & Related papers (2023-06-01T17:19:43Z)
- Learning to Name Classes for Vision and Language Models [57.0059455405424]
Large scale vision and language models can achieve impressive zero-shot recognition performance by mapping class specific text queries to image content.
We propose to leverage available data to learn, for each class, an optimal word embedding as a function of the visual content.
By learning new word embeddings on an otherwise frozen model, we are able to retain zero-shot capabilities for new classes, easily adapt models to new datasets, and adjust potentially erroneous, non-descriptive or ambiguous class names.
arXiv Detail & Related papers (2023-04-04T14:34:44Z)
- Exploiting Category Names for Few-Shot Classification with Vision-Language Models [78.51975804319149]
Vision-language foundation models pretrained on large-scale data provide a powerful tool for many visual understanding tasks.
This paper shows that we can significantly improve the performance of few-shot classification by using the category names to initialize the classification head.
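Initializing a classification head from category names can be sketched as follows; the dimensions, the synthetic few-shot data, and the plain softmax-regression updates are illustrative assumptions, with random vectors standing in for real text-encoder embeddings of the class names.

```python
import numpy as np

rng = np.random.default_rng(2)
num_classes, dim = 5, 16

# Stand-in text embeddings of the class names (in practice, the output
# of a pretrained text encoder such as CLIP's).
name_emb = rng.normal(size=(num_classes, dim))
name_emb /= np.linalg.norm(name_emb, axis=1, keepdims=True)

# Initialize the head with the name embeddings instead of random weights.
W = name_emb.copy()

# Few-shot data: 2 examples per class, synthesized near the name embeddings.
xs = np.vstack([name_emb[c] + 0.1 * rng.normal(size=dim)
                for c in range(num_classes) for _ in range(2)])
ys = np.repeat(np.arange(num_classes), 2)

# Fine-tune only the head with softmax-regression gradient steps.
lr = 0.1
for _ in range(50):
    logits = xs @ W.T
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(ys)), ys] -= 1.0
    W -= lr * (p.T @ xs) / len(ys)

preds = np.argmax(xs @ W.T, axis=1)
```

Starting from the name embeddings means the head is already a reasonable zero-shot classifier before the few labelled examples are seen, which is the source of the few-shot gain the paper reports.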
arXiv Detail & Related papers (2022-11-29T21:08:46Z)
- Exploiting Unlabeled Data with Vision and Language Models for Object Detection [64.94365501586118]
Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets.
We propose a novel method that leverages the rich semantics available in recent vision and language models to localize and classify objects in unlabeled images.
We demonstrate the value of the generated pseudo labels in two specific tasks, open-vocabulary detection and semi-supervised object detection.
arXiv Detail & Related papers (2022-07-18T21:47:15Z)
- Improving Few-shot Learning with Weakly-supervised Object Localization [24.3569501375842]
We propose a novel framework that generates class representations by extracting features from class-relevant regions of the images.
Our method outperforms the baseline few-shot model on the miniImageNet and tieredImageNet benchmarks.
arXiv Detail & Related papers (2021-05-25T07:39:32Z)
- Aligning Visual Prototypes with BERT Embeddings for Few-Shot Learning [48.583388368897126]
Few-shot learning is the task of learning to recognize previously unseen categories of images.
We propose a method that takes into account the names of the image classes.
arXiv Detail & Related papers (2021-05-21T08:08:28Z)