What's in a Name? Beyond Class Indices for Image Recognition
- URL: http://arxiv.org/abs/2304.02364v2
- Date: Sat, 27 Jul 2024 15:07:38 GMT
- Title: What's in a Name? Beyond Class Indices for Image Recognition
- Authors: Kai Han, Xiaohu Huang, Yandong Li, Sagar Vaze, Jie Li, Xuhui Jia
- Abstract summary: We task a vision-language model with assigning class names to images, given only a large (essentially unconstrained) vocabulary of categories as prior information.
We leverage non-parametric methods to establish meaningful relationships between images, allowing the model to automatically narrow down the pool of candidate names.
Our method leads to a roughly 50% improvement over the baseline on ImageNet in the unsupervised setting.
- Score: 28.02490526407716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing machine learning models demonstrate excellent performance in image object recognition after training on a large-scale dataset under full supervision. However, these models only learn to map an image to a predefined class index, without revealing the actual semantic meaning of the object in the image. In contrast, vision-language models like CLIP are able to assign semantic class names to unseen objects in a 'zero-shot' manner, though they are once again provided a pre-defined set of candidate names at test-time. In this paper, we reconsider the recognition problem and task a vision-language model with assigning class names to images given only a large (essentially unconstrained) vocabulary of categories as prior information. We leverage non-parametric methods to establish meaningful relationships between images, allowing the model to automatically narrow down the pool of candidate names. Our proposed approach entails iteratively clustering the data and employing a voting mechanism to determine the most suitable class names. Additionally, we investigate the potential of incorporating additional textual features to enhance clustering performance. To achieve this, we employ the CLIP vision and text encoders to retrieve relevant texts from an external database, which can provide supplementary semantic information to inform the clustering process. Furthermore, we tackle this problem both in unsupervised and partially supervised settings, as well as with a coarse-grained and fine-grained search space as the unconstrained dictionary. Remarkably, our method leads to a roughly 50% improvement over the baseline on ImageNet in the unsupervised setting.
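To make the recipe concrete, below is a minimal sketch of the cluster-then-vote loop from the abstract, with synthetic unit-norm features standing in for CLIP image and text embeddings; the vocabulary, cluster count, vote depth, and feature-fusion step are illustrative assumptions rather than the paper's exact procedure.

```python
# Minimal sketch of the cluster-then-vote loop described in the abstract.
# Synthetic unit-norm features stand in for CLIP image/text embeddings;
# vocabulary size, cluster count, and vote depth are illustrative choices.
import numpy as np
from sklearn.cluster import KMeans
from collections import Counter

rng = np.random.default_rng(0)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for CLIP embeddings (in practice: encode the images and the
# vocabulary of candidate names with the CLIP vision/text encoders).
image_feats = l2_normalize(rng.normal(size=(200, 64)))
vocab = [f"name_{i}" for i in range(1000)]   # essentially unconstrained vocabulary
text_feats = l2_normalize(rng.normal(size=(len(vocab), 64)))

num_clusters, top_k = 10, 5
for it in range(3):                          # iterative refinement
    labels = KMeans(n_clusters=num_clusters, n_init=10,
                    random_state=it).fit_predict(image_feats)
    cluster_names = {}
    for c in range(num_clusters):
        members = image_feats[labels == c]
        # Each image votes for its top-k most similar candidate names.
        sims = members @ text_feats.T
        votes = Counter()
        for row in sims:
            votes.update(np.argsort(row)[-top_k:].tolist())
        winner = votes.most_common(1)[0][0]
        cluster_names[c] = vocab[winner]
        # Narrow the candidate pool: nudge member features toward the
        # winning name's text embedding before the next iteration.
        image_feats[labels == c] = l2_normalize(members + 0.2 * text_feats[winner])
    print(f"iter {it}: {sorted(set(cluster_names.values()))[:5]} ...")
```

In the paper, texts retrieved from an external database additionally inform the clustering; in this toy version, the fusion step only nudges each cluster's image features toward the winning name's embedding.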
Related papers
- Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels [53.8817160001038]
We propose a novel method, PixelCLIP, to adapt the CLIP image encoder for pixel-level understanding.
To address the challenges of leveraging masks without semantic labels, we devise an online clustering algorithm.
PixelCLIP shows significant performance improvements over CLIP and competitive results compared to caption-supervised methods.
arXiv Detail & Related papers (2024-09-30T01:13:03Z)
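The summary does not specify PixelCLIP's online clustering algorithm; the following is a generic streaming k-means sketch of the kind such a step could use over mask-pooled features, not the actual PixelCLIP implementation.

```python
# Generic online (streaming) k-means over mask-pooled embeddings.
# Illustrative sketch only; PixelCLIP's actual algorithm may differ.
import numpy as np

rng = np.random.default_rng(0)
dim, k = 64, 8
centroids = rng.normal(size=(k, dim))
counts = np.zeros(k)

def online_kmeans_update(feat):
    """Assign one mask embedding to its nearest centroid and move it."""
    j = np.argmin(np.linalg.norm(centroids - feat, axis=1))
    counts[j] += 1
    lr = 1.0 / counts[j]                 # per-cluster decaying step size
    centroids[j] += lr * (feat - centroids[j])
    return j

# Stream of mask-pooled embeddings (stand-ins for pooled CLIP features).
for step in range(1000):
    online_kmeans_update(rng.normal(size=dim))
```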
- Vocabulary-free Image Classification and Semantic Segmentation [71.78089106671581]
We introduce the Vocabulary-free Image Classification (VIC) task, which aims to assign a class from an unconstrained language-induced semantic space to an input image without needing a known vocabulary.
VIC is challenging due to the vastness of the semantic space, which contains millions of concepts, including fine-grained categories.
We propose Category Search from External Databases (CaSED), a training-free method that leverages a pre-trained vision-language model and an external database.
arXiv Detail & Related papers (2024-04-16T19:27:21Z)
- Vocabulary-free Image Classification [75.38039557783414]
We formalize a novel task, termed Vocabulary-free Image Classification (VIC).
VIC aims to assign to an input image a class that resides in an unconstrained language-induced semantic space, without the prerequisite of a known vocabulary.
CaSED is a method that exploits a pre-trained vision-language model and an external vision-language database to address VIC in a training-free manner.
arXiv Detail & Related papers (2023-06-01T17:19:43Z)
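Both VIC entries describe CaSED's training-free retrieve-and-score recipe only at a high level; the sketch below illustrates that shape with a toy caption database and random stand-in features, so the tokenization and scoring here are simplifications rather than CaSED's actual details.

```python
# Training-free candidate search in the spirit of CaSED: retrieve the
# captions closest to the image, pool candidate words from them, then
# score candidates against the image. Database and features are toys.
import numpy as np

rng = np.random.default_rng(0)

def l2n(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical external database: captions plus precomputed text features.
captions = ["a photo of a golden retriever", "a tabby cat on a sofa",
            "a retriever fetching a ball", "city skyline at night"]
caption_feats = l2n(rng.normal(size=(len(captions), 64)))
image_feat = l2n(rng.normal(size=64))

# 1) Retrieve the top-m captions by image-text similarity.
m = 2
top = np.argsort(caption_feats @ image_feat)[-m:]

# 2) Pool candidate class names from the retrieved captions (naive word
#    split; CaSED filters to content words).
candidates = sorted({w for i in top for w in captions[i].split() if len(w) > 3})

# 3) Score each candidate name against the image and pick the best.
cand_feats = l2n(rng.normal(size=(len(candidates), 64)))  # stand-in text features
best = candidates[int(np.argmax(cand_feats @ image_feat))]
print("predicted name:", best)
```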
- Learning to Name Classes for Vision and Language Models [57.0059455405424]
Large-scale vision and language models can achieve impressive zero-shot recognition performance by mapping class-specific text queries to image content.
We propose to leverage available data to learn, for each class, an optimal word embedding as a function of the visual content.
By learning new word embeddings on an otherwise frozen model, we are able to retain zero-shot capabilities for new classes, easily adapt models to new datasets, and adjust potentially erroneous, non-descriptive or ambiguous class names.
arXiv Detail & Related papers (2023-04-04T14:34:44Z)
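A minimal sketch of learning class-name embeddings against an otherwise frozen model: only the per-class embeddings receive gradients, and logits are scaled cosine similarities. The shapes, temperature, and optimizer are illustrative assumptions, and the paper additionally conditions the embeddings on visual content.

```python
# Only the class-name embeddings are optimized; the (stand-in) image
# features play the role of a frozen CLIP encoder's outputs.
import torch

torch.manual_seed(0)
num_classes, dim = 10, 64

feats = torch.nn.functional.normalize(torch.randn(128, dim), dim=-1)
labels = torch.randint(0, num_classes, (128,))

# Learnable replacement for the hand-written class-name embeddings.
class_emb = torch.nn.Parameter(torch.randn(num_classes, dim))
opt = torch.optim.Adam([class_emb], lr=1e-2)

for step in range(100):
    w = torch.nn.functional.normalize(class_emb, dim=-1)
    logits = 100.0 * feats @ w.t()       # CLIP-style temperature scaling
    loss = torch.nn.functional.cross_entropy(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()
print(f"final loss: {loss.item():.3f}")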
- Disambiguation of Company names via Deep Recurrent Networks [101.90357454833845]
We propose a Siamese LSTM Network approach to extract, via supervised learning, an embedding of company name strings.
We analyse how an Active Learning approach to prioritise the samples to be labelled leads to a more efficient overall learning pipeline.
arXiv Detail & Related papers (2023-03-07T15:07:57Z)
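A compact sketch of a Siamese LSTM over name strings with a contrastive objective, in the spirit of the entry above; the byte-level tokenizer, sizes, and toy pair are hypothetical, and the paper's exact architecture and loss may differ.

```python
# Shared (Siamese) LSTM encoder embeds two name strings; a contrastive
# loss pulls matching names together and pushes mismatches apart.
import torch

torch.manual_seed(0)

class NameEncoder(torch.nn.Module):
    def __init__(self, vocab_size=128, emb=32, hidden=64):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab_size, emb)
        self.lstm = torch.nn.LSTM(emb, hidden, batch_first=True)

    def forward(self, char_ids):             # (batch, seq_len)
        _, (h, _) = self.lstm(self.emb(char_ids))
        return torch.nn.functional.normalize(h[-1], dim=-1)

def contrastive_loss(a, b, same, margin=0.5):
    d = 1.0 - (a * b).sum(-1)                # cosine distance
    return torch.where(same, d, torch.clamp(margin - d, min=0.0)).mean()

enc = NameEncoder()
def ids(s):                                  # naive byte-level tokenizer
    return torch.tensor([[min(ord(c), 127) for c in s]])

a, b = enc(ids("acme corp")), enc(ids("acme corporation"))
loss = contrastive_loss(a, b, same=torch.tensor([True]))
loss.backward()
```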
- Exploiting Category Names for Few-Shot Classification with Vision-Language Models [78.51975804319149]
Vision-language foundation models pretrained on large-scale data provide a powerful tool for many visual understanding tasks.
This paper shows that we can significantly improve the performance of few-shot classification by using the category names to initialize the classification head.
arXiv Detail & Related papers (2022-11-29T21:08:46Z)
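The category-name initialization described above can be sketched as copying text embeddings of the class names into a linear head before few-shot fine-tuning; the prompt features below are random placeholders for encoded prompts like "a photo of a {class name}".

```python
# Initialize a linear classification head from (stand-in) text
# embeddings of the category names, then fine-tune on the support set.
import torch

torch.manual_seed(0)
num_classes, dim = 5, 64

# Stand-in for CLIP text features of the class-name prompts.
name_feats = torch.nn.functional.normalize(torch.randn(num_classes, dim), dim=-1)

head = torch.nn.Linear(dim, num_classes, bias=False)
with torch.no_grad():
    head.weight.copy_(name_feats)        # name-based init instead of random

# Few-shot fine-tuning proceeds as usual from this initialization.
support = torch.nn.functional.normalize(torch.randn(25, dim), dim=-1)
support_y = torch.arange(num_classes).repeat_interleave(5)   # 5-shot labels
opt = torch.optim.SGD(head.parameters(), lr=0.1)
for _ in range(20):
    loss = torch.nn.functional.cross_entropy(head(support), support_y)
    opt.zero_grad(); loss.backward(); opt.step()
```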
- Semantic-Enhanced Image Clustering [6.218389227248297]
We investigate the task of image clustering with the help of a vision-language pre-training model.
How to map images to a proper semantic space and how to cluster images from both image and semantic spaces are two key problems.
We propose a method that first maps the given images to a proper semantic space, along with efficient methods that generate pseudo-labels according to the relationships between images and semantics.
arXiv Detail & Related papers (2022-08-21T09:04:21Z)
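One plausible reading of the pseudo-labeling step is sketched below: each image takes its nearest semantic anchor as a pseudo-label, kept only when the assignment is confident. This is an illustrative sketch with stand-in features, not the paper's exact method.

```python
# Semantics-aware pseudo-labeling: nearest semantic anchor wins, filtered
# by the margin over the runner-up. All features are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)

def l2n(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

image_feats = l2n(rng.normal(size=(100, 64)))      # stand-in image features
semantic_anchors = l2n(rng.normal(size=(10, 64)))  # stand-in semantic features

sims = image_feats @ semantic_anchors.T
order = np.argsort(sims, axis=1)
pseudo = order[:, -1]                               # nearest anchor per image
margin = sims[np.arange(100), order[:, -1]] - sims[np.arange(100), order[:, -2]]
keep = margin > 0.05                                # confidence filter
print(f"kept {keep.sum()} / 100 pseudo-labeled images")
```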
- Improving Few-shot Learning with Weakly-supervised Object Localization [24.3569501375842]
We propose a novel framework that generates class representations by extracting features from class-relevant regions of the images.
Our method outperforms the baseline few-shot model on the miniImageNet and tieredImageNet benchmarks.
arXiv Detail & Related papers (2021-05-25T07:39:32Z)
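A rough sketch of building a class representation from class-relevant regions: a CAM-style relevance map weights the spatial feature grid before pooling, so background contributes less to the class prototype. The backbone features and class weights are random stand-ins, and the paper's localization module is more involved.

```python
# Attention-weighted pooling over a spatial feature map in place of
# plain global average pooling; illustrative stand-in tensors only.
import torch

torch.manual_seed(0)

# Stand-in backbone output for one support image: (channels, H, W).
feat_map = torch.randn(64, 7, 7)

# CAM-style relevance map from a (stand-in) class weight vector.
class_w = torch.randn(64)
cam = torch.einsum("c,chw->hw", class_w, feat_map)
cam = torch.softmax(cam.flatten(), dim=0).view(7, 7)   # normalize to weights

# Weighted pooling yields the class prototype.
prototype = torch.einsum("chw,hw->c", feat_map, cam)
prototype = torch.nn.functional.normalize(prototype, dim=0)
```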
This list is automatically generated from the titles and abstracts of the papers on this site.