Vocabulary-free Image Classification
- URL: http://arxiv.org/abs/2306.00917v3
- Date: Fri, 12 Jan 2024 15:34:55 GMT
- Title: Vocabulary-free Image Classification
- Authors: Alessandro Conti, Enrico Fini, Massimiliano Mancini, Paolo Rota,
Yiming Wang, Elisa Ricci
- Abstract summary: We formalize a novel task, termed Vocabulary-free Image Classification (VIC).
VIC aims to assign to an input image a class that resides in an unconstrained language-induced semantic space, without the prerequisite of a known vocabulary.
CaSED is a method that exploits a pre-trained vision-language model and an external vision-language database to address VIC in a training-free manner.
- Score: 75.38039557783414
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent advances in large vision-language models have revolutionized the image
classification paradigm. Despite showing impressive zero-shot capabilities, a
pre-defined set of categories, a.k.a. the vocabulary, is assumed at test time
for composing the textual prompts. However, such an assumption can be impractical
when the semantic context is unknown and evolving. We thus formalize a novel
task, termed Vocabulary-free Image Classification (VIC), where we aim to
assign to an input image a class that resides in an unconstrained
language-induced semantic space, without the prerequisite of a known
vocabulary. VIC is a challenging task as the semantic space is extremely large,
containing millions of concepts, with hard-to-discriminate fine-grained
categories. In this work, we first empirically verify that representing this
semantic space by means of an external vision-language database is the most
effective way to obtain semantically relevant content for classifying the
image. We then propose Category Search from External Databases (CaSED), a
method that exploits a pre-trained vision-language model and an external
vision-language database to address VIC in a training-free manner. CaSED first
extracts a set of candidate categories from captions retrieved from the
database based on their semantic similarity to the image, and then assigns to
the image the best matching candidate category according to the same
vision-language model. Experiments on benchmark datasets validate that CaSED
outperforms other complex vision-language frameworks while being more efficient
and using far fewer parameters, paving the way for future research in this
direction.
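The retrieve-then-classify pipeline described in the abstract can be illustrated with a few lines of code. The snippet below is only a minimal sketch, not the authors' implementation: the tiny caption database, the naive candidate extraction, and the `embed_image`/`embed_text` placeholders (standing in for the two towers of a pre-trained vision-language model such as CLIP) are all assumptions made for illustration.
```python
# Minimal sketch of the retrieve-then-classify idea behind CaSED (illustrative,
# not the authors' code). `embed_text` / `embed_image` stand in for a pre-trained
# vision-language model; `caption_db` stands in for the external database.
import re
from collections import Counter

import numpy as np

def embed_text(text):
    # Placeholder text encoder; replace with the VLM's text tower.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)

def embed_image(image):
    # Placeholder image encoder; replace with the VLM's image tower.
    v = np.random.default_rng(0).normal(size=512)
    return v / np.linalg.norm(v)

caption_db = [
    "a tabby cat sleeping on a sofa",
    "a golden retriever puppy playing in the park",
    "a red sports car parked on the street",
]
caption_embs = np.stack([embed_text(c) for c in caption_db])

def classify(image, k=2):
    """Retrieve the k captions most similar to the image, extract candidate
    category names from them, then pick the candidate that best matches the
    image according to the same model."""
    img = embed_image(image)
    top_k = np.argsort(caption_embs @ img)[::-1][:k]
    words = Counter(w for i in top_k
                    for w in re.findall(r"[a-z]+", caption_db[i].lower()) if len(w) > 3)
    candidates = [w for w, _ in words.most_common(10)]
    scores = {c: float(embed_text(f"a photo of a {c}") @ img) for c in candidates}
    return max(scores, key=scores.get)

print(classify(image=None))
```
The real method operates over a large-scale caption database and uses more careful candidate filtering and scoring than this toy version; only the overall structure (retrieve captions, extract candidates, re-score with the same model) is meant to carry over.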
Related papers
- Vocabulary-free Image Classification and Semantic Segmentation [71.78089106671581]
We introduce the Vocabulary-free Image Classification (VIC) task, which aims to assign a class from an unconstrained language-induced semantic space to an input image without needing a known vocabulary.
VIC is challenging due to the vastness of the semantic space, which contains millions of concepts, including fine-grained categories.
We propose Category Search from External Databases (CaSED), a training-free method that leverages a pre-trained vision-language model and an external database.
arXiv Detail & Related papers (2024-04-16T19:27:21Z)
- Rewrite Caption Semantics: Bridging Semantic Gaps for Language-Supervised Semantic Segmentation [100.81837601210597]
We propose Concept Curation (CoCu) to bridge the gap between visual and textual semantics in pre-training data.
CoCu achieves strong zero-shot transfer performance and improves the language-supervised segmentation baseline by a large margin.
arXiv Detail & Related papers (2023-09-24T00:05:39Z)
- Cross-Modal Concept Learning and Inference for Vision-Language Models [31.463771883036607]
In existing fine-tuning methods, the class-specific text description is matched against the whole image.
We develop a new method called cross-modal concept learning and inference (CCLI).
Our method automatically learns a large set of distinctive visual concepts from images using a set of semantic text concepts.
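As a rough, hypothetical illustration of concept-based matching (not the authors' CCLI code), one can score an image embedding against a bank of text-concept embeddings and pool the concept activations per class; the concepts, the concept-to-class map, and the random embeddings below are all stand-ins.
```python
# Illustrative concept-matching sketch (hypothetical, not the CCLI implementation).
import numpy as np

rng = np.random.default_rng(0)
concepts = ["striped fur", "whiskers", "floppy ears", "wagging tail"]
class_concepts = {"cat": [0, 1], "dog": [2, 3]}        # hypothetical concept-to-class map

concept_embs = rng.normal(size=(len(concepts), 512))    # placeholder text-concept embeddings
concept_embs /= np.linalg.norm(concept_embs, axis=1, keepdims=True)

image_emb = rng.normal(size=512)                        # placeholder image embedding
image_emb /= np.linalg.norm(image_emb)

activations = concept_embs @ image_emb                  # similarity of the image to each concept
class_scores = {c: float(activations[idx].mean()) for c, idx in class_concepts.items()}
print(max(class_scores, key=class_scores.get))
```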
arXiv Detail & Related papers (2023-07-28T10:26:28Z)
- Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z)
- What's in a Name? Beyond Class Indices for Image Recognition [28.02490526407716]
We propose a vision-language model that assigns class names to images given only a large (essentially unconstrained) vocabulary of categories as prior information.
We leverage non-parametric methods to establish meaningful relationships between images, allowing the model to automatically narrow down the pool of candidate names.
Our method leads to a roughly 50% improvement over the baseline on ImageNet in the unsupervised setting.
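A hedged sketch of one way such non-parametric narrowing could look: refine each image's vocabulary scores by averaging them over its nearest neighbours in an image-embedding space. The random embeddings and the tiny vocabulary below are placeholders, not the paper's actual setup.
```python
# Toy sketch (not the paper's method): refine per-image candidate names by
# averaging vocabulary scores over nearest neighbours in image-embedding space.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["tabby cat", "siamese cat", "beagle", "poodle"]   # stand-in vocabulary

img_embs = rng.normal(size=(8, 64))                        # placeholder image embeddings
img_embs /= np.linalg.norm(img_embs, axis=1, keepdims=True)
name_embs = rng.normal(size=(len(vocab), 64))              # placeholder text embeddings
name_embs /= np.linalg.norm(name_embs, axis=1, keepdims=True)

scores = img_embs @ name_embs.T                            # per-image name scores
sims = img_embs @ img_embs.T                               # image-image similarity
k = 3
for i in range(len(img_embs)):
    neighbours = np.argsort(sims[i])[::-1][:k]             # includes the image itself
    pooled = scores[neighbours].mean(axis=0)               # neighbour consensus
    print(i, vocab[int(pooled.argmax())])
```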
arXiv Detail & Related papers (2023-04-05T11:01:23Z)
- Natural Scene Image Annotation Using Local Semantic Concepts and Spatial Bag of Visual Words [0.0]
This paper introduces a framework for automatically annotating natural scene images with local semantic labels from a predefined vocabulary.
The framework is based on the hypothesis that, in natural scenes, intermediate semantic concepts are correlated with local keypoints.
Based on this hypothesis, image regions can be efficiently represented by a bag-of-visual-words (BOW) model, and a machine learning approach, such as an SVM, can be used to label image regions with semantic annotations.
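A compact sketch of such a bag-of-visual-words pipeline with scikit-learn follows; the random arrays stand in for real local descriptors (e.g. SIFT) and the labels are toy local semantic concepts.
```python
# Sketch of a bag-of-visual-words + SVM pipeline (illustrative; random arrays
# stand in for real local descriptors such as SIFT).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_regions, n_desc, dim, n_words = 40, 30, 128, 16

# Placeholder local descriptors for each image region, plus a toy label.
regions = [rng.normal(size=(n_desc, dim)) for _ in range(n_regions)]
labels = rng.integers(0, 3, size=n_regions)          # e.g. sky / grass / water

# 1) Learn the visual vocabulary from all descriptors.
kmeans = KMeans(n_clusters=n_words, n_init=10, random_state=0)
kmeans.fit(np.vstack(regions))

# 2) Represent each region as a normalized histogram of visual words.
def bow_histogram(desc):
    words = kmeans.predict(desc)
    hist = np.bincount(words, minlength=n_words).astype(float)
    return hist / hist.sum()

X = np.stack([bow_histogram(r) for r in regions])

# 3) Train an SVM to map histograms to local semantic labels.
clf = LinearSVC().fit(X, labels)
print(clf.predict(X[:5]))
```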
arXiv Detail & Related papers (2022-10-17T12:57:51Z)
- A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target zero-shot semantic segmentation by building on an off-the-shelf pre-trained vision-language model, i.e., CLIP.
Our experimental results show that this simple framework surpasses previous state-of-the-art methods by a large margin.
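The underlying idea can be caricatured as classifying class-agnostic region proposals against CLIP-style text embeddings; the snippet below is a minimal sketch with placeholder embeddings, not the paper's actual framework.
```python
# Minimal sketch (not the paper's framework): label class-agnostic region
# proposals with a CLIP-like joint embedding space to get zero-shot segments.
import numpy as np

rng = np.random.default_rng(0)
class_names = ["road", "building", "tree", "sky"]

# Placeholder embeddings; in practice these come from CLIP's text and image towers.
text_embs = rng.normal(size=(len(class_names), 512))
text_embs /= np.linalg.norm(text_embs, axis=1, keepdims=True)

region_embs = rng.normal(size=(5, 512))              # one embedding per mask proposal
region_embs /= np.linalg.norm(region_embs, axis=1, keepdims=True)

# Assign every region the best matching (possibly unseen) class name.
logits = region_embs @ text_embs.T
for r, c in enumerate(logits.argmax(axis=1)):
    print(f"region {r} -> {class_names[c]}")
```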
arXiv Detail & Related papers (2021-12-29T18:56:18Z)
- Deep Semantic Dictionary Learning for Multi-label Image Classification [3.3989824361632337]
We present an innovative approach to multi-label image classification that treats it as a dictionary learning task.
A novel end-to-end model named Deep Semantic Dictionary Learning (DSDL) is designed.
Our code and models have been released.
arXiv Detail & Related papers (2020-12-23T06:22:47Z)
- Learning Representations by Predicting Bags of Visual Words [55.332200948110895]
Self-supervised representation learning aims to learn convnet-based image representations from unlabeled data.
Inspired by the success of NLP methods in this area, in this work we propose a self-supervised approach based on spatially dense image descriptions.
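A hedged PyTorch sketch of the general idea (not the authors' setup): a small convnet is trained so that, from a perturbed image, it predicts a visual-word histogram computed on the original image by a fixed feature extractor; the histogram generator below is a placeholder.
```python
# Sketch of "predict the bag of visual words" self-supervision (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

n_words = 64
encoder = nn.Sequential(                 # stand-in for the trainable convnet
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, n_words),
)
opt = torch.optim.SGD(encoder.parameters(), lr=0.1)

def bow_target(img):
    # Placeholder for: quantize fixed pre-trained features into visual words
    # and build a normalized histogram over them.
    hist = torch.rand(img.shape[0], n_words)
    return hist / hist.sum(dim=1, keepdim=True)

images = torch.rand(8, 3, 32, 32)
perturbed = images + 0.1 * torch.randn_like(images)    # stand-in augmentation

target = bow_target(images)                            # soft BoW distribution
pred = F.log_softmax(encoder(perturbed), dim=1)
loss = F.kl_div(pred, target, reduction="batchmean")   # match predicted and true BoW
loss.backward()
opt.step()
print(float(loss))
```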
arXiv Detail & Related papers (2020-02-27T16:45:25Z)