Vocabulary-free Image Classification and Semantic Segmentation
- URL: http://arxiv.org/abs/2404.10864v1
- Date: Tue, 16 Apr 2024 19:27:21 GMT
- Title: Vocabulary-free Image Classification and Semantic Segmentation
- Authors: Alessandro Conti, Enrico Fini, Massimiliano Mancini, Paolo Rota, Yiming Wang, Elisa Ricci
- Abstract summary: We introduce the Vocabulary-free Image Classification (VIC) task, which aims to assign a class from an unconstrained language-induced semantic space to an input image without needing a known vocabulary.
VIC is challenging due to the vastness of the semantic space, which contains millions of concepts, including fine-grained categories.
We propose Category Search from External Databases (CaSED), a training-free method that leverages a pre-trained vision-language model and an external database.
- Score: 71.78089106671581
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large vision-language models revolutionized image classification and semantic segmentation paradigms. However, they typically assume a pre-defined set of categories, or vocabulary, at test time for composing textual prompts. This assumption is impractical in scenarios with unknown or evolving semantic context. Here, we address this issue and introduce the Vocabulary-free Image Classification (VIC) task, which aims to assign a class from an unconstrained language-induced semantic space to an input image without needing a known vocabulary. VIC is challenging due to the vastness of the semantic space, which contains millions of concepts, including fine-grained categories. To address VIC, we propose Category Search from External Databases (CaSED), a training-free method that leverages a pre-trained vision-language model and an external database. CaSED first extracts the set of candidate categories from the most semantically similar captions in the database and then assigns the image to the best-matching candidate category according to the same vision-language model. Furthermore, we demonstrate that CaSED can be applied locally to generate a coarse segmentation mask that classifies image regions, introducing the task of Vocabulary-free Semantic Segmentation. CaSED and its variants outperform other, more complex vision-language models on classification and semantic segmentation benchmarks while using far fewer parameters.
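The two-step procedure described in the abstract (retrieve the most similar captions, then pick the best-matching candidate category) can be illustrated with a minimal sketch. This is not the authors' implementation: the 3-D word vectors, the `embed_text` toy encoder, the `known_words` filter, and the `classify_without_vocabulary` function are all hypothetical stand-ins for a real pre-trained vision-language model and caption database, and the scoring here uses only image-to-text similarity.

```python
import numpy as np

# Hypothetical 3-D "embedding space" standing in for a real vision-language model.
TOY_WORD_VECS = {
    "dog": np.array([1.0, 0.0, 0.0]),
    "puppy": np.array([0.8, 0.2, 0.0]),
    "cat": np.array([0.0, 1.0, 0.0]),
    "car": np.array([0.0, 0.0, 1.0]),
    "grass": np.array([0.3, 0.3, 0.1]),
    "street": np.array([0.0, 0.1, 0.8]),
}


def known_words(text):
    """Keep only words the toy encoder knows (a crude stand-in for candidate filtering)."""
    return [w for w in text.lower().split() if w in TOY_WORD_VECS]


def embed_text(text):
    """Toy text encoder (assumption): unit-norm mean of the known word vectors."""
    vec = np.mean([TOY_WORD_VECS[w] for w in known_words(text)], axis=0)
    return vec / np.linalg.norm(vec)


def classify_without_vocabulary(image_emb, captions, top_k=2):
    """Return (best candidate, all candidates) for one image embedding."""
    image_emb = image_emb / np.linalg.norm(image_emb)

    # Step 1: retrieve the captions most similar to the image (cosine similarity).
    caption_embs = np.stack([embed_text(c) for c in captions])
    retrieved = np.argsort(-(caption_embs @ image_emb))[:top_k]

    # Step 2: extract candidate category names from the retrieved captions.
    candidates = sorted({w for i in retrieved for w in known_words(captions[i])})

    # Step 3: score each candidate against the image with the same encoder.
    scores = np.array([embed_text(c) @ image_emb for c in candidates])
    return candidates[int(np.argmax(scores))], candidates


# Tiny made-up "external database" of captions and a dog-like image embedding.
captions = ["a dog on grass", "a puppy on grass", "a cat on grass", "a car on street"]
image_emb = np.array([1.0, 0.0, 0.0])
best, candidates = classify_without_vocabulary(image_emb, captions)
print(best, candidates)
```

In a realistic setting the toy vectors would be replaced by image and text embeddings from the same pre-trained model, and the caption list by a large external database with an approximate nearest-neighbor index. Applying the same function to embeddings of image regions, rather than the whole image, would give a rough analogue of the coarse segmentation use mentioned in the abstract.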
Related papers
- SemPLeS: Semantic Prompt Learning for Weakly-Supervised Semantic Segmentation [36.41778553250247]
Weakly-Supervised Semantic Segmentation (WSSS) aims to train segmentation models using image data with only image-level supervision.
We propose a Semantic Prompt Learning for WSSS (SemPLeS) framework, which learns to effectively prompt the CLIP latent space.
SemPLeS can perform better semantic alignment between object regions and the associated class labels.
arXiv Detail & Related papers (2024-01-22T09:41:05Z)
- Auto-Vocabulary Semantic Segmentation [13.410217680999462]
We introduce Auto-Vocabulary Semantic Segmentation (AVS), advancing open-ended image understanding.
Our framework autonomously identifies relevant class names using enhanced BLIP embeddings.
Our method sets new benchmarks for AVS on datasets such as PASCAL VOC, PASCAL Context, ADE20K, and Cityscapes.
arXiv Detail & Related papers (2023-12-07T18:55:52Z)
- Leveraging multilingual transfer for unsupervised semantic acoustic word embeddings [23.822788597966646]
Acoustic word embeddings (AWEs) are fixed-dimensional vector representations of speech segments that encode phonetic content.
In this paper we explore semantic AWE modelling.
We show -- for the first time -- that AWEs can be used for downstream semantic query-by-example search.
arXiv Detail & Related papers (2023-07-05T07:46:54Z)
- Vocabulary-free Image Classification [75.38039557783414]
We formalize a novel task, termed Vocabulary-free Image Classification (VIC).
VIC aims to assign to an input image a class that resides in an unconstrained language-induced semantic space, without the prerequisite of a known vocabulary.
CaSED is a method that exploits a pre-trained vision-language model and an external vision-language database to address VIC in a training-free manner.
arXiv Detail & Related papers (2023-06-01T17:19:43Z)
- Exploring Open-Vocabulary Semantic Segmentation without Human Labels [76.15862573035565]
We present ZeroSeg, a novel method that leverages the existing pretrained vision-language model (VL) to train semantic segmentation models.
ZeroSeg overcomes this by distilling the visual concepts learned by VL models into a set of segment tokens, each summarizing a localized region of the target image.
Our approach achieves state-of-the-art performance when compared to other zero-shot segmentation methods under the same training data.
arXiv Detail & Related papers (2023-06-01T08:47:06Z)
- IFSeg: Image-free Semantic Segmentation via Vision-Language Model [67.62922228676273]
We introduce a novel image-free segmentation task where the goal is to perform semantic segmentation given only a set of the target semantic categories.
We construct this artificial training data by creating a 2D map of random semantic categories and another map of their corresponding word tokens.
Our model not only establishes an effective baseline for this novel task but also demonstrates strong performances compared to existing methods.
arXiv Detail & Related papers (2023-03-25T08:19:31Z)
- Natural Scene Image Annotation Using Local Semantic Concepts and Spatial Bag of Visual Words [0.0]
This paper introduces a framework for automatically annotating natural scene images with local semantic labels from a predefined vocabulary.
The framework is based on a hypothesis that assumes that, in natural scenes, intermediate semantic concepts are correlated with the local keypoints.
Based on this hypothesis, image regions can be efficiently represented by a bag-of-visual-words (BOW) model and labeled with semantic annotations using a machine learning approach such as an SVM.
arXiv Detail & Related papers (2022-10-17T12:57:51Z)
- Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding [95.78002228538841]
We propose a new open-world semantic segmentation pipeline that makes the first attempt to learn to segment semantic objects of various open-world categories without any dense annotation effort.
Our method can directly segment objects of arbitrary categories, outperforming zero-shot segmentation methods that require data labeling on three benchmark datasets.
arXiv Detail & Related papers (2022-07-18T09:20:04Z)
- A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target zero-shot semantic segmentation by building on an off-the-shelf pre-trained vision-language model, i.e., CLIP.
Our experimental results show that this simple framework surpasses previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)