IFSeg: Image-free Semantic Segmentation via Vision-Language Model
- URL: http://arxiv.org/abs/2303.14396v1
- Date: Sat, 25 Mar 2023 08:19:31 GMT
- Title: IFSeg: Image-free Semantic Segmentation via Vision-Language Model
- Authors: Sukmin Yun, Seong Hyeon Park, Paul Hongsuck Seo, Jinwoo Shin
- Abstract summary: We introduce a novel image-free segmentation task where the goal is to perform semantic segmentation given only a set of the target semantic categories.
We construct this artificial training data by creating a 2D map of random semantic categories and another map of their corresponding word tokens.
Our model not only establishes an effective baseline for this novel task but also demonstrates strong performances compared to existing methods.
- Score: 67.62922228676273
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language (VL) pre-training has recently gained much attention for its
transferability and flexibility in novel concepts (e.g., cross-modality
transfer) across various visual tasks. However, VL-driven segmentation has been
under-explored, and the existing approaches still have the burden of acquiring
additional training images or even segmentation annotations to adapt a VL model
to downstream segmentation tasks. In this paper, we introduce a novel
image-free segmentation task where the goal is to perform semantic segmentation
given only a set of the target semantic categories, but without any
task-specific images and annotations. To tackle this challenging task, our
proposed method, coined IFSeg, generates VL-driven artificial
image-segmentation pairs and updates a pre-trained VL model to a segmentation
task. We construct this artificial training data by creating a 2D map of random
semantic categories and another map of their corresponding word tokens. Given
that a pre-trained VL model projects visual and text tokens into a common space
where tokens that share the semantics are located closely, this artificially
generated word map can replace the real image inputs for such a VL model.
Through an extensive set of experiments, our model not only establishes an
effective baseline for this novel task but also demonstrates strong
performances compared to existing methods that rely on stronger supervision,
such as task-specific images and segmentation masks. Code is available at
https://github.com/alinlab/ifseg.
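To make the data construction described in the abstract concrete, below is a minimal, hedged PyTorch sketch of how such an artificial image-segmentation pair could be assembled. The function name build_artificial_pair, the tensor shapes, and the upsampling step are illustrative assumptions rather than the authors' exact implementation (the official code is at https://github.com/alinlab/ifseg).
```python
import torch
import torch.nn.functional as F

def build_artificial_pair(category_embeddings: torch.Tensor,
                          grid_size: int = 8,
                          upsample: int = 4):
    """category_embeddings: (C, D) word-token embeddings of the target categories.

    Illustrative sketch only; names and shapes are assumptions, not IFSeg's API.
    """
    num_classes, dim = category_embeddings.shape

    # 1) Random 2D map of semantic category indices (the artificial ground truth).
    label_map = torch.randint(num_classes, (grid_size, grid_size))

    # 2) Map of the corresponding word-token embeddings: because the VL model
    #    embeds visual and text tokens in a shared space, this word map can
    #    stand in for the visual tokens of a real image.
    token_map = category_embeddings[label_map]  # (H, W, D)

    # 3) Upsample both maps (nearest neighbour keeps labels hard) so each random
    #    category spans several tokens, roughly mimicking object extent.
    label_map = F.interpolate(label_map[None, None].float(),
                              scale_factor=upsample, mode="nearest")[0, 0].long()
    token_map = F.interpolate(token_map.permute(2, 0, 1)[None],
                              scale_factor=upsample, mode="nearest")[0].permute(1, 2, 0)

    return token_map, label_map  # pseudo "image" tokens and their segmentation labels
```
In this sketch, the returned token map would be fed to the VL model in place of visual tokens from a real image, while the label map supplies the corresponding segmentation targets for updating the model.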
Related papers
- Vocabulary-free Image Classification and Semantic Segmentation [71.78089106671581]
We introduce the Vocabulary-free Image Classification (VIC) task, which aims to assign to an input image a class from an unconstrained, language-induced semantic space, without needing a known vocabulary.
VIC is challenging due to the vastness of the semantic space, which contains millions of concepts, including fine-grained categories.
We propose Category Search from External Databases (CaSED), a training-free method that leverages a pre-trained vision-language model and an external database.
arXiv Detail & Related papers (2024-04-16T19:27:21Z)
- Shatter and Gather: Learning Referring Image Segmentation with Text Supervision [52.46081425504072]
We present a new model that discovers semantic entities in input image and then combines such entities relevant to text query to predict the mask of the referent.
Our method was evaluated on four public benchmarks for referring image segmentation, where it clearly outperformed existing methods for the same task as well as recent open-vocabulary segmentation models on all benchmarks.
arXiv Detail & Related papers (2023-08-29T15:39:15Z)
- Exploring Open-Vocabulary Semantic Segmentation without Human Labels [76.15862573035565]
We present ZeroSeg, a novel method that leverages existing pre-trained vision-language (VL) models to train semantic segmentation models.
ZeroSeg distills the visual concepts learned by VL models into a set of segment tokens, each summarizing a localized region of the target image.
Our approach achieves state-of-the-art performance compared to other zero-shot segmentation methods given the same training data.
arXiv Detail & Related papers (2023-06-01T08:47:06Z)
- Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features that are visible to the query patches.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
arXiv Detail & Related papers (2022-12-05T16:24:29Z)
- Self-supervised Pre-training for Semantic Segmentation in an Indoor Scene [8.357801312689622]
We propose RegConsist, a method for self-supervised pre-training of a semantic segmentation model.
We use a variant of contrastive learning to train a DCNN model for predicting semantic segmentation from RGB views in the target environment.
The proposed method outperforms models pre-trained on ImageNet and performs competitively against models trained for exactly the same task but on a different dataset.
arXiv Detail & Related papers (2022-10-04T20:10:14Z)
- Group-Wise Semantic Mining for Weakly Supervised Semantic Segmentation [49.90178055521207]
This work addresses weakly supervised semantic segmentation (WSSS), with the goal of bridging the gap between image-level annotations and pixel-level segmentation.
We formulate WSSS as a novel group-wise learning task that explicitly models semantic dependencies in a group of images to estimate more reliable pseudo ground-truths.
In particular, we devise a graph neural network (GNN) for group-wise semantic mining, wherein input images are represented as graph nodes.
arXiv Detail & Related papers (2020-12-09T12:40:13Z)