Decoupling Zero-Shot Semantic Segmentation
- URL: http://arxiv.org/abs/2112.07910v1
- Date: Wed, 15 Dec 2021 06:21:47 GMT
- Title: Decoupling Zero-Shot Semantic Segmentation
- Authors: Jian Ding, Nan Xue, Gui-Song Xia, Dengxin Dai
- Abstract summary: Zero-shot semantic segmentation (ZS3) aims to segment novel categories that have not been seen during training.
We propose a simple and effective zero-shot semantic segmentation model, called ZegFormer.
- Score: 46.55494691004304
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Zero-shot semantic segmentation (ZS3) aims to segment novel
categories that have not been seen during training. Existing works formulate ZS3 as a
pixel-level zero-shot classification problem, and transfer semantic knowledge
from seen classes to unseen ones with the help of language models pre-trained
only with text. While simple, the pixel-level ZS3 formulation has limited
capability to integrate vision-language models, which are typically pre-trained
with image-text pairs and currently show great potential for vision tasks.
Inspired by the observation that humans often perform
segment-level semantic labeling, we propose to decouple ZS3 into two
sub-tasks: 1) a class-agnostic grouping task that groups pixels into segments,
and 2) a zero-shot classification task on those segments. The former sub-task
involves no category information and can be directly transferred to group
pixels of unseen classes. The latter sub-task operates at the segment level and
provides a natural way to leverage large-scale vision-language models
pre-trained with image-text pairs (e.g., CLIP) for ZS3. Based on this
decoupled formulation, we
propose a simple and effective zero-shot semantic segmentation model, called
ZegFormer, which outperforms previous methods on standard ZS3 benchmarks by
large margins, e.g., by 35 mIoU points on PASCAL VOC and 3 mIoU points on
COCO-Stuff for unseen classes. Code will be released at
https://github.com/dingjiansw101/ZegFormer.
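To make the decoupled formulation concrete, below is a minimal sketch of the two sub-tasks using the open-source CLIP package: a class-agnostic grouping step followed by segment-level zero-shot classification. The `mask_generator` and `crop_to_mask` helpers are hypothetical stand-ins for a grouping model and a segment-cropping utility; this illustrates the formulation only, not the ZegFormer implementation.

```python
# Minimal sketch of the decoupled ZS3 pipeline (an illustration, not the
# ZegFormer code). Step 1 groups pixels into class-agnostic segments;
# step 2 zero-shot classifies each segment with CLIP text embeddings.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Seen and unseen category names; unseen classes need no training images.
class_names = ["cat", "dog", "grass", "sky"]
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

image = Image.open("example.jpg")
# Hypothetical class-agnostic grouping model: returns binary HxW masks.
segments = mask_generator(image)

with torch.no_grad():
    text_feats = model.encode_text(prompts)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    for mask in segments:
        crop = crop_to_mask(image, mask)  # hypothetical helper: crop to bbox
        img_feat = model.encode_image(preprocess(crop).unsqueeze(0).to(device))
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        probs = (100.0 * img_feat @ text_feats.T).softmax(dim=-1)
        print(class_names[probs.argmax().item()])  # label for this segment
```

Classifying whole segments rather than individual pixels is what lets an image-level model like CLIP be reused for segmentation without pixel-level retraining.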
Related papers
- Exploring Open-Vocabulary Semantic Segmentation without Human Labels [76.15862573035565]
We present ZeroSeg, a novel method that leverages existing pretrained vision-language (VL) models to train semantic segmentation models without human labels.
ZeroSeg distills the visual concepts learned by VL models into a set of segment tokens, each summarizing a localized region of the target image.
Our approach achieves state-of-the-art performance compared with other zero-shot segmentation methods trained on the same data.
arXiv Detail & Related papers (2023-06-01T08:47:06Z)
- A Language-Guided Benchmark for Weakly Supervised Open Vocabulary Semantic Segmentation [10.054960979867584]
We propose a novel weakly supervised OVSS pipeline that can perform zero-shot (ZSS), few-shot (FSS), and cross-dataset segmentation on novel classes.
The proposed pipeline beats existing methods for weak generalized zero-shot and weak few-shot semantic segmentation by 39 and 3 mIoU points, respectively.
arXiv Detail & Related papers (2023-02-27T21:55:48Z)
- Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision [49.905448429974804]
We consider the problem of open-vocabulary semantic segmentation (OVS), which aims to segment objects of arbitrary classes instead of pre-defined, closed-set categories.
We propose a transformer-based model for OVS, termed OVSegmentor, which exploits web-crawled image-text pairs for pre-training.
Our model achieves superior segmentation results over the state-of-the-art method while using only 3% of the data (4M vs. 134M image-text pairs) for pre-training.
arXiv Detail & Related papers (2023-01-22T13:10:05Z)
- Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features that are visible to the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
arXiv Detail & Related papers (2022-12-05T16:24:29Z)
- A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target zero-shot semantic segmentation, building it on an off-the-shelf pre-trained vision-language model, i.e., CLIP.
Our experimental results show that this simple framework surpasses the previous state of the art by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)
- Generative Zero-Shot Learning for Semantic Segmentation of 3D Point Cloud [79.99653758293277]
We present the first generative approach for both Zero-Shot Learning (ZSL) and Generalized ZSL (GZSL) on 3D data.
We show that it reaches or outperforms the state of the art on ModelNet40 classification for both inductive ZSL and inductive GZSL.
Our experiments show that our method outperforms strong baselines, which we additionally propose for this task.
arXiv Detail & Related papers (2021-08-13T13:29:27Z)
- From Pixel to Patch: Synthesize Context-aware Features for Zero-shot Semantic Segmentation [22.88452754438478]
We focus on zero-shot semantic segmentation, which aims to segment unseen objects with only category-level semantic representations.
We propose a novel Context-aware feature Generation Network (CaGNet), which can synthesize context-aware pixel-wise visual features for unseen categories (a simplified sketch of this idea follows the list).
Experimental results on Pascal-VOC, Pascal-Context, and COCO-Stuff show that our method significantly outperforms existing zero-shot semantic segmentation methods.
arXiv Detail & Related papers (2020-09-25T13:26:30Z)
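As a contrast to the CLIP-based classification above, generative approaches such as CaGNet synthesize visual features for unseen classes instead. Below is a heavily simplified sketch of that idea under assumed dimensions; the real CaGNet conditions the generator on pixel-wise context encodings rather than plain noise.

```python
# Simplified sketch of generative feature synthesis for unseen classes
# (not the CaGNet implementation; dimensions are assumptions).
import torch
import torch.nn as nn

EMB, NOISE, FEAT = 300, 32, 256   # word-embedding, noise, visual-feature dims

# Generator: semantic representation + noise -> synthetic visual feature.
generator = nn.Sequential(
    nn.Linear(EMB + NOISE, 512),
    nn.LeakyReLU(0.2),
    nn.Linear(512, FEAT),
)

unseen_embs = torch.randn(4, EMB)   # word embeddings of 4 unseen classes
z = torch.randn(4, NOISE)           # noise; CaGNet conditions on context here
fake_feats = generator(torch.cat([unseen_embs, z], dim=1))

# The synthetic features can supervise a pixel-level classifier so that it
# also covers the unseen classes at test time.
classifier = nn.Linear(FEAT, 4)
loss = nn.functional.cross_entropy(classifier(fake_feats), torch.arange(4))
loss.backward()
```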
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.