Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP
- URL: http://arxiv.org/abs/2210.04150v3
- Date: Sat, 1 Apr 2023 19:00:47 GMT
- Title: Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP
- Authors: Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang
Zhang, Peizhao Zhang, Peter Vajda, Diana Marculescu
- Abstract summary: Open-vocabulary semantic segmentation aims to segment an image into semantic regions according to text descriptions, which may not have been seen during training.
Recent two-stage methods first generate class-agnostic mask proposals and then leverage pre-trained vision-language models, e.g., CLIP, to classify masked regions.
We propose to finetune CLIP on a collection of masked image regions and their corresponding text descriptions.
In particular, when trained on COCO and evaluated on ADE20K-150, our best model achieves 29.6% mIoU, which is +8.5% higher than the previous state-of-the-art.
- Score: 45.81698881151867
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Open-vocabulary semantic segmentation aims to segment an image into semantic
regions according to text descriptions, which may not have been seen during
training. Recent two-stage methods first generate class-agnostic mask proposals
and then leverage pre-trained vision-language models, e.g., CLIP, to classify
masked regions. We identify the performance bottleneck of this paradigm to be
the pre-trained CLIP model, since it does not perform well on masked images. To
address this, we propose to finetune CLIP on a collection of masked image
regions and their corresponding text descriptions. We collect training data by
mining an existing image-caption dataset (e.g., COCO Captions), using CLIP to
match masked image regions to nouns in the image captions. Compared with the
more precise and manually annotated segmentation labels with fixed classes
(e.g., COCO-Stuff), we find our noisy but diverse dataset can better retain
CLIP's generalization ability. Along with finetuning the entire model, we
utilize the "blank" areas in masked images using a method we dub mask prompt
tuning. Experiments demonstrate mask prompt tuning brings significant
improvement without modifying any weights of CLIP, and it can further improve a
fully finetuned model. In particular, when trained on COCO and evaluated on
ADE20K-150, our best model achieves 29.6% mIoU, which is +8.5% higher than the
previous state-of-the-art. For the first time, open-vocabulary generalist
models match the performance of supervised specialist models in 2017 without
dataset-specific adaptations.
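The two-stage pipeline the abstract describes can be illustrated with a short sketch. The snippet below is not the authors' released code: it assumes the OpenAI `clip` package, zero-filled background pixels, and a generic "a photo of a ..." prompt template, all of which are illustrative choices.

```python
# Minimal sketch of the second stage: classify one class-agnostic mask
# proposal with a frozen CLIP model. Illustrative only, not OVSeg's code.
import numpy as np
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def classify_masked_region(image: Image.Image, mask: np.ndarray, class_names):
    """Blank out pixels outside the mask proposal, then score it against text."""
    pixels = np.array(image.convert("RGB"))
    pixels[mask == 0] = 0  # the "blank" area; mask prompt tuning targets these patches
    masked = preprocess(Image.fromarray(pixels)).unsqueeze(0).to(device)

    prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(masked)
        txt_feat = model.encode_text(prompts)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        logits = 100.0 * img_feat @ txt_feat.T  # temperature-scaled cosine similarity
    return logits.softmax(dim=-1).squeeze(0)    # per-class probabilities for this mask
```

Mask prompt tuning, as described above, goes one step further: rather than leaving the blanked patches as zeros, the patch tokens at those positions are replaced with learnable prompt vectors, and only those vectors are trained, so none of CLIP's weights change.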
Related papers
- MaskInversion: Localized Embeddings via Optimization of Explainability Maps [49.50785637749757]
MaskInversion generates a context-aware embedding for a query image region specified by a mask at test time.
It can be used for a broad range of tasks, including open-vocabulary class retrieval, referring expression comprehension, as well as for localized captioning and image generation.
arXiv Detail & Related papers (2024-07-29T14:21:07Z)
- Region-Adaptive Transform with Segmentation Prior for Image Compression [105.17604572081177]
We introduce class-agnostic segmentation masks for extracting region-adaptive contextual information.
Our proposed module, Region-Adaptive Transform, applies adaptive convolutions on different regions guided by the masks.
We also introduce a plug-and-play module named Affine Layer to incorporate rich contexts from various regions.
arXiv Detail & Related papers (2024-03-01T16:03:37Z)
- Exploring Simple Open-Vocabulary Semantic Segmentation [7.245983878396646]
Open-vocabulary semantic segmentation models aim to accurately assign a semantic label to each pixel in an image from a set of arbitrary open-vocabulary texts.
In this paper, we introduce S-Seg, a novel model that achieves surprisingly strong performance without depending on any of the components that prior methods typically require.
arXiv Detail & Related papers (2024-01-22T18:59:29Z)
- CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement [65.47237619200442]
Contrastive language image pretraining (CLIP) is a standard method for training vision-language models.
We augment CLIP training with task-specific vision models from model zoos to improve its visual representations.
This simple setup shows substantial improvements of up to 16.3% across different vision tasks.
arXiv Detail & Related papers (2023-10-21T20:20:13Z)
- Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness [19.77762574325687]
The CLIP (Contrastive Language-Image Pre-training) model and its variants are becoming the de facto backbone in many applications.
We discuss two effective approaches to improve the efficiency and robustness of CLIP training.
Our filter-based CLIP model demonstrates a top-1 accuracy of 68.78%, outperforming previous models, all of which had accuracy below 50%.
arXiv Detail & Related papers (2023-05-08T23:47:07Z)
- A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target zero-shot semantic segmentation by building on an off-the-shelf pre-trained vision-language model, i.e., CLIP.
Our experimental results show that this simple framework surpasses the previous state of the art by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)
- Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling [61.03262873980619]
Open-vocabulary instance segmentation aims at segmenting novel classes without mask annotations.
We propose a cross-modal pseudo-labeling framework, which generates training pseudo masks by aligning word semantics in captions with visual features of object masks in images.
Our framework is capable of labeling novel classes in captions via their word semantics to self-train a student model; a rough sketch of this matching step is given below.
arXiv Detail & Related papers (2021-11-24T18:50:47Z)
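The cross-modal pseudo-labeling idea summarized above, which also parallels the caption-mining step used to build OVSeg's finetuning data, can be sketched roughly as follows. This is a conceptual sketch rather than either paper's pipeline: the noun list is assumed to come from an off-the-shelf parser, and the bounding-box crop and 0.5 confidence threshold are illustrative assumptions.

```python
# Rough sketch of caption-driven pseudo-labeling: score each class-agnostic
# mask against nouns from the image caption with a frozen CLIP model and
# keep only confident matches as (mask, noun) pseudo labels.
import numpy as np
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def pseudo_label_masks(image: Image.Image, masks, caption_nouns, threshold=0.5):
    """Return one caption noun (or None) per mask, matched via CLIP similarity."""
    text = clip.tokenize([f"a photo of a {n}" for n in caption_nouns]).to(device)
    with torch.no_grad():
        txt_feat = model.encode_text(text)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)

    labels = []
    for mask in masks:  # each mask: HxW binary numpy array
        ys, xs = np.nonzero(mask)
        if len(ys) == 0:
            labels.append(None)
            continue
        # Crop the mask's bounding box and blank out pixels outside the mask.
        crop = np.array(image.convert("RGB"))[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
        crop[mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1] == 0] = 0
        x = preprocess(Image.fromarray(crop)).unsqueeze(0).to(device)
        with torch.no_grad():
            img_feat = model.encode_image(x)
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
            probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1).squeeze(0)
        best = int(probs.argmax())
        labels.append(caption_nouns[best] if probs[best].item() >= threshold else None)
    return labels  # noisy (mask, noun) pairs that can serve as finetuning data
```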