Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models
- URL: http://arxiv.org/abs/2311.17095v4
- Date: Sat, 15 Jun 2024 14:20:34 GMT
- Title: Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models
- Authors: Jiayun Luo, Siddhesh Khandelwal, Leonid Sigal, Boyang Li
- Abstract summary: Large-scale vision-language models (VLMs) learn to implicitly associate image regions with words, an association that proves effective for tasks like visual question answering.
In this paper, we propose a simple, yet extremely effective, training-free technique, Plug-and-Play Open-Vocabulary Semantic Segmentation (PnP-OVSS), for this task.
- Score: 44.146292819267956
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: From image-text pairs, large-scale vision-language models (VLMs) learn to implicitly associate image regions with words, which prove effective for tasks like visual question answering. However, leveraging the learned association for open-vocabulary semantic segmentation remains a challenge. In this paper, we propose a simple, yet extremely effective, training-free technique, Plug-and-Play Open-Vocabulary Semantic Segmentation (PnP-OVSS) for this task. PnP-OVSS leverages a VLM with direct text-to-image cross-attention and an image-text matching loss. To balance between over-segmentation and under-segmentation, we introduce Salience Dropout; by iteratively dropping patches that the model is most attentive to, we are able to better resolve the entire extent of the segmentation mask. PnP-OVSS does not require any neural network training and performs hyperparameter tuning without the need for any segmentation annotations, even for a validation set. PnP-OVSS demonstrates substantial improvements over comparable baselines (+26.2% mIoU on Pascal VOC, +20.5% mIoU on MS COCO, +3.1% mIoU on COCO Stuff and +3.0% mIoU on ADE20K). Our codebase is at https://github.com/letitiabanana/PnP-OVSS.
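For intuition, the snippet below is a minimal sketch of the iterative Salience Dropout idea described in the abstract. It is not the authors' implementation: `vlm_cross_attention`, its `visible` argument, and every hyperparameter shown are hypothetical stand-ins for a VLM that exposes text-to-image cross-attention over patch tokens.

```python
# Minimal sketch of the Salience Dropout loop, assuming a hypothetical
# `vlm_cross_attention(image, prompt, visible)` that returns one attention
# weight per image patch for the given class prompt.
import torch

def salience_dropout_mask(image, class_prompt, vlm_cross_attention,
                          num_rounds=3, drop_ratio=0.3, threshold=0.5):
    """Accumulate cross-attention over several rounds, dropping the most
    attended patches each round so later rounds cover the remaining,
    less salient parts of the object."""
    keep = None          # boolean mask of patches still visible to the VLM
    accumulated = None   # running sum of attention maps

    for _ in range(num_rounds):
        # (num_patches,) attention of the class prompt over the visible patches
        attn = vlm_cross_attention(image, class_prompt, visible=keep)
        accumulated = attn if accumulated is None else accumulated + attn
        if keep is None:
            keep = torch.ones_like(attn, dtype=torch.bool)

        # Drop the top-k most attended patches among those still visible
        k = int(drop_ratio * keep.sum())
        top = torch.topk(attn.masked_fill(~keep, float("-inf")), k).indices
        keep[top] = False

    # Normalize the accumulated salience and binarize it into a patch-level mask
    accumulated = (accumulated - accumulated.min()) / (accumulated.max() - accumulated.min() + 1e-8)
    return accumulated > threshold
```

The sketch only illustrates the drop-and-accumulate loop; in PnP-OVSS itself the attention comes from a VLM trained with an image-text matching loss, and hyperparameters are tuned without segmentation annotations.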
Related papers
- Exploring Simple Open-Vocabulary Semantic Segmentation [7.245983878396646]
Open-vocabulary semantic segmentation models aim to accurately assign a semantic label to each pixel in an image from a set of arbitrary open-vocabulary texts.
In this paper, we introduce S-Seg, a novel model that achieves surprisingly strong performance without depending on the components prior approaches typically require (image-level VL models, ground-truth masks, or custom grouping encoders).
arXiv Detail & Related papers (2024-01-22T18:59:29Z)
- CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor [18.288738950822342]
Mask labels are labor-intensive, which limits the number of categories in segmentation datasets.
We introduce a novel recurrent framework that progressively filters out irrelevant texts and enhances mask quality without training efforts.
Experiments show that our method outperforms not only the training-free counterparts, but also those fine-tuned with millions of data samples.
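The recurrent filtering idea lends itself to a short sketch. The loop below is illustrative only, not the paper's pipeline; `mask_proposal` and `mask_score` are hypothetical stand-ins for a frozen VLM-based mask generator and a mask-text agreement score.

```python
# Illustrative sketch of progressively filtering out irrelevant text queries;
# assumes hypothetical mask_proposal(image, text) and mask_score(image, mask, text).
def recurrent_text_filtering(image, texts, mask_proposal, mask_score,
                             keep_threshold=0.5, max_iters=5):
    """Repeatedly propose a mask per text query, then drop the queries whose
    masks agree poorly with the image, until the set of texts stops shrinking."""
    active = list(texts)
    masks = {}
    for _ in range(max_iters):
        masks = {t: mask_proposal(image, t) for t in active}
        scores = {t: mask_score(image, masks[t], t) for t in active}
        survivors = [t for t in active if scores[t] >= keep_threshold]
        if len(survivors) == len(active):   # nothing filtered out: converged
            break
        active = survivors
    return {t: masks[t] for t in active}
```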
arXiv Detail & Related papers (2023-12-12T19:00:04Z)
- Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion [24.02235805999193]
We propose a model capable of segmenting anything in a zero-shot manner without any annotations.
We introduce a simple yet effective iterative merging process based on measuring KL divergence among attention maps to merge them into valid segmentation masks.
On COCO-Stuff-27, our method surpasses the prior unsupervised zero-shot SOTA method by an absolute 26% in pixel accuracy and 17% in mean IoU.
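As a rough illustration of merging attention maps by KL divergence, here is a toy greedy procedure. The symmetric distance, the threshold, and the merge order are assumptions made for the sketch, not the paper's exact algorithm.

```python
# Toy greedy merging of attention maps whose pairwise symmetric KL divergence
# is small; each input map is an HxW tensor normalized to sum to ~1.
import torch
import torch.nn.functional as F

def symmetric_kl(p, q, eps=1e-8):
    p = p.flatten() + eps
    q = q.flatten() + eps
    p, q = p / p.sum(), q / q.sum()
    return 0.5 * (F.kl_div(q.log(), p, reduction="sum")
                  + F.kl_div(p.log(), q, reduction="sum"))

def merge_attention_maps(attn_maps, merge_threshold=0.1):
    maps = [m.clone() for m in attn_maps]
    merged = True
    while merged and len(maps) > 1:
        merged = False
        for i in range(len(maps)):
            for j in range(i + 1, len(maps)):
                if symmetric_kl(maps[i], maps[j]) < merge_threshold:
                    maps[i] = 0.5 * (maps[i] + maps[j])  # average the pair
                    maps.pop(j)
                    merged = True
                    break
            if merged:
                break
    return maps  # each surviving map acts as one segment proposal
```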
arXiv Detail & Related papers (2023-08-23T23:44:44Z)
- Diffusion Models for Open-Vocabulary Segmentation [79.02153797465324]
OVDiff is a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation.
It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training.
arXiv Detail & Related papers (2023-06-15T17:51:28Z)
- Exploring Open-Vocabulary Semantic Segmentation without Human Labels [76.15862573035565]
We present ZeroSeg, a novel method that leverages an existing pretrained vision-language (VL) model to train semantic segmentation models.
ZeroSeg bridges the gap between the image-level semantics captured by VL models and the pixel-level predictions needed for segmentation by distilling the visual concepts learned by VL models into a set of segment tokens, each summarizing a localized region of the target image.
Our approach achieves state-of-the-art performance when compared to other zero-shot segmentation methods under the same training data.
arXiv Detail & Related papers (2023-06-01T08:47:06Z)
- Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision [49.905448429974804]
We consider the problem of open-vocabulary semantic segmentation (OVS), which aims to segment objects of arbitrary classes instead of pre-defined, closed-set categories.
We propose a transformer-based model for OVS, termed OVSegmentor, which exploits web-crawled image-text pairs for pre-training.
Our model achieves superior segmentation results over the state-of-the-art method while using only 3% of the data (4M vs. 134M) for pre-training.
arXiv Detail & Related papers (2023-01-22T13:10:05Z)
- SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation [26.079055078561986]
We propose a CLIP-based model named SegCLIP for open-vocabulary segmentation.
The main idea is to gather patches into semantic regions via learnable centers, trained on text-image pairs.
Experimental results show that our model achieves comparable or superior segmentation accuracy.
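A rough sketch of gathering patch tokens onto learnable centers via attention is shown below. The shapes, the single attention step, and the hard assignment are assumptions made for illustration; SegCLIP's actual module and training objectives are more involved.

```python
# Sketch of attention-based patch aggregation onto learnable centers;
# each patch is softly assigned to a center, and centers pool patch features.
import torch
import torch.nn as nn

class PatchAggregation(nn.Module):
    def __init__(self, num_centers, dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_centers, dim) * 0.02)

    def forward(self, patch_tokens):
        """patch_tokens: (P, D). Returns (num_centers, D) region features and
        a (P,) hard assignment of each patch to a center (a proto-segmentation)."""
        attn = torch.softmax(patch_tokens @ self.centers.t(), dim=-1)      # (P, K)
        region_feats = attn.t() @ patch_tokens / (attn.sum(0).unsqueeze(1) + 1e-8)
        assignment = attn.argmax(dim=-1)
        return region_feats, assignment
```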
arXiv Detail & Related papers (2022-11-27T12:38:52Z)
- Group-Wise Semantic Mining for Weakly Supervised Semantic Segmentation [49.90178055521207]
This work addresses weakly supervised semantic segmentation (WSSS), with the goal of bridging the gap between image-level annotations and pixel-level segmentation.
We formulate WSSS as a novel group-wise learning task that explicitly models semantic dependencies in a group of images to estimate more reliable pseudo ground-truths.
In particular, we devise a graph neural network (GNN) for group-wise semantic mining, wherein input images are represented as graph nodes.
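As a heavily simplified sketch of the group-wise idea, the snippet below treats each image in a group as a graph node and lets every node aggregate the other nodes' features, so semantics shared across the group reinforce each other. The fully connected graph, the single linear update, and the feature shapes are assumptions, not the paper's architecture.

```python
# Simplified message passing over a group of images represented as graph nodes.
import torch
import torch.nn as nn

class GroupMessagePassing(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, node_feats):
        """node_feats: (N, D), one feature vector per image in the group."""
        n = node_feats.size(0)
        # mean over the *other* nodes' features, for each node
        totals = node_feats.sum(dim=0, keepdim=True)            # (1, D)
        neighbor_mean = (totals - node_feats) / max(n - 1, 1)   # (N, D)
        return torch.relu(self.update(torch.cat([node_feats, neighbor_mean], dim=-1)))
```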
arXiv Detail & Related papers (2020-12-09T12:40:13Z)
- Dense Contrastive Learning for Self-Supervised Visual Pre-Training [102.15325936477362]
We present dense contrastive learning, which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images.
Compared to the baseline method MoCo-v2, our method introduces negligible computation overhead (only 1% slower).
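Below is a minimal sketch of a per-location contrastive loss between two views, in the spirit of the description above. The correspondence rule (argmax cosine similarity, taken without gradients) and the temperature are illustrative choices, not DenseCL's exact recipe.

```python
# Sketch of a dense (per-location) InfoNCE-style loss between two views.
import torch
import torch.nn.functional as F

def dense_contrastive_loss(feat1, feat2, temperature=0.2):
    """feat1, feat2: (N, C) dense features from two views of the same image,
    one vector per spatial location, after a dense projection head."""
    feat1 = F.normalize(feat1, dim=-1)
    feat2 = F.normalize(feat2, dim=-1)
    sim = feat1 @ feat2.t() / temperature      # (N, N) similarity logits
    # Each location in view 1 takes its most similar location in view 2 as its
    # positive; the indices carry no gradient and act as pseudo-correspondences.
    targets = sim.detach().argmax(dim=1)
    return F.cross_entropy(sim, targets)
```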
arXiv Detail & Related papers (2020-11-18T08:42:32Z)