Diffusion Models for Zero-Shot Open-Vocabulary Segmentation
- URL: http://arxiv.org/abs/2306.09316v1
- Date: Thu, 15 Jun 2023 17:51:28 GMT
- Title: Diffusion Models for Zero-Shot Open-Vocabulary Segmentation
- Authors: Laurynas Karazija, Iro Laina, Andrea Vedaldi, Christian Rupprecht
- Abstract summary: This paper proposes a new method for zero-shot open-vocabulary segmentation.
We leverage the generative properties of large-scale text-to-image diffusion models to sample a set of support images.
We show that our method can be used to ground several existing pre-trained self-supervised feature extractors in natural language.
- Score: 97.25882784890456
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The variety of objects in the real world is nearly unlimited and is thus
impossible to capture using models trained on a fixed set of categories. As a
result, in recent years, open-vocabulary methods have attracted the interest of
the community. This paper proposes a new method for zero-shot open-vocabulary
segmentation. Prior work largely relies on contrastive training using
image-text pairs, leveraging grouping mechanisms to learn image features that
are both aligned with language and well-localised. This, however, can introduce
ambiguity as the visual appearance of images with similar captions often
varies. Instead, we leverage the generative properties of large-scale
text-to-image diffusion models to sample a set of support images for a given
textual category. This provides a distribution of appearances for a given text,
circumventing the ambiguity problem. We further propose a mechanism that
considers the contextual background of the sampled images to better localise
objects and segment the background directly. We show that our method can be
used to ground several existing pre-trained self-supervised feature extractors
in natural language and provide explainable predictions by mapping back to
regions in the support set. Our proposal is training-free, relying on
pre-trained components only, yet shows strong performance on a range of
open-vocabulary segmentation benchmarks, obtaining a lead of more than 10% on
the Pascal VOC benchmark.
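The idea in the abstract (build a support set per text category, then label a query image by comparing its features with that support set) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the support images have already been sampled from a diffusion model, that features come from some frozen self-supervised extractor, and that rough foreground masks for the support images are available. The function names `build_prototypes` and `segment`, the single mean prototype per category, and the cosine-similarity nearest-prototype rule are all illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def build_prototypes(support_feats, support_fg_masks):
    """Hypothetical helper: one foreground and one background prototype per category.

    support_feats:    (N, H, W, D) features of N support images (assumed precomputed).
    support_fg_masks: (N, H, W) boolean foreground masks (assumed given).
    """
    fg = support_feats[support_fg_masks].mean(axis=0)   # mean over foreground locations
    bg = support_feats[~support_fg_masks].mean(axis=0)  # mean over contextual background
    return l2_normalize(fg), l2_normalize(bg)


def segment(query_feats, fg_protos, bg_protos):
    """Assign each query location to its most similar prototype.

    query_feats: (H, W, D); fg_protos: (C, D); bg_protos: (B, D).
    Returns an (H, W) label map: 0 = background, c + 1 = category c.
    """
    q = l2_normalize(query_feats)
    fg_sim = q @ fg_protos.T                                # (H, W, C)
    bg_sim = (q @ bg_protos.T).max(axis=-1, keepdims=True)  # best background match
    sims = np.concatenate([bg_sim, fg_sim], axis=-1)        # (H, W, 1 + C)
    return sims.argmax(axis=-1)
```

Because no weights are updated anywhere, the sketch is training-free in the same sense as the abstract describes; keeping the background prototypes separate is what lets background be predicted directly rather than by thresholding.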
Related papers
- The Power of One: A Single Example is All it Takes for Segmentation in VLMs [29.735863112700358]
Large-scale vision-language models (VLMs) exhibit strong multimodal understanding capabilities by implicitly learning associations between textual descriptions and image regions.
This emergent ability enables zero-shot object detection and segmentation, using techniques that rely on text-image attention maps.
We show that this approach yields strong zero-shot performance, further enhanced through fine-tuning with a single visual example.
arXiv Detail & Related papers (2025-03-13T18:18:05Z)
- Beyond-Labels: Advancing Open-Vocabulary Segmentation With Vision-Language Models [7.374726900469744]
Open-vocabulary semantic segmentation attempts to classify and outline objects in an image using arbitrary text labels.
This study investigates simple yet efficient methods for adapting previously learned foundation models for open-vocabulary semantic segmentation tasks.
We propose "Beyond-Labels", a lightweight transformer-based fusion module that uses a small amount of image segmentation data to fuse frozen visual representations with language concepts.
arXiv Detail & Related papers (2025-01-28T07:49:52Z)
- Unleashing the Potential of the Diffusion Model in Few-shot Semantic Segmentation [56.87049651707208]
Few-shot Semantic Segmentation has evolved into In-context Segmentation tasks, becoming a crucial element in assessing generalist segmentation models.
Our initial focus lies in understanding how to facilitate interaction between the query image and the support image, resulting in the proposal of a KV fusion method within the self-attention framework.
Based on our analysis, we establish a simple and effective framework named DiffewS, maximally retaining the original Latent Diffusion Model's generative framework.
arXiv Detail & Related papers (2024-10-03T10:33:49Z)
- USE: Universal Segment Embeddings for Open-Vocabulary Image Segmentation [33.11010205890195]
The main challenge in open-vocabulary image segmentation now lies in accurately classifying these segments into text-defined categories.
We introduce the Universal Segment Embedding (USE) framework to address this challenge.
This framework is comprised of two key components: 1) a data pipeline designed to efficiently curate a large amount of segment-text pairs at various granularities, and 2) a universal segment embedding model that enables precise segment classification into a vast range of text-defined categories.
arXiv Detail & Related papers (2024-06-07T21:41:18Z)
- Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation [44.008094698200026]
FreeDA is a training-free diffusion-augmented method for open-vocabulary semantic segmentation.
FreeDA achieves state-of-the-art performance on five datasets.
arXiv Detail & Related papers (2024-04-09T18:00:25Z)
- FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.
We leverage different and relatively small-sized, open-source foundation models for zero-shot open-vocabulary segmentation.
Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
arXiv Detail & Related papers (2024-03-29T10:38:25Z)
- Diffusion Model is Secretly a Training-free Open Vocabulary Semantic Segmenter [47.29967666846132]
Generative text-to-image diffusion models are highly efficient open-vocabulary semantic segmenters.
We introduce a novel training-free approach named DiffSegmenter to generate realistic objects that are semantically faithful to the input text.
Extensive experiments on three benchmark datasets show that the proposed DiffSegmenter achieves impressive results for open-vocabulary semantic segmentation.
arXiv Detail & Related papers (2023-09-06T06:31:08Z)
- Exploring Open-Vocabulary Semantic Segmentation without Human Labels [76.15862573035565]
We present ZeroSeg, a novel method that leverages an existing pretrained vision-language (VL) model to train semantic segmentation models.
ZeroSeg overcomes this by distilling the visual concepts learned by VL models into a set of segment tokens, each summarizing a localized region of the target image.
Our approach achieves state-of-the-art performance when compared to other zero-shot segmentation methods under the same training data.
arXiv Detail & Related papers (2023-06-01T08:47:06Z)
- Open-vocabulary Panoptic Segmentation with Embedding Modulation [71.15502078615587]
Open-vocabulary image segmentation is attracting increasing attention due to its critical applications in the real world.
Traditional closed-vocabulary segmentation methods are not able to characterize novel objects, whereas several recent open-vocabulary attempts obtain unsatisfactory results.
We propose OPSNet, an omnipotent and data-efficient framework for Open-vocabulary Panoptic Segmentation.
arXiv Detail & Related papers (2023-03-20T17:58:48Z)
- SCNet: Enhancing Few-Shot Semantic Segmentation by Self-Contrastive Background Prototypes [56.387647750094466]
Few-shot semantic segmentation aims to segment novel-class objects in a query image with only a few annotated examples.
Most advanced solutions exploit a metric learning framework that performs segmentation by matching each pixel to a learned foreground prototype.
This framework suffers from biased classification due to incomplete construction of sample pairs with the foreground prototype only.
arXiv Detail & Related papers (2021-04-19T11:21:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.