Related papers: Diffusion Models for Zero-Shot Open-Vocabulary Segmentation

Diffusion Models for Zero-Shot Open-Vocabulary Segmentation

URL: http://arxiv.org/abs/2306.09316v1
Date: Thu, 15 Jun 2023 17:51:28 GMT
Title: Diffusion Models for Zero-Shot Open-Vocabulary Segmentation
Authors: Laurynas Karazija, Iro Laina, Andrea Vedaldi, Christian Rupprecht
Abstract summary: This paper proposes a new method for zero-shot open-vocabulary segmentation. We leverage the generative properties of large-scale text-to-image diffusion models to sample a set of support images. We show that our method can be used to ground several existing pre-trained self-supervised feature extractors in natural language.
Score: 97.25882784890456
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The variety of objects in the real world is nearly unlimited and is thus impossible to capture using models trained on a fixed set of categories. As a result, in recent years, open-vocabulary methods have attracted the interest of the community. This paper proposes a new method for zero-shot open-vocabulary segmentation. Prior work largely relies on contrastive training using image-text pairs, leveraging grouping mechanisms to learn image features that are both aligned with language and well-localised. This however can introduce ambiguity as the visual appearance of images with similar captions often varies. Instead, we leverage the generative properties of large-scale text-to-image diffusion models to sample a set of support images for a given textual category. This provides a distribution of appearances for a given text circumventing the ambiguity problem. We further propose a mechanism that considers the contextual background of the sampled images to better localise objects and segment the background directly. We show that our method can be used to ground several existing pre-trained self-supervised feature extractors in natural language and provide explainable predictions by mapping back to regions in the support set. Our proposal is training-free, relying on pre-trained components only, yet, shows strong performance on a range of open-vocabulary segmentation benchmarks, obtaining a lead of more than 10% on the Pascal VOC benchmark.

Related papers

The Power of One: A Single Example is All it Takes for Segmentation in VLMs [29.735863112700358]
Large-scale vision-language models (VLMs) exhibit strong multimodal understanding capabilities by implicitly learning associations between textual descriptions and image regions. This emergent ability enables zero-shot object detection and segmentation, using techniques that rely on text-image attention maps. We show that this approach yields strong zero-shot performance, further enhanced through fine-tuning with a single visual example.
arXiv Detail & Related papers (2025-03-13T18:18:05Z)
Beyond-Labels: Advancing Open-Vocabulary Segmentation With Vision-Language Models [7.374726900469744]
Open-vocabulary semantic segmentation attempts to classify and outline objects in an image using arbitrary text labels.<n>This study investigates simple yet efficient methods for adapting previously learned foundation models for open-vocabulary semantic segmentation tasks.<n>We propose "Beyond-Labels", a lightweight transformer-based fusion module that uses a small amount of image segmentation data to fuse frozen visual representations with language concepts.
arXiv Detail & Related papers (2025-01-28T07:49:52Z)
Unleashing the Potential of the Diffusion Model in Few-shot Semantic Segmentation [56.87049651707208]
Few-shot Semantic has evolved into In-context tasks, morphing into a crucial element in assessing generalist segmentation models. Our initial focus lies in understanding how to facilitate interaction between the query image and the support image, resulting in the proposal of a KV fusion method within the self-attention framework. Based on our analysis, we establish a simple and effective framework named DiffewS, maximally retaining the original Latent Diffusion Model's generative framework.
arXiv Detail & Related papers (2024-10-03T10:33:49Z)
USE: Universal Segment Embeddings for Open-Vocabulary Image Segmentation [33.11010205890195]
The main challenge in open-vocabulary image segmentation now lies in accurately classifying these segments into text-defined categories. We introduce the Universal Segment Embedding (USE) framework to address this challenge. This framework is comprised of two key components: 1) a data pipeline designed to efficiently curate a large amount of segment-text pairs at various granularities, and 2) a universal segment embedding model that enables precise segment classification into a vast range of text-defined categories.
arXiv Detail & Related papers (2024-06-07T21:41:18Z)
Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation [44.008094698200026]
FreeDA is a training-free diffusion-augmented method for open-vocabulary semantic segmentation. FreeDA achieves state-of-the-art performance on five datasets.
arXiv Detail & Related papers (2024-04-09T18:00:25Z)
FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets. We leverage different and relatively small-sized, open-source foundation models for zero-shot open-vocabulary segmentation. Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
arXiv Detail & Related papers (2024-03-29T10:38:25Z)
Diffusion Model is Secretly a Training-free Open Vocabulary Semantic Segmenter [47.29967666846132]
generative text-to-image diffusion models are highly efficient open-vocabulary semantic segmenters. We introduce a novel training-free approach named DiffSegmenter to generate realistic objects that are semantically faithful to the input text. Extensive experiments on three benchmark datasets show that the proposed DiffSegmenter achieves impressive results for open-vocabulary semantic segmentation.
arXiv Detail & Related papers (2023-09-06T06:31:08Z)
Exploring Open-Vocabulary Semantic Segmentation without Human Labels [76.15862573035565]
We present ZeroSeg, a novel method that leverages the existing pretrained vision-language model (VL) to train semantic segmentation models. ZeroSeg overcomes this by distilling the visual concepts learned by VL models into a set of segment tokens, each summarizing a localized region of the target image. Our approach achieves state-of-the-art performance when compared to other zero-shot segmentation methods under the same training data.
arXiv Detail & Related papers (2023-06-01T08:47:06Z)
Open-vocabulary Panoptic Segmentation with Embedding Modulation [71.15502078615587]
Open-vocabulary image segmentation is attracting increasing attention due to its critical applications in the real world. Traditional closed-vocabulary segmentation methods are not able to characterize novel objects, whereas several recent open-vocabulary attempts obtain unsatisfactory results. We propose OPSNet, an omnipotent and data-efficient framework for Open-vocabulary Panopticon.
arXiv Detail & Related papers (2023-03-20T17:58:48Z)
SCNet: Enhancing Few-Shot Semantic Segmentation by Self-Contrastive Background Prototypes [56.387647750094466]
Few-shot semantic segmentation aims to segment novel-class objects in a query image with only a few annotated examples. Most of advanced solutions exploit a metric learning framework that performs segmentation through matching each pixel to a learned foreground prototype. This framework suffers from biased classification due to incomplete construction of sample pairs with the foreground prototype only.
arXiv Detail & Related papers (2021-04-19T11:21:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.