Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation
- URL: http://arxiv.org/abs/2404.06542v1
- Date: Tue, 9 Apr 2024 18:00:25 GMT
- Title: Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation
- Authors: Luca Barsellotti, Roberto Amoroso, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
- Abstract summary: FreeDA is a training-free diffusion-augmented method for open-vocabulary semantic segmentation.
FreeDA achieves state-of-the-art performance on five datasets.
- Score: 44.008094698200026
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-vocabulary semantic segmentation aims at segmenting arbitrary categories expressed in textual form. Previous works have trained on large amounts of image-caption pairs to enforce pixel-level multimodal alignments. However, captions provide global information about the semantics of a given image but lack direct localization of individual concepts. Further, training on large-scale datasets inevitably brings significant computational costs. In this paper, we propose FreeDA, a training-free diffusion-augmented method for open-vocabulary semantic segmentation, which leverages the ability of diffusion models to visually localize generated concepts and local-global similarities to match class-agnostic regions with semantic classes. Our approach involves an offline stage in which textual-visual reference embeddings are collected, starting from a large set of captions and leveraging visual and semantic contexts. At test time, these are queried to support the visual matching process, which is carried out by jointly considering class-agnostic regions and global semantic similarities. Extensive analyses demonstrate that FreeDA achieves state-of-the-art performance on five datasets, surpassing previous methods by more than 7.0 average points in terms of mIoU and without requiring any training.
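The abstract describes a two-stage pipeline: an offline stage that builds diffusion-augmented textual-visual prototypes, and a test-time stage that labels class-agnostic regions by combining local (region-to-prototype) and global (image-to-text) similarities. The sketch below illustrates only that retrieval-and-matching step; it is not the authors' code, and the prototype bank, region masks, fusion weights, and tensor shapes are stand-in assumptions (random tensors replace real text encoders, visual encoders, and diffusion-generated images).

```python
# Illustrative sketch (not the authors' implementation): training-free region
# labeling via retrieved prototypes and local-global similarity fusion.
import torch
import torch.nn.functional as F

D = 512   # embedding dimension (assumed)
K = 32    # visual prototypes cached per concept (assumed)

# --- Offline stage (run once): diffusion-augmented prototype bank -------------
# In FreeDA this is built from captions: concepts are localized in
# diffusion-generated images and their visual features are cached, keyed by a
# text embedding. Random tensors stand in for that bank here.
concept_text_emb = F.normalize(torch.randn(300, D), dim=-1)     # 300 cached concepts
concept_protos   = F.normalize(torch.randn(300, K, D), dim=-1)  # K prototypes each

# --- Test time: label class-agnostic regions of one image ---------------------
category_text_emb = F.normalize(torch.randn(8, D), dim=-1)   # 8 query category names
region_emb        = F.normalize(torch.randn(20, D), dim=-1)  # 20 class-agnostic regions
global_img_emb    = F.normalize(torch.randn(D), dim=-1)      # whole-image embedding

# Retrieve, for every query category, the prototypes of the nearest cached concept.
nearest = (category_text_emb @ concept_text_emb.T).argmax(dim=-1)   # (8,)
protos  = concept_protos[nearest]                                   # (8, K, D)

# Local similarity: each region against the retrieved prototypes (max over K).
local_sim = torch.einsum("rd,ckd->rck", region_emb, protos).amax(dim=-1)  # (20, 8)

# Global similarity: whole image against category text embeddings (shared by regions).
global_sim = (global_img_emb @ category_text_emb.T).unsqueeze(0)          # (1, 8)

# Fuse the two cues; the 0.7/0.3 weighting is an illustrative assumption.
scores = 0.7 * local_sim + 0.3 * global_sim
region_labels = scores.argmax(dim=-1)   # one category index per region
print(region_labels.shape)              # torch.Size([20])
```

In a real pipeline the prototype bank would be precomputed once from diffusion-generated images and reused for any query vocabulary, which is what keeps the test-time procedure training-free.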
Related papers
- Rewrite Caption Semantics: Bridging Semantic Gaps for Language-Supervised Semantic Segmentation [100.81837601210597]
We propose Concept Curation (CoCu) to bridge the gap between visual and textual semantics in pre-training data.
CoCu achieves strong zero-shot transfer performance and improves substantially over the language-supervised segmentation baseline.
arXiv Detail & Related papers (2023-09-24T00:05:39Z)
- Diffusion Models for Open-Vocabulary Segmentation [79.02153797465324]
OVDiff is a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation.
It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training.
arXiv Detail & Related papers (2023-06-15T17:51:28Z)
- Exploring Open-Vocabulary Semantic Segmentation without Human Labels [76.15862573035565]
We present ZeroSeg, a novel method that leverages existing pretrained vision-language (VL) models to train semantic segmentation models.
ZeroSeg distills the visual concepts learned by VL models into a set of segment tokens, each summarizing a localized region of the target image.
Our approach achieves state-of-the-art performance compared to other zero-shot segmentation methods trained on the same data.
arXiv Detail & Related papers (2023-06-01T08:47:06Z)
- Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.