Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2303.04803v4
- Date: Wed, 5 Apr 2023 17:40:38 GMT
- Title: Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models
- Authors: Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang,
Shalini De Mello
- Abstract summary: We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation.
It unifies pre-trained text-image diffusion and discriminative models to perform open-vocabulary segmentation.
Our approach outperforms the previous state of the art by significant margins on both open-vocabulary panoptic and semantic segmentation tasks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation,
which unifies pre-trained text-image diffusion and discriminative models to
perform open-vocabulary panoptic segmentation. Text-to-image diffusion models
have the remarkable ability to generate high-quality images with diverse
open-vocabulary language descriptions. This demonstrates that their internal
representation space is highly correlated with open concepts in the real world.
Text-image discriminative models like CLIP, on the other hand, are good at
classifying images into open-vocabulary labels. We leverage the frozen internal
representations of both these models to perform panoptic segmentation of any
category in the wild. Our approach outperforms the previous state of the art by
significant margins on both open-vocabulary panoptic and semantic segmentation
tasks. In particular, with COCO training only, our method achieves 23.4 PQ and
30.0 mIoU on the ADE20K dataset, with 8.3 PQ and 7.9 mIoU absolute improvement
over the previous state of the art. We open-source our code and models at
https://github.com/NVlabs/ODISE .
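To make the described pipeline concrete, below is a minimal, hypothetical sketch of an ODISE-style architecture: frozen text-to-image diffusion features feed a mask generator, and each predicted mask is classified by cosine similarity against frozen text embeddings of arbitrary label names. Both frozen backbones are simulated here with random stand-ins (the paper uses a Stable Diffusion UNet and a CLIP text encoder); only the data flow is illustrative.

```python
# Minimal sketch of an ODISE-style pipeline. The two frozen backbones are
# hypothetical stand-ins (random projections); the real system uses frozen
# Stable Diffusion UNet features and a frozen CLIP text encoder.
import torch
import torch.nn.functional as F

D_FEAT, D_EMB, H, W = 256, 512, 64, 64  # feature/embedding dims, feature map size
N_MASKS = 100                           # number of mask proposals (queries)

def frozen_diffusion_features(image: torch.Tensor) -> torch.Tensor:
    """Stand-in for frozen text-to-image diffusion UNet features."""
    return torch.randn(1, D_FEAT, H, W)  # [B, C, H, W], hypothetical

def frozen_text_embeddings(class_names: list) -> torch.Tensor:
    """Stand-in for frozen CLIP text embeddings of open-vocabulary labels."""
    return F.normalize(torch.randn(len(class_names), D_EMB), dim=-1)

class MaskGenerator(torch.nn.Module):
    """Query-based head: learned queries attend to the frozen features,
    yielding soft masks plus one embedding per mask for classification."""
    def __init__(self):
        super().__init__()
        self.queries = torch.nn.Parameter(torch.randn(N_MASKS, D_FEAT))
        self.to_emb = torch.nn.Linear(D_FEAT, D_EMB)

    def forward(self, feats: torch.Tensor):
        B, C, h, w = feats.shape
        flat = feats.flatten(2)                        # [B, C, h*w]
        attn = torch.einsum("qc,bcn->bqn", self.queries, flat)
        masks = attn.view(B, N_MASKS, h, w).sigmoid()  # soft binary masks
        # Mask-pooled features -> one embedding per mask proposal
        pooled = torch.einsum("bqn,bcn->bqc", attn.softmax(-1), flat)
        return masks, F.normalize(self.to_emb(pooled), dim=-1)

class_names = ["cat", "dog", "surfboard", "sky"]
image = torch.randn(1, 3, 512, 512)

feats = frozen_diffusion_features(image)        # frozen, never trained
masks, mask_emb = MaskGenerator()(feats)        # only trainable part
text_emb = frozen_text_embeddings(class_names)  # frozen, never trained

# Classify each mask by cosine similarity to the open-vocabulary labels.
logits = mask_emb @ text_emb.T                  # [B, N_MASKS, num_classes]
labels = logits.argmax(-1)                      # class index per mask
print(masks.shape, labels.shape)
```

Because both backbones stay frozen, only the mask generator needs training (on COCO in the paper), which is what lets the classification step transfer to categories never seen during training.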
Related papers
- Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models (2024-07-18)
  We propose Diff2Scene, which leverages frozen representations from text-image generative models, together with salient-aware and geometric-aware masks, for open-vocabulary 3D semantic segmentation and visual grounding. It outperforms competitive baselines and improves significantly over state-of-the-art methods.
- A Simple Framework for Open-Vocabulary Zero-Shot Segmentation (2024-06-23)
  SimZSS is a framework for open-vocabulary zero-shot segmentation. It exploits the discrete nature of text and linguistic knowledge to pinpoint local concepts within captions, and achieves state-of-the-art results on 7 out of 8 benchmark datasets in less than 15 minutes.
- Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation (2024-04-09)
  FreeDA is a training-free, diffusion-augmented method for open-vocabulary semantic segmentation that achieves state-of-the-art performance on five datasets.
- GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields (2024-04-01)
  GOV-NeSF offers a generalizable implicit representation of 3D scenes with open-vocabulary semantics, and exhibits state-of-the-art performance in both 2D and 3D open-vocabulary semantic segmentation.
- FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models (2024-03-29)
  Image segmentation is traditionally solved by training models on closed-vocabulary datasets. FreeSeg-Diff instead leverages relatively small, open-source foundation models for zero-shot open-vocabulary segmentation and, without relying on any training, outperforms many training-based approaches on both the Pascal VOC and COCO datasets.
- Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation (2023-12-29)
  Text-to-image diffusion techniques have shown an exceptional ability to produce high-quality images from text descriptions. This work builds on a state-of-the-art diffusion model, empowered by open-vocabulary semantics, to learn multi-scale textual-visual features for camouflaged object representations.
- Diffusion Models for Open-Vocabulary Segmentation (2023-06-15)
  OVDiff is a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation. It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training; a minimal sketch of this idea appears after this list.