DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic
Segmentation Using Diffusion Models
- URL: http://arxiv.org/abs/2303.11681v4
- Date: Sun, 21 Jan 2024 13:35:44 GMT
- Title: DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic
Segmentation Using Diffusion Models
- Authors: Weijia Wu, Yuzhong Zhao, Mike Zheng Shou, Hong Zhou, Chunhua Shen
- Abstract summary: We show that it is possible to automatically obtain accurate semantic masks of synthetic images generated by an off-the-shelf Stable Diffusion model.
Our approach, called DiffuMask, exploits the potential of the cross-attention map between text and image.
- Score: 68.21154597227165
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Collecting and annotating images with pixel-wise labels is
time-consuming and laborious. In contrast, synthetic data can be obtained
freely from generative models (e.g., DALL-E, Stable Diffusion). In this paper,
we show that it is possible to automatically obtain accurate semantic masks
for synthetic images generated by an off-the-shelf Stable Diffusion model,
which uses only text-image pairs during training. Our approach, called
DiffuMask, exploits the cross-attention map between text and image, making it
natural and seamless to extend text-driven image synthesis to semantic mask
generation. DiffuMask uses text-guided cross-attention information to localize
class/word-specific regions, which are combined with practical techniques to
create high-resolution, class-discriminative pixel-wise masks. The method
substantially reduces data collection and annotation costs. Experiments
demonstrate that existing segmentation methods trained on DiffuMask's
synthetic data achieve performance competitive with counterparts trained on
real data (VOC 2012, Cityscapes). For some classes (e.g., bird), DiffuMask
comes close to the state-of-the-art result obtained with real data (within a
3% mIoU gap). Moreover, in the open-vocabulary (zero-shot) segmentation
setting, DiffuMask achieves a new SOTA result on the unseen classes of VOC
2012. The project website can be found at
https://weijiawu.github.io/DiffusionMask/.
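To make the mechanism concrete, here is a minimal sketch in PyTorch (not the authors' released code) of the core idea: average the text-to-image cross-attention maps for a class token across heads and layers, upsample to image resolution, and threshold into a binary mask. The tensor shapes, token index, and threshold below are illustrative assumptions; the paper combines this localization with further refinement to reach high-resolution, class-discriminative masks.

import torch
import torch.nn.functional as F

def attention_to_mask(attn_maps, token_idx, out_size=(512, 512), thresh=0.5):
    # attn_maps: list of cross-attention tensors, one per UNet layer, each
    # shaped (heads, H*W, num_text_tokens) at that layer's resolution.
    fused = []
    for a in attn_maps:
        heads, hw, _ = a.shape
        side = int(hw ** 0.5)                      # assume square feature map
        m = a[:, :, token_idx].mean(dim=0)         # average over heads
        m = m.reshape(1, 1, side, side)
        m = F.interpolate(m, size=out_size, mode="bilinear",
                          align_corners=False)
        fused.append(m)
    m = torch.stack(fused).mean(dim=0).squeeze()   # average over layers
    m = (m - m.min()) / (m.max() - m.min() + 1e-8) # normalize to [0, 1]
    return (m > thresh).float()                    # binary class mask

# Toy usage with random maps at 16x16 and 32x32 (77 CLIP text tokens):
maps = [torch.rand(8, 16 * 16, 77), torch.rand(8, 32 * 32, 77)]
mask = attention_to_mask(maps, token_idx=5)        # index of the class word
print(mask.shape)                                  # torch.Size([512, 512])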
Related papers
- DiffuMask-Editor: A Novel Paradigm of Integration Between the Segmentation Diffusion Model and Image Editing to Improve Segmentation Ability [5.767984430681467]
This paper introduces DiffuMask-Editor, which combines a segmentation diffusion model for annotated datasets with image editing.
By integrating multiple objects into images using Text2Image models, our method facilitates the creation of more realistic datasets.
Results demonstrate that synthetic data generated by DiffuMask-Editor enable segmentation methods to achieve superior performance compared to real data.
arXiv Detail & Related papers (2024-11-04T05:39:01Z)
- FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.
We leverage multiple relatively small, open-source foundation models for zero-shot open-vocabulary segmentation.
Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
arXiv Detail & Related papers (2024-03-29T10:38:25Z)
- Exploring Limits of Diffusion-Synthetic Training with Weakly Supervised Semantic Segmentation [16.863038973001483]
This work introduces three techniques for diffusion-synthetic semantic segmentation training.
First, reliability-aware robust training, originally used in weakly supervised learning, mitigates the insufficient quality of synthetic masks.
Second, large-scale pretraining of whole segmentation models, not only backbones, on synthetic ImageNet-1k-class images with pixel-labels benefits downstream segmentation tasks.
Third, we introduce prompt augmentation, a data augmentation applied to the prompt text set that scales up and diversifies training images under limited text resources (a sketch follows below).
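As a rough illustration of what such prompt augmentation could look like (the class names, templates, and modifiers below are invented for this example, not taken from the paper), a small set of class words can be expanded into many prompt variants:

import itertools
import random

CLASSES = ["bird", "dog", "car"]            # hypothetical class words
TEMPLATES = ["a photo of a {}", "a {} in the wild", "a close-up of a {}"]
MODIFIERS = ["", ", high resolution", ", on a cloudy day"]

def augment_prompts(classes, n_per_class, seed=0):
    rng = random.Random(seed)
    out = {}
    for c in classes:
        variants = [t.format(c) + m
                    for t, m in itertools.product(TEMPLATES, MODIFIERS)]
        out[c] = rng.sample(variants, min(n_per_class, len(variants)))
    return out

for cls, prompts in augment_prompts(CLASSES, 4).items():
    print(cls, prompts)                     # 4 diverse prompts per class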
arXiv Detail & Related papers (2023-09-04T05:34:19Z)
- DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models [61.906934570771256]
We present a generic dataset generation model that can produce diverse synthetic images and perception annotations.
Our method builds upon the pre-trained diffusion model and extends text-guided image synthesis to perception data generation.
We show that the rich latent code of the diffusion model can be effectively decoded as accurate perception annotations using a decoder module.
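As a loose sketch of that decoder idea (the architecture, channel widths, and class count are assumptions for this example, not DatasetDM's actual module), a light head can fuse multi-scale UNet features into per-pixel class logits:

import torch
import torch.nn as nn
import torch.nn.functional as F

class PerceptionDecoder(nn.Module):
    # Toy decoder: project each feature scale to a common width, upsample
    # everything to the finest resolution, sum, and predict class logits.
    def __init__(self, in_channels=(320, 640, 1280), num_classes=21):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(c, 256, 1) for c in in_channels)
        self.head = nn.Conv2d(256, num_classes, 1)

    def forward(self, feats):               # feats: finest scale first
        target = feats[0].shape[-2:]
        fused = sum(F.interpolate(p(f), size=target, mode="bilinear",
                                  align_corners=False)
                    for p, f in zip(self.proj, feats))
        return self.head(fused)

feats = [torch.rand(1, 320, 64, 64), torch.rand(1, 640, 32, 32),
         torch.rand(1, 1280, 16, 16)]
print(PerceptionDecoder()(feats).shape)     # torch.Size([1, 21, 64, 64])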
arXiv Detail & Related papers (2023-08-11T14:38:11Z)
- Automatic Generation of Semantic Parts for Face Image Synthesis [7.728916126705043]
We describe a network architecture to address the problem of automatically manipulating or generating the shape of object classes in semantic segmentation masks.
Our proposed model allows embedding the mask class-wise into a latent space where each class embedding can be independently edited.
We report quantitative and qualitative results on the CelebAMask-HQ dataset, showing that our model can both faithfully reconstruct and modify a segmentation mask at the class level.
arXiv Detail & Related papers (2023-07-11T15:01:42Z)
- DFormer: Diffusion-guided Transformer for Universal Image Segmentation [86.73405604947459]
The proposed DFormer views the universal image segmentation task as a denoising process using a diffusion model.
At inference, our DFormer directly predicts the masks and corresponding categories from a set of randomly-generated masks.
Our DFormer outperforms the recent diffusion-based panoptic segmentation method Pix2Seq-D with a gain of 3.6% on the MS COCO val2017 set.
arXiv Detail & Related papers (2023-06-06T06:33:32Z)
- MDTv2: Masked Diffusion Transformer is a Strong Image Synthesizer [158.06850125920923]
Diffusion probabilistic models (DPMs) often lack the contextual reasoning ability to learn the relations among object parts in an image.
We propose a mask latent modeling scheme to explicitly enhance the DPMs' ability to learn contextual relations among object semantic parts in an image.
Experimental results show that MDTv2 achieves superior image synthesis performance, e.g., a new SOTA FID score of 1.58 on the ImageNet dataset, and trains more than 10x faster than the previous SOTA, DiT.
arXiv Detail & Related papers (2023-03-25T07:47:21Z)
- Foreground-Background Separation through Concept Distillation from Generative Image Foundation Models [6.408114351192012]
We present a novel method that enables the generation of general foreground-background segmentation models from simple textual descriptions.
We show results on the task of segmenting four different objects (humans, dogs, cars, birds) and a use case scenario in medical image analysis.
arXiv Detail & Related papers (2022-12-29T13:51:54Z)
- Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling [61.03262873980619]
Open-vocabulary instance segmentation aims at segmenting novel classes without mask annotations.
We propose a cross-modal pseudo-labeling framework, which generates training pseudo masks by aligning word semantics in captions with visual features of object masks in images.
Our framework is capable of labeling novel classes in captions via their word semantics to self-train a student model.
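A highly simplified sketch of that alignment step (embedding dimensions and the similarity threshold are assumptions; the full framework involves more than this): score each candidate mask embedding against caption word embeddings in a shared space and keep the best match as the pseudo label.

import torch
import torch.nn.functional as F

def pseudo_label_masks(mask_feats, word_feats, words, min_sim=0.3):
    # mask_feats: (M, D) visual embeddings of candidate object masks;
    # word_feats: (W, D) text embeddings of caption words, same space.
    sims = F.cosine_similarity(mask_feats.unsqueeze(1),
                               word_feats.unsqueeze(0), dim=-1)  # (M, W)
    best_sim, best_idx = sims.max(dim=1)
    return [(words[j] if s >= min_sim else None, float(s))
            for j, s in zip(best_idx.tolist(), best_sim)]

words = ["cat", "sofa", "window"]
print(pseudo_label_masks(torch.randn(4, 128), torch.randn(3, 128), words))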
arXiv Detail & Related papers (2021-11-24T18:50:47Z)
- Few-shot Semantic Image Synthesis Using StyleGAN Prior [8.528384027684192]
We present a training strategy that performs pseudo labeling of semantic masks using the StyleGAN prior.
Our key idea is to construct a simple mapping between the StyleGAN feature and each semantic class from a few examples of semantic masks.
Although the pseudo semantic masks might be too coarse for previous approaches that require pixel-aligned masks, our framework can synthesize high-quality images from not only dense semantic masks but also sparse inputs such as landmarks and scribbles.
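A bare-bones sketch of such a mapping under assumed shapes (a per-pixel linear classifier fit on StyleGAN feature maps from a handful of labeled masks; the paper's actual construction may differ):

import torch
import torch.nn as nn
import torch.nn.functional as F

def fit_feature_classifier(feats, masks, num_classes, steps=200, lr=0.1):
    # feats: (N, C, H, W) StyleGAN feature maps for N labeled examples;
    # masks: (N, H, W) integer class maps aligned with those features.
    n, c, h, w = feats.shape
    x = feats.permute(0, 2, 3, 1).reshape(-1, c)   # one row per pixel
    y = masks.reshape(-1)
    clf = nn.Linear(c, num_classes)
    opt = torch.optim.Adam(clf.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(clf(x), y).backward()
        opt.step()
    return clf                                     # pseudo-labels new images

clf = fit_feature_classifier(torch.randn(2, 64, 32, 32),
                             torch.randint(0, 5, (2, 32, 32)), num_classes=5)
print(clf(torch.randn(10, 64)).argmax(dim=1))      # per-pixel class ids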
arXiv Detail & Related papers (2021-03-27T11:04:22Z)