FreeMask: Synthetic Images with Dense Annotations Make Stronger Segmentation Models
- URL: http://arxiv.org/abs/2310.15160v1
- Date: Mon, 23 Oct 2023 17:57:27 GMT
- Title: FreeMask: Synthetic Images with Dense Annotations Make Stronger Segmentation Models
- Authors: Lihe Yang, Xiaogang Xu, Bingyi Kang, Yinghuan Shi, Hengshuang Zhao
- Abstract summary: FreeMask resorts to synthetic images from generative models to ease the burden of data collection and annotation procedures.
We first synthesize abundant training images conditioned on the semantic masks provided by real datasets.
We investigate the role of synthetic images through joint training with real images and through pre-training followed by fine-tuning on real images.
- Score: 62.009002395326384
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semantic segmentation has witnessed tremendous progress due to various
advanced network architectures. However, these models are extremely hungry for
delicate pixel-level annotations, whose acquisition is laborious and
prohibitively expensive. Therefore, we present FreeMask, which resorts to
synthetic images from generative models to ease the burden of both data
collection and annotation. Concretely, we first synthesize abundant training
images conditioned on the semantic masks provided by real datasets, yielding
extra well-aligned image-mask training pairs for semantic segmentation models.
Surprisingly, models trained solely on synthetic images already achieve
performance comparable to their counterparts trained on real images (e.g., 48.3
vs. 48.5 mIoU on ADE20K, and 49.3 vs. 50.5 on COCO-Stuff). We then investigate
the role of synthetic images through joint training with real images and
through pre-training followed by fine-tuning on real images. Meanwhile, we
design a robust filtering principle to suppress incorrectly synthesized
regions. In addition, we propose to treat different semantic masks unequally,
prioritizing the harder ones and sampling more synthetic images for them. As a
result, whether jointly trained or pre-trained with our filtered and re-sampled
synthetic images, segmentation models are greatly enhanced, e.g., from 48.7 to
52.0 mIoU on ADE20K. Code is available at https://github.com/LiheYoung/FreeMask.
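To make the filtering and re-sampling ideas in the abstract concrete, here is a minimal sketch in PyTorch. It assumes a segmentation model already trained on real images, per-class loss thresholds, and an ignore label of 255; these names and interfaces are illustrative assumptions, not the released FreeMask code.

```python
# Illustrative sketch only: loss-based filtering of noisy synthetic regions and
# hardness-based re-sampling of masks. All names and thresholds are hypothetical.
import torch
import torch.nn.functional as F

IGNORE_INDEX = 255  # label value assumed to be ignored by the training loss


def filter_synthetic_pair(model, image, mask, class_thresholds):
    """Ignore synthetic pixels whose loss under a real-data model is too high."""
    with torch.no_grad():
        logits = model(image.unsqueeze(0))                        # (1, C, H, W)
        loss = F.cross_entropy(logits, mask.unsqueeze(0),
                               ignore_index=IGNORE_INDEX,
                               reduction="none").squeeze(0)        # (H, W)
    filtered = mask.clone()
    for cls, thr in class_thresholds.items():
        noisy = (mask == cls) & (loss > thr)
        filtered[noisy] = IGNORE_INDEX                             # suppress bad regions
    hardness = loss[mask != IGNORE_INDEX].mean().item()            # how hard this mask is
    return filtered, hardness


def resampling_probabilities(per_mask_hardness):
    """Sample more synthetic images for harder masks (higher average loss)."""
    hardness = torch.tensor(per_mask_hardness, dtype=torch.float)
    return hardness / hardness.sum()
```

The returned probabilities could then drive a weighted sampler so that harder masks contribute more synthetic images to training.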
Related papers
- Learning Vision from Models Rivals Learning Vision from Data [54.43596959598465]
We introduce SynCLR, a novel approach for learning visual representations exclusively from synthetic images and synthetic captions.
We synthesize a large dataset of image captions using LLMs, then use an off-the-shelf text-to-image model to generate multiple images corresponding to each synthetic caption.
We perform visual representation learning on these synthetic images via contrastive learning, treating images sharing the same caption as positive pairs.
arXiv Detail & Related papers (2023-12-28T18:59:55Z)
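The SynCLR summary above relies on a multi-positive contrastive objective in which images generated from the same caption are treated as positives. Below is a rough sketch of such a loss, assuming L2-normalized embeddings and an integer caption_ids tensor; it is not taken from the paper's implementation.

```python
# Hypothetical sketch of a caption-grouped (multi-positive) contrastive loss.
import torch
import torch.nn.functional as F


def multi_positive_contrastive_loss(embeddings, caption_ids, temperature=0.1):
    """embeddings: (N, D) L2-normalized; caption_ids: (N,) integer group labels."""
    sim = embeddings @ embeddings.t() / temperature                 # (N, N) similarities
    eye = torch.eye(len(embeddings), dtype=torch.bool, device=embeddings.device)
    same_caption = caption_ids.unsqueeze(0) == caption_ids.unsqueeze(1)
    positives = same_caption & ~eye                                  # same caption, not self
    logits = sim.masked_fill(eye, float("-inf"))                     # exclude self-similarity
    log_prob = F.log_softmax(logits, dim=1)
    pos_counts = positives.sum(1).clamp(min=1)
    # average log-likelihood over all positive pairs for each anchor
    loss = -log_prob.masked_fill(~positives, 0.0).sum(1) / pos_counts
    return loss.mean()
```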
- SegGen: Supercharging Segmentation Models with Text2Mask and Mask2Img Synthesis [36.76548097887539]
SegGen is a highly effective training data generation method for image segmentation.
MaskSyn synthesizes new mask-image pairs via the proposed text-to-mask and mask-to-image generation models.
ImgSyn synthesizes new images based on existing masks using the mask-to-image generation model.
arXiv Detail & Related papers (2023-11-06T18:59:57Z)
- Exploring Limits of Diffusion-Synthetic Training with Weakly Supervised Semantic Segmentation [16.863038973001483]
This work introduces three techniques for diffusion-synthetic semantic segmentation training.
First, reliability-aware robust training, originally used in weakly supervised learning, helps segmentation when the quality of the synthetic masks is insufficient.
Second, large-scale pretraining of whole segmentation models, not only backbones, on synthetic ImageNet-1k-class images with pixel-labels benefits downstream segmentation tasks.
Third, we introduce prompt augmentation, a data augmentation applied to the prompt text set that scales up and diversifies the training images from limited text resources.
arXiv Detail & Related papers (2023-09-04T05:34:19Z)
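Prompt augmentation, as mentioned in the entry above, can be pictured as expanding a small prompt set with templates and attributes before image synthesis. The templates and attributes below are invented for illustration and are not the ones used in the paper.

```python
# Illustrative prompt augmentation: enlarge a small set of class prompts with
# templates and attributes to diversify synthetic training images.
import itertools

TEMPLATES = [
    "a photo of a {}",
    "a photo of a {} in the wild",
    "a close-up photo of a {}",
    "a {} in a cluttered scene",
]

ATTRIBUTES = ["small", "large", "partially occluded"]


def augment_prompts(class_names):
    """Combine class names with templates and attributes to enlarge the prompt set."""
    prompts = []
    for name, template in itertools.product(class_names, TEMPLATES):
        prompts.append(template.format(name))
        for attr in ATTRIBUTES:
            prompts.append(template.format(f"{attr} {name}"))
    return prompts

# Example: augment_prompts(["dog", "bicycle"]) yields 2 * 4 * (1 + 3) = 32 prompts.
```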
- DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models [68.21154597227165]
We show that it is possible to automatically obtain accurate semantic masks of synthetic images generated by the off-the-shelf Stable Diffusion model.
Our approach, called DiffuMask, exploits the potential of the cross-attention map between text and image.
arXiv Detail & Related papers (2023-03-21T08:43:15Z)
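DiffuMask's core idea, per the summary above, is to read off class masks from text-image cross-attention. A simplified sketch of averaging and thresholding cross-attention maps follows; how the maps are extracted from the diffusion pipeline is assumed and not shown, and this is not the authors' implementation.

```python
# Hypothetical sketch: turn cross-attention maps for one text token into a
# binary class mask by averaging across layers/heads and thresholding.
import torch
import torch.nn.functional as F


def attention_to_mask(attn_maps, token_index, out_size, threshold=0.5):
    """attn_maps: list of (heads, H*W, tokens) tensors with square spatial layout."""
    maps = []
    for attn in attn_maps:
        heads, hw, _ = attn.shape
        side = int(hw ** 0.5)                                     # assumes square feature map
        m = attn[:, :, token_index].mean(0).reshape(1, 1, side, side)
        maps.append(F.interpolate(m, size=out_size, mode="bilinear",
                                  align_corners=False))
    avg = torch.cat(maps).mean(0).squeeze(0)                      # (H, W) averaged map
    avg = (avg - avg.min()) / (avg.max() - avg.min() + 1e-8)      # normalize to [0, 1]
    return (avg > threshold).long()                               # binary class mask
```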
- Mask Conditional Synthetic Satellite Imagery [10.235751992415867]
We present a mask-conditional synthetic image generation model for creating synthetic satellite imagery datasets.
We show that it is possible to train an upstream conditional synthetic imagery generator and use it to create synthetic imagery paired with land cover masks.
We find that incorporating a mixture of real and synthetic imagery acts as a data augmentation method, producing better models than using only real imagery.
arXiv Detail & Related papers (2023-02-08T19:42:37Z)
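Mixing real and synthetic imagery, as described in the entry above, can be implemented by sampling from both datasets with a chosen ratio. A small sketch using standard PyTorch utilities follows; the 50/50 mixing ratio is an illustrative choice, not a value from the paper.

```python
# Illustrative mixing of real and synthetic image-mask datasets for training.
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler


def mixed_loader(real_ds, synthetic_ds, batch_size=16, synthetic_weight=0.5):
    dataset = ConcatDataset([real_ds, synthetic_ds])
    # Weight samples so roughly `synthetic_weight` of each batch is synthetic.
    weights = torch.cat([
        torch.full((len(real_ds),), (1 - synthetic_weight) / len(real_ds)),
        torch.full((len(synthetic_ds),), synthetic_weight / len(synthetic_ds)),
    ])
    sampler = WeightedRandomSampler(weights, num_samples=len(dataset), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```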
- One-Shot Synthesis of Images and Segmentation Masks [28.119303696418882]
Joint synthesis of images and segmentation masks with generative adversarial networks (GANs) promises to reduce the effort needed for collecting image data with pixel-wise annotations.
To learn high-fidelity image-mask synthesis, existing GAN approaches first need a pre-training phase requiring large amounts of image data.
We introduce our OSMIS model which enables the synthesis of segmentation masks that are precisely aligned to the generated images in the one-shot regime.
arXiv Detail & Related papers (2022-09-15T18:00:55Z)
- SdAE: Self-distillated Masked Autoencoder [95.3684955370897]
This paper proposes SdAE, a self-distillated masked autoencoder network.
With only 300 epochs pre-training, a vanilla ViT-Base model achieves an 84.1% fine-tuning accuracy on ImageNet-1k classification.
arXiv Detail & Related papers (2022-07-31T15:07:25Z)
- TAGPerson: A Target-Aware Generation Pipeline for Person Re-identification [65.60874203262375]
We propose a novel Target-Aware Generation pipeline, called TAGPerson, to produce synthetic person images.
Specifically, it involves a parameterized rendering method, where the parameters are controllable and can be adjusted according to target scenes.
In our experiments, our target-aware synthetic images can achieve a much higher performance than the generalized synthetic images on MSMT17, i.e. 47.5% vs. 40.9% for rank-1 accuracy.
arXiv Detail & Related papers (2021-12-28T17:56:19Z)
- A Shared Representation for Photorealistic Driving Simulators [83.5985178314263]
We propose to improve the quality of generated images by rethinking the discriminator architecture.
The focus is on the class of problems where images are generated given semantic inputs, such as scene segmentation maps or human body poses.
We aim to learn a shared latent representation that encodes enough information to jointly perform semantic segmentation, content reconstruction, and coarse-to-fine adversarial reasoning.
arXiv Detail & Related papers (2021-12-09T18:59:21Z)