Beyond Generation: Harnessing Text to Image Models for Object Detection
and Segmentation
- URL: http://arxiv.org/abs/2309.05956v1
- Date: Tue, 12 Sep 2023 04:41:45 GMT
- Title: Beyond Generation: Harnessing Text to Image Models for Object Detection
and Segmentation
- Authors: Yunhao Ge, Jiashu Xu, Brian Nlong Zhao, Neel Joshi, Laurent Itti,
Vibhav Vineet
- Abstract summary: We propose a new paradigm to automatically generate training data with accurate labels at scale.
The proposed approach decouples training data generation into foreground object generation and contextually coherent background generation.
We demonstrate the advantages of our approach on five object detection and segmentation datasets.
- Score: 29.274362919954218
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a new paradigm to automatically generate training data with
accurate labels at scale using the text-to-image synthesis frameworks (e.g.,
DALL-E, Stable Diffusion, etc.). The proposed approach decouples training data
generation into foreground object generation and contextually coherent
background generation. To generate foreground objects, we employ a
straightforward textual template, incorporating the object class name as input
prompts. This is fed into a text-to-image synthesis framework, producing
various foreground images set against isolated backgrounds. A
foreground-background segmentation algorithm is then used to generate
foreground object masks. To generate context images, we begin by creating
language descriptions of the context. This is achieved by applying an image
captioning method to a small set of images representing the desired context.
These textual descriptions are then transformed into a diverse array of context
images via a text-to-image synthesis framework. Subsequently, we composite
these with the foreground object masks produced in the initial step, utilizing
a cut-and-paste method, to formulate the training data. We demonstrate the
advantages of our approach on five object detection and segmentation datasets,
including Pascal VOC and COCO. We found that detectors trained solely on
synthetic data produced by our method achieve performance comparable to those
trained on real data (Fig. 1). Moreover, a combination of real and synthetic
data yields even better results. Further analysis indicates that the
synthetic data distribution complements the real data distribution effectively.
Additionally, we emphasize the compositional nature of our data generation
approach in out-of-distribution and zero-shot data generation scenarios. We
open-source our code at https://github.com/gyhandy/Text2Image-for-Detection
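The three-stage pipeline in the abstract maps naturally onto off-the-shelf tools. Below is a minimal sketch of the foreground step, assuming Stable Diffusion (via the diffusers library) as the text-to-image backend and rembg as the foreground-background segmentation algorithm; the abstract names neither tool, and the prompt template is a hypothetical stand-in for the paper's textual template.

```python
# Step 1 (sketch): generate isolated foreground objects from a class-name
# prompt template, then extract binary object masks. Stable Diffusion and
# rembg are illustrative choices, not necessarily the paper's.
import torch
from diffusers import StableDiffusionPipeline
from rembg import remove  # off-the-shelf foreground-background segmenter

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def generate_foregrounds(class_name: str, n: int = 4):
    """Return n (RGBA cutout, binary mask) pairs for `class_name`."""
    # Hypothetical template: isolate the object against a simple backdrop.
    prompt = f"a photo of a single {class_name} on a plain white background"
    pairs = []
    for _ in range(n):
        image = pipe(prompt).images[0]        # RGB PIL image
        cutout = remove(image)                # RGBA; alpha marks the object
        mask = cutout.split()[-1].point(lambda a: 255 if a > 127 else 0)
        pairs.append((cutout, mask))
    return pairs
```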
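The context step can be sketched the same way: caption a handful of real images of the desired context, then fan each caption out into many synthetic backgrounds. BLIP is used here as a stand-in captioner, since the abstract does not name a specific image captioning method.

```python
# Step 2 (sketch): describe the target context in language, then synthesize
# diverse background images from those descriptions.
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline
from transformers import BlipForConditionalGeneration, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def contexts_from_examples(image_paths, images_per_caption: int = 8):
    """Caption a few real context images, then generate backgrounds."""
    backgrounds = []
    for path in image_paths:
        inputs = processor(Image.open(path).convert("RGB"), return_tensors="pt")
        out = captioner.generate(**inputs, max_new_tokens=30)
        caption = processor.decode(out[0], skip_special_tokens=True)
        # Each caption fans out into a diverse set of context images.
        backgrounds += [pipe(caption).images[0] for _ in range(images_per_caption)]
    return backgrounds
```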
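Finally, compositing is a plain cut-and-paste: an RGBA cutout from step 1 is scaled and pasted onto a background from step 2, and the paste geometry directly yields the box label. The scale range and the COCO-style annotation layout below are assumptions for illustration.

```python
# Step 3 (sketch): cut-and-paste compositing with automatic box labels.
import random
from PIL import Image

def composite(background: Image.Image, cutout: Image.Image, category_id: int):
    """Paste one RGBA cutout onto a background; return image + box label."""
    bg = background.convert("RGB").copy()
    box = cutout.split()[-1].getbbox()          # tight box around the object
    if box:
        cutout = cutout.crop(box)
    scale = random.uniform(0.2, 0.6)            # assumed augmentation range
    w = int(bg.width * scale)
    h = max(1, int(cutout.height * w / cutout.width))
    fg = cutout.resize((w, h))
    x = random.randint(0, max(0, bg.width - w))
    y = random.randint(0, max(0, bg.height - h))
    bg.paste(fg, (x, y), mask=fg)               # alpha channel as paste mask
    annotation = {                              # COCO-style [x, y, w, h] box
        "bbox": [x, y, w, h],
        "category_id": category_id,
        "area": w * h,
    }
    return bg, annotation
```

Looping this over many cutout/background pairs yields a fully labeled synthetic detection set with no manual annotation, which is the property the abstract's comparisons on Pascal VOC and COCO rely on.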
Related papers
- Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models [63.99110667987318]
We present DiffText, a pipeline that seamlessly blends foreground text with the background's intrinsic features.
With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors.
arXiv Detail & Related papers (2023-11-28T06:51:28Z)
- Adapt Anything: Tailor Any Image Classifiers across Domains And Categories Using Text-to-Image Diffusion Models [82.95591765009105]
We aim to study if a modern text-to-image diffusion model can tailor any task-adaptive image classifier across domains and categories.
We utilize only one off-the-shelf text-to-image model to synthesize images with category labels derived from the corresponding text prompts.
arXiv Detail & Related papers (2023-10-25T11:58:14Z)
- Style Generation: Image Synthesis based on Coarsely Matched Texts [10.939482612568433]
We introduce a novel task called text-based style generation and propose a two-stage generative adversarial network.
The first stage generates the overall image style with a sentence feature, and the second stage refines the generated style with a synthetic feature.
The practical potential of our work is demonstrated by various applications such as text-image alignment and story visualization.
arXiv Detail & Related papers (2023-09-08T21:51:11Z)
- Self-supervised Scene Text Segmentation with Object-centric Layered Representations Augmented by Text Regions [22.090074821554754]
We propose a self-supervised scene text segmentation algorithm with layered decoupling of object-centric representations to segment images into text and background.
On several public scene text datasets, our method outperforms the state-of-the-art unsupervised segmentation algorithms.
arXiv Detail & Related papers (2023-08-25T05:00:05Z)
- Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z)
- Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models [55.04969603431266]
This paper proposes a method for generating images of customized objects specified by users.
The method is based on a general framework that bypasses the lengthy optimization required by previous approaches.
We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z)
- WordStylist: Styled Verbatim Handwritten Text Generation with Latent Diffusion Models [8.334487584550185]
We present a latent diffusion-based method that generates word-level images of specified text content in a given handwriting style.
Our proposed method is able to generate realistic word image samples from different writer styles.
We show that the proposed model produces samples that are aesthetically pleasing, help boost text recognition performance, and achieve writer retrieval scores similar to real data.
arXiv Detail & Related papers (2023-03-29T10:19:26Z)
- Scrape, Cut, Paste and Learn: Automated Dataset Generation Applied to Parcel Logistics [58.720142291102135]
We present a fully automated pipeline to generate a synthetic dataset for instance segmentation in four steps.
We first scrape images for the objects of interest from popular image search engines.
We compare three different methods for image selection: Object-agnostic pre-processing, manual image selection and CNN-based image selection.
arXiv Detail & Related papers (2022-10-18T12:49:04Z)
- DALL-E for Detection: Language-driven Context Image Synthesis for Object Detection [18.276823176045525]
We propose a new paradigm for automatic context image generation at scale.
At the core of our approach lies utilizing an interplay between language description of context and language-driven image generation.
We demonstrate the advantages of our approach over the prior context image generation approaches on four object detection datasets.
arXiv Detail & Related papers (2022-06-20T06:43:17Z)
- OptGAN: Optimizing and Interpreting the Latent Space of the Conditional Text-to-Image GANs [8.26410341981427]
We study how to ensure that generated samples are believable, realistic or natural.
We present a novel algorithm which identifies semantically-understandable directions in the latent space of a conditional text-to-image GAN architecture.
arXiv Detail & Related papers (2022-02-25T20:00:33Z)
- Text-to-Image Generation Grounded by Fine-Grained User Attention [62.94737811887098]
Localized Narratives is a dataset with detailed natural language descriptions of images paired with mouse traces.
We propose TReCS, a sequential model that exploits this grounding to generate images.
arXiv Detail & Related papers (2020-11-07T13:23:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.