SpaText: Spatio-Textual Representation for Controllable Image Generation
- URL: http://arxiv.org/abs/2211.14305v2
- Date: Sun, 19 Mar 2023 16:25:10 GMT
- Title: SpaText: Spatio-Textual Representation for Controllable Image Generation
- Authors: Omri Avrahami, Thomas Hayes, Oran Gafni, Sonal Gupta, Yaniv Taigman,
Devi Parikh, Dani Lischinski, Ohad Fried, Xi Yin
- Abstract summary: SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-based.
- Score: 61.89548017729586
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent text-to-image diffusion models are able to generate convincing results
of unprecedented quality. However, it is nearly impossible to control the
shapes of different regions/objects or their layout in a fine-grained fashion.
Previous attempts to provide such controls were hindered by their reliance on a
fixed set of labels. To this end, we present SpaText - a new method for
text-to-image generation using open-vocabulary scene control. In addition to a
global text prompt that describes the entire scene, the user provides a
segmentation map where each region of interest is annotated by a free-form
natural language description. Due to the lack of large-scale datasets that have a
detailed textual description for each region in the image, we choose to
leverage the current large-scale text-to-image datasets and base our approach
on a novel CLIP-based spatio-textual representation, and show its effectiveness
on two state-of-the-art diffusion models: pixel-based and latent-based. In
addition, we show how to extend the classifier-free guidance method in
diffusion models to the multi-conditional case and present an alternative
accelerated inference algorithm. Finally, we offer several automatic evaluation
metrics and use them, in addition to FID scores and a user study, to evaluate
our method and show that it achieves state-of-the-art results on image
generation with free-form textual scene control.
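A helpful way to picture the CLIP-based spatio-textual representation is as a dense per-pixel embedding map: every pixel inside a user-annotated region holds the CLIP embedding of that region's free-form description, and unannotated pixels stay empty. Below is a minimal Python sketch of that idea; the function name `build_spatio_textual_map` and the `clip_text_encoder` callable are illustrative assumptions rather than the paper's exact pipeline (which differs, e.g., in how embeddings are produced at training versus inference time).

```python
import torch

def build_spatio_textual_map(seg_masks, region_prompts, clip_text_encoder,
                             embed_dim=512):
    """Build an (H, W, D) spatio-textual tensor.

    seg_masks:         list of (H, W) boolean tensors, one per region
    region_prompts:    list of free-form descriptions, one per region
    clip_text_encoder: callable str -> (D,) tensor (hypothetical; wrap any
                       CLIP text encoder here)
    """
    assert len(seg_masks) == len(region_prompts)
    H, W = seg_masks[0].shape
    st_map = torch.zeros(H, W, embed_dim)
    for mask, prompt in zip(seg_masks, region_prompts):
        emb = clip_text_encoder(prompt)   # (D,) embedding of the region's text
        emb = emb / emb.norm()            # CLIP embeddings are typically unit-normalized
        st_map[mask] = emb                # broadcast to every pixel in the region
    return st_map                         # unannotated pixels remain zero
```

One natural way to use such a tensor is to concatenate it to the diffusion model's input channels, which applies equally to a pixel-based or a latent-based model.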
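The abstract also mentions extending classifier-free guidance to the multi-conditional case. A minimal sketch of one simple additive variant follows, assuming a denoiser `eps_model(x_t, t, conds)` (a hypothetical signature) and a null stand-in for each condition; the paper presents its own extension and a faster inference algorithm, so treat this strictly as an illustration of the idea.

```python
import torch

def multi_cond_cfg(eps_model, x_t: torch.Tensor, t: int,
                   conds: dict, null_conds: dict, weights: dict) -> torch.Tensor:
    """Classifier-free guidance with several conditions (e.g. the global
    prompt and the spatio-textual map), combined additively.

    conds / null_conds: name -> conditioning input / its "empty" stand-in
    weights:            name -> guidance scale for that condition
    """
    eps_uncond = eps_model(x_t, t, null_conds)     # fully unconditional prediction
    eps = eps_uncond.clone()
    for name, cond in conds.items():
        single = dict(null_conds)                  # all conditions null...
        single[name] = cond                        # ...except this one
        eps = eps + weights[name] * (eps_model(x_t, t, single) - eps_uncond)
    return eps
```

With a single condition and weight, this reduces to standard classifier-free guidance, which is a quick sanity check on the formulation.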
Related papers
- Layout Agnostic Scene Text Image Synthesis with Diffusion Models [42.37340959594495]
SceneTextGen is a novel diffusion-based model specifically designed to circumvent the need for a predefined layout stage.
The novelty of SceneTextGen lies in its integration of three key components: a character-level encoder that captures detailed typographic properties, plus a character-level instance segmentation model and a word-level spotting model that address unwanted text generation and minor character inaccuracies.
arXiv Detail & Related papers (2024-06-03T07:20:34Z)
- Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, empowered by open-vocabulary learning to capture multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z)
- Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model [31.819060415422353]
Diff-Text is a training-free scene text generation framework for any language.
Our method outperforms existing methods in both the accuracy of text recognition and the naturalness of foreground-background blending.
arXiv Detail & Related papers (2023-12-19T15:18:40Z)
- CLIM: Contrastive Language-Image Mosaic for Region Representation [58.05870131126816]
Contrastive Language-Image Mosaic (CLIM) is a novel approach for aligning region and text representations.
CLIM consistently improves different open-vocabulary object detection methods.
It can effectively enhance the region representation of vision-language models.
arXiv Detail & Related papers (2023-12-18T17:39:47Z)
- UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves the design and training of a light-weight character-level text encoder, which replaces the original CLIP encoder.
By employing an inference stage refinement process, we achieve a notably high sequence accuracy when synthesizing text in arbitrarily given images.
arXiv Detail & Related papers (2023-12-08T07:47:46Z)
- Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models [63.99110667987318]
We present DiffText, a pipeline that seamlessly blends foreground text with the background's intrinsic features.
With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors.
arXiv Detail & Related papers (2023-11-28T06:51:28Z)
- Text-Driven Image Editing via Learnable Regions [74.45313434129005]
We introduce a method for region-based image editing driven by textual prompts, without the need for user-provided masks or sketches.
We show that this simple approach enables flexible editing that is compatible with current image generation models.
Experiments demonstrate the competitive performance of our method in manipulating images with high fidelity and realism that correspond to the provided language descriptions.
arXiv Detail & Related papers (2023-11-28T02:27:31Z)
- From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models [38.14123683674355]
We propose a method to utilize the attention mechanism in the denoising network of text-to-image diffusion models.
We evaluate our method on Pascal VOC 2012 and Microsoft COCO 2014 under weakly-supervised semantic segmentation setting.
Our work reveals a novel way to extract the rich multi-modal knowledge hidden in diffusion models for segmentation (a rough sketch of this attention-to-mask idea appears after this list).
arXiv Detail & Related papers (2023-09-08T04:10:01Z)
- Self-supervised Scene Text Segmentation with Object-centric Layered Representations Augmented by Text Regions [22.090074821554754]
We propose a self-supervised scene text segmentation algorithm with layered decoupling of representations, derived in an object-centric manner, to segment images into text and background.
On several public scene text datasets, our method outperforms the state-of-the-art unsupervised segmentation algorithms.
arXiv Detail & Related papers (2023-08-25T05:00:05Z)
- Text Reading Order in Uncontrolled Conditions by Sparse Graph Segmentation [71.40119152422295]
We propose a lightweight, scalable and generalizable approach to identify text reading order.
The model is language-agnostic and runs effectively across multi-language datasets.
It is small enough to be deployed on virtually any platform including mobile devices.
arXiv Detail & Related papers (2023-05-04T06:21:00Z)
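Finally, as referenced in the From Text to Mask entry above, turning a diffusion model's cross-attention into entity masks can be sketched roughly as below. The tensor layout, the `attention_to_masks` name, and the thresholding scheme are illustrative assumptions; real models expose attention maps differently, and the paper's actual method may aggregate them across layers and timesteps.

```python
import torch

def attention_to_masks(cross_attn: torch.Tensor, entity_token_ids, threshold=0.5):
    """Derive coarse (H, W) boolean masks from cross-attention maps.

    cross_attn:       (heads, H*W, num_tokens) attention from one denoising
                      step (an assumed layout; adapt to your model)
    entity_token_ids: indices of the prompt tokens naming the entities
    """
    attn = cross_attn.mean(dim=0)               # average over heads -> (H*W, tokens)
    side = int(attn.shape[0] ** 0.5)            # assume a square latent grid
    masks = []
    for tok in entity_token_ids:
        a = attn[:, tok]
        a = (a - a.min()) / (a.max() - a.min() + 1e-8)   # min-max normalize
        masks.append(a.reshape(side, side) > threshold)  # binarize per entity
    return masks
```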