Conditional Text Image Generation with Diffusion Models
- URL: http://arxiv.org/abs/2306.10804v1
- Date: Mon, 19 Jun 2023 09:44:43 GMT
- Title: Conditional Text Image Generation with Diffusion Models
- Authors: Yuanzhi Zhu, Zhaohai Li, Tianwei Wang, Mengchao He, Cong Yao
- Abstract summary: We propose a method called Conditional Text Image Generation with Diffusion Models (CTIG-DM).
Four text image generation modes, namely synthesis, augmentation, recovery, and imitation, can be derived by combining and configuring three conditions: image, text, and style.
CTIG-DM is able to produce image samples that simulate real-world complexity and diversity, and thus can boost the performance of existing text recognizers.
- Score: 18.017541111064602
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current text recognition systems, including those for handwritten scripts and
scene text, have relied heavily on image synthesis and augmentation, since it
is difficult to realize real-world complexity and diversity through collecting
and annotating enough real text images. In this paper, we explore the problem
of text image generation, by taking advantage of the powerful abilities of
Diffusion Models in generating photo-realistic and diverse image samples with
given conditions, and propose a method called Conditional Text Image Generation
with Diffusion Models (CTIG-DM for short). To conform to the characteristics of
text images, we devise three conditions: image condition, text condition, and
style condition, which can be used to control the attributes, contents, and
styles of the samples in the image generation process. Specifically, four text
image generation modes, namely: (1) synthesis mode, (2) augmentation mode, (3)
recovery mode, and (4) imitation mode, can be derived by combining and
configuring these three conditions. Extensive experiments on both handwritten
and scene text demonstrate that the proposed CTIG-DM is able to produce image
samples that simulate real-world complexity and diversity, and thus can boost
the performance of existing text recognizers. Moreover, CTIG-DM shows
appealing potential in domain adaptation and in generating images containing
Out-Of-Vocabulary (OOV) words.
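As a rough illustration of how the image, text, and style conditions might be combined into the four generation modes, here is a minimal Python sketch. The condition-to-mode assignment in `MODES` is a guess made purely for illustration; the paper's actual configuration may differ.

```python
# Illustrative sketch only: the condition-to-mode assignment below is a guess
# for demonstration purposes, not CTIG-DM's actual configuration.
from dataclasses import dataclass
from typing import Optional, Set

import torch


@dataclass
class Conditions:
    image: Optional[torch.Tensor] = None  # image condition (a reference/source text image)
    text: Optional[str] = None            # text condition (the character content to render)
    style: Optional[torch.Tensor] = None  # style condition (e.g. a writer/style embedding)


# Hypothetical mapping from generation mode to the conditions it activates.
MODES = {
    "synthesis":    {"text", "style"},
    "augmentation": {"image", "text"},
    "recovery":     {"image"},
    "imitation":    {"image", "text", "style"},
}


def configure(mode: str, cond: Conditions) -> Conditions:
    """Drop the conditions a given mode does not use before sampling."""
    active: Set[str] = MODES[mode]
    return Conditions(
        image=cond.image if "image" in active else None,
        text=cond.text if "text" in active else None,
        style=cond.style if "style" in active else None,
    )


cond = Conditions(image=torch.randn(1, 1, 64, 256), text="hello", style=torch.randn(1, 128))
print(configure("synthesis", cond).image is None)  # True: this mode ignores the image condition
```

In this reading, choosing a mode simply determines which of the three conditions are passed on to the conditional diffusion sampler.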
Related papers
- Visual Text Generation in the Wild [67.37458807253064]
We propose a visual text generator (termed SceneVTG) which can produce high-quality text images in the wild.
The proposed SceneVTG significantly outperforms traditional rendering-based methods and recent diffusion-based methods in terms of fidelity and reasonability.
The generated images provide superior utility for tasks involving text detection and text recognition.
arXiv Detail & Related papers (2024-07-19T09:08:20Z)
- Diffusion-based Blind Text Image Super-Resolution [20.91578221617732]
We propose an Image Diffusion Model (IDM) to restore text images with realistic styles.
Diffusion models are suitable not only for modeling realistic image distributions but also for learning text distributions.
We also propose a Text Diffusion Model (TDM) for text recognition which can guide IDM to generate text images with correct structures.
arXiv Detail & Related papers (2023-12-13T06:03:17Z)
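The summary above does not spell out how the TDM steers the IDM; the snippet below is a generic classifier-guidance-style sketch of text-guided denoising, with `denoiser` and `text_scorer` as hypothetical stand-ins for the image and text models.

```python
# Minimal classifier-guidance-style sketch of text-guided denoising; this is a
# generic illustration, not the paper's exact TDM -> IDM mechanism.
import torch


def guided_noise_prediction(x_t, t, denoiser, text_scorer, target_text, guidance_scale=1.0):
    """One illustrative guided step.

    denoiser(x_t, t)         -> predicted noise (stand-in for the image diffusion model)
    text_scorer(x_t, text)   -> per-sample score of how well x_t matches `text`
                                (stand-in for the text diffusion model / recognizer)
    """
    x_t = x_t.detach().requires_grad_(True)
    eps = denoiser(x_t, t)                       # unguided noise prediction
    score = text_scorer(x_t, target_text).sum()  # text-correctness score
    grad = torch.autograd.grad(score, x_t)[0]    # direction that improves the score
    return eps - guidance_scale * grad           # simplified guidance update


# Toy stand-ins, purely to make the sketch runnable.
denoiser = lambda x, t: torch.zeros_like(x)
text_scorer = lambda x, text: -x.pow(2).mean(dim=(1, 2, 3))
eps = guided_noise_prediction(torch.randn(2, 3, 32, 128), t=10,
                              denoiser=denoiser, text_scorer=text_scorer,
                              target_text="hello")
print(eps.shape)  # torch.Size([2, 3, 32, 128])
```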
- Scene Text Image Super-resolution based on Text-conditional Diffusion Models [0.0]
Scene Text Image Super-resolution (STISR) has recently achieved great success as a preprocessing method for scene text recognition.
In this study, we leverage text-conditional diffusion models (DMs) for STISR tasks.
We propose a novel framework for LR-HR paired text image datasets.
arXiv Detail & Related papers (2023-11-16T10:32:18Z)
- GlyphDiffusion: Text Generation as Image Generation [100.98428068214736]
We propose GlyphDiffusion, a novel diffusion approach for text generation via text-guided image generation.
Our key idea is to render the target text as a glyph image containing visual language content.
Our model also makes significant improvements over a recent diffusion-based baseline.
arXiv Detail & Related papers (2023-04-25T02:14:44Z)
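The glyph-rendering idea behind GlyphDiffusion can be sketched in a few lines of Pillow; the canvas size, font, and layout below are arbitrary placeholders, not the paper's actual rendering setup.

```python
# Minimal sketch of rendering target text as a glyph image; rendering details
# (size, font, placement) are assumptions, not GlyphDiffusion's pipeline.
from PIL import Image, ImageDraw, ImageFont


def render_glyph_image(text: str, width: int = 256, height: int = 64) -> Image.Image:
    """Rasterize `text` onto a white canvas as a grayscale glyph image."""
    canvas = Image.new("L", (width, height), color=255)   # white background
    draw = ImageDraw.Draw(canvas)
    font = ImageFont.load_default()                       # placeholder font choice
    draw.text((4, height // 4), text, fill=0, font=font)  # black glyphs
    return canvas


glyphs = render_glyph_image("diffusion models")
glyphs.save("glyph_target.png")
```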
- Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images with complex semantics drawn from both the input texts and the input images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
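As a rough sketch of encoding text and subject-image features into one shared conditioning sequence, the following uses plain linear projections; UMM-Diffusion's actual encoders, dimensions, and fusion scheme will differ.

```python
# Sketch of fusing text and subject-image features into one conditioning
# sequence; encoder choices and dimensions are placeholders.
import torch
import torch.nn as nn


class UnifiedConditioner(nn.Module):
    def __init__(self, text_dim=512, image_dim=768, latent_dim=640):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, latent_dim)    # map text tokens into the shared space
        self.image_proj = nn.Linear(image_dim, latent_dim)  # map subject-image features likewise

    def forward(self, text_tokens, image_tokens):
        # Concatenate along the sequence axis -> one multi-modal conditioning sequence
        fused = torch.cat([self.text_proj(text_tokens), self.image_proj(image_tokens)], dim=1)
        return fused  # would be fed to the diffusion model's cross-attention layers


cond = UnifiedConditioner()(torch.randn(1, 77, 512), torch.randn(1, 4, 768))
print(cond.shape)  # torch.Size([1, 81, 640])
```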
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models specialized for different synthesis stages.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
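The ensemble idea behind eDiffi can be sketched as routing each denoising step to a stage-specialized expert, so only one network runs per step; the interval boundaries and toy denoisers below are assumptions for illustration.

```python
# Sketch of routing each denoising step to a stage-specialized expert denoiser;
# the number of experts and interval boundaries are made up for illustration.
import torch

T = 1000  # total diffusion steps (assumed)


def pick_expert(t: int, experts: list):
    """Route high-, mid-, and low-noise steps to different experts."""
    if t > 2 * T // 3:
        return experts[0]   # early (high-noise) stage: global layout
    if t > T // 3:
        return experts[1]   # middle stage
    return experts[2]       # late (low-noise) stage: fine details, text alignment


# Toy stand-ins: each "expert" is a callable denoiser with the same signature,
# so per-step inference cost matches a single model.
experts = [lambda x, t: torch.zeros_like(x) for _ in range(3)]
x = torch.randn(1, 3, 64, 64)
for t in reversed(range(T)):
    eps = pick_expert(t, experts)(x, t)  # only one expert runs per step
```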
- Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors [58.71128866226768]
Recent text-to-image generation methods have incrementally improved the generated image fidelity and text relevancy.
We propose a novel text-to-image method that addresses these gaps by enabling a simple control mechanism, complementary to text, in the form of a scene.
Our model achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high fidelity images in a resolution of 512x512 pixels.
arXiv Detail & Related papers (2022-03-24T15:44:50Z)
- Text to Image Generation with Semantic-Spatial Aware GAN [41.73685713621705]
A text-to-image generation (T2I) model aims to generate photo-realistic images that are semantically consistent with the given text descriptions.
We propose a novel framework Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information.
arXiv Detail & Related papers (2021-04-01T15:48:01Z)
- TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
A StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
Visual-linguistic similarity learning performs text-image matching by mapping the image and text into a common embedding space.
Instance-level optimization preserves identity during manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)
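The text-image matching performed by TediGAN's visual-linguistic similarity module can be sketched as projecting both modalities into a common embedding space and scoring them with cosine similarity; the linear encoders below are dummy stand-ins, not the paper's networks.

```python
# Sketch of text-image matching in a common embedding space; the encoders here
# are simple projections for illustration, not TediGAN's actual networks.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CommonSpaceMatcher(nn.Module):
    def __init__(self, image_dim=512, text_dim=256, embed_dim=128):
        super().__init__()
        self.image_head = nn.Linear(image_dim, embed_dim)  # image features -> common space
        self.text_head = nn.Linear(text_dim, embed_dim)    # text features  -> common space

    def forward(self, image_feat, text_feat):
        img = F.normalize(self.image_head(image_feat), dim=-1)
        txt = F.normalize(self.text_head(text_feat), dim=-1)
        return (img * txt).sum(dim=-1)  # cosine similarity per (image, text) pair


matcher = CommonSpaceMatcher()
sim = matcher(torch.randn(4, 512), torch.randn(4, 256))
print(sim.shape)  # torch.Size([4])
```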