UDiffText: A Unified Framework for High-quality Text Synthesis in
Arbitrary Images via Character-aware Diffusion Models
- URL: http://arxiv.org/abs/2312.04884v1
- Date: Fri, 8 Dec 2023 07:47:46 GMT
- Title: UDiffText: A Unified Framework for High-quality Text Synthesis in
Arbitrary Images via Character-aware Diffusion Models
- Authors: Yiming Zhao, Zhouhui Lian
- Abstract summary: This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves the design and training of a lightweight character-level text encoder, which replaces the original CLIP encoder.
By employing an inference-stage refinement process, we achieve notably high sequence accuracy when synthesizing text in arbitrarily given images.
- Score: 25.219960711604728
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-Image (T2I) generation methods based on diffusion models have garnered
significant attention in the last few years. Although these image synthesis
methods produce visually appealing results, they frequently exhibit spelling
errors when rendering text within the generated images. Such errors manifest as
missing, incorrect or extraneous characters, thereby severely constraining the
performance of text image generation based on diffusion models. To address the
aforementioned issue, this paper proposes a novel approach for text image
generation, utilizing a pre-trained diffusion model (i.e., Stable Diffusion
[27]). Our approach involves the design and training of a lightweight
character-level text encoder, which replaces the original CLIP encoder and
provides more robust text embeddings as conditional guidance. Then, we
fine-tune the diffusion model using a large-scale dataset, incorporating local
attention control under the supervision of character-level segmentation maps.
Finally, by employing an inference-stage refinement process, we achieve
notably high sequence accuracy when synthesizing text in arbitrarily given
images. Both qualitative and quantitative results demonstrate the superiority
of our method over the state of the art. Furthermore, we showcase several
potential applications of the proposed UDiffText, including text-centric image
synthesis, scene text editing, etc. Code and model will be available at
https://github.com/ZYM-PKU/UDiffText .
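To make the conditioning change concrete, below is a minimal sketch of a character-level text encoder of the kind the abstract describes, standing in for the CLIP text encoder as the source of cross-attention embeddings. It is an illustration only, not the authors' implementation; the character vocabulary, sequence length, embedding width, and transformer depth are all assumptions.

```python
import string
import torch
import torch.nn as nn

class CharLevelTextEncoder(nn.Module):
    """Minimal sketch of a character-level text encoder.

    Each character of the target string gets its own token, so the
    conditioning sequence preserves exact spelling, unlike a BPE/CLIP
    tokenizer that may merge or split characters unpredictably.
    """

    def __init__(self, max_len: int = 24, dim: int = 512, depth: int = 4):
        super().__init__()
        self.vocab = string.printable                       # assumed character set
        self.char_emb = nn.Embedding(len(self.vocab) + 1, dim)  # index 0 = padding
        self.pos_emb = nn.Parameter(torch.zeros(max_len, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.max_len = max_len

    def forward(self, texts: list[str]) -> torch.Tensor:
        ids = torch.zeros(len(texts), self.max_len, dtype=torch.long)
        for i, t in enumerate(texts):
            for j, ch in enumerate(t[: self.max_len]):
                ids[i, j] = self.vocab.index(ch) + 1        # shift past the padding index
        x = self.char_emb(ids) + self.pos_emb
        # Output shape (batch, max_len, dim): per-character embeddings that the
        # diffusion U-Net would consume through its cross-attention layers in
        # place of CLIP text embeddings.
        return self.encoder(x)

# Usage sketch: embeddings = CharLevelTextEncoder()(["Hello", "UDiffText"])
```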
Related papers
- ARTIST: Improving the Generation of Text-rich Images with Disentangled Diffusion Models [52.23899502520261]
We introduce a new framework named ARTIST that focuses on the learning of text structures.
We finetune a visual diffusion model, enabling it to assimilate textual structure information from the pretrained textual model.
Empirical results on the MARIO-Eval benchmark underscore the effectiveness of the proposed method, showing an improvement of up to 15% in various metrics.
arXiv Detail & Related papers (2024-06-17T19:31:24Z)
- Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering [118.53208190209517]
We propose a framework to learn the proper textual descriptions for diffusion models through prompt learning.
Our method effectively learns prompts that improve the match between the input text and the generated images.
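Read at face value, prompt learning of this kind keeps the diffusion model frozen and optimizes a small set of continuous prompt embeddings that are prepended to the encoded text. The sketch below assumes that generic recipe; the module name, token count, and training objective are hypothetical rather than taken from the paper.

```python
import torch
import torch.nn as nn

class LearnedPrompt(nn.Module):
    """Continuous prompt tokens prepended to the frozen text embedding."""

    def __init__(self, n_tokens: int = 8, dim: int = 768):
        super().__init__()
        # Only these embeddings receive gradients; the diffusion model stays frozen.
        self.soft_tokens = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        # text_emb: (batch, seq, dim) output of the frozen text encoder.
        batch = text_emb.shape[0]
        prefix = self.soft_tokens.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, text_emb], dim=1)

# Training sketch: optimize only the soft tokens against the usual denoising
# loss (or an image-text matching score), e.g.
#   opt = torch.optim.AdamW(prompt_module.parameters(), lr=1e-3)
```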
arXiv Detail & Related papers (2024-01-12T03:46:29Z)
- Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models [63.99110667987318]
We present DiffText, a pipeline that seamlessly blends foreground text with the background's intrinsic features.
Even with fewer text instances, the text images we produce consistently surpass other synthetic data in aiding text detectors.
arXiv Detail & Related papers (2023-11-28T06:51:28Z)
- Towards Real-time Text-driven Image Manipulation with Unconditional Diffusion Models [33.993466872389085]
We develop a novel algorithm that learns image manipulations 4.5-10 times faster and applies them 8 times faster.
Our approach can adapt the pretrained model to the user-specified image and text description on the fly in just 4 seconds.
arXiv Detail & Related papers (2023-04-10T01:21:56Z)
- Discriminative Class Tokens for Text-to-Image Diffusion Models [107.98436819341592]
We propose a non-invasive fine-tuning technique that capitalizes on the expressive potential of free-form text.
Our method is fast compared to prior fine-tuning methods and does not require a collection of in-class images.
We evaluate our method extensively, showing that the generated images (i) are more accurate and of higher quality than those from standard diffusion models, (ii) can be used to augment training data in a low-resource setting, and (iii) reveal information about the data used to train the guiding classifier.
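As summarized, the method appears to optimize a single learnable token embedding, spliced into the prompt, using gradients from a frozen classifier, with no diffusion-model fine-tuning and no in-class images. A rough sketch under those assumptions (backpropagation-through-sampling details are omitted, and all names are hypothetical):

```python
import torch
import torch.nn as nn

class ClassToken(nn.Module):
    """A single trainable token embedding spliced into the prompt embedding."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.token = nn.Parameter(torch.randn(1, 1, dim) * 0.02)

    def splice(self, text_emb: torch.Tensor, position: int) -> torch.Tensor:
        # Insert the token at a chosen position in the (batch, seq, dim) sequence.
        b = text_emb.shape[0]
        tok = self.token.expand(b, -1, -1)
        return torch.cat([text_emb[:, :position], tok, text_emb[:, position:]], dim=1)

def class_guidance_loss(decoded_images, classifier, target_class):
    # A frozen classifier scores the (differentiably) decoded images; the
    # gradient flows back only into the class token, not the diffusion model.
    logits = classifier(decoded_images)
    return nn.functional.cross_entropy(logits, target_class)
```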
arXiv Detail & Related papers (2023-03-30T05:25:20Z)
- Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images whose complex semantics draw on both the input texts and the input images.
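One plausible reading of a unified multi-modal latent space is that text tokens and subject-image features are projected into a single conditioning sequence for the diffusion model's cross-attention. The sketch below assumes a simple projection-and-concatenation scheme; the actual fusion in UMM-Diffusion may differ, and the encoder interfaces shown are assumptions.

```python
import torch
import torch.nn as nn

class JointTextImageConditioner(nn.Module):
    """Fuse text tokens and subject-image features into one conditioning sequence."""

    def __init__(self, text_encoder, image_encoder, img_dim: int = 1024, dim: int = 768):
        super().__init__()
        self.text_encoder = text_encoder        # e.g. a frozen CLIP text tower
        self.image_encoder = image_encoder      # e.g. a frozen CLIP image tower
        self.project = nn.Linear(img_dim, dim)  # map image features into the text token space

    def forward(self, prompts, subject_images):
        txt = self.text_encoder(prompts)              # (batch, seq, dim)
        img = self.image_encoder(subject_images)      # (batch, img_dim)
        img_tok = self.project(img).unsqueeze(1)      # (batch, 1, dim)
        # One unified conditioning sequence for the diffusion U-Net's cross-attention.
        return torch.cat([txt, img_tok], dim=1)
```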
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models specialized for different synthesis stages.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
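The ensemble idea can be pictured as routing each denoising step to an expert trained for its portion of the noise schedule, so only one expert runs per step and the per-step inference cost matches a single model. A minimal sketch, with the interval boundaries and expert interface assumed rather than taken from the paper:

```python
import torch

class ExpertEnsembleDenoiser:
    """Pick a specialized denoiser based on where we are in the noise schedule."""

    def __init__(self, experts, boundaries):
        # experts: list of denoising networks, one per timestep interval
        # boundaries: descending timestep thresholds, e.g. [800, 400] for 3 experts
        self.experts = experts
        self.boundaries = boundaries

    def __call__(self, x_t: torch.Tensor, t: int, cond: torch.Tensor) -> torch.Tensor:
        idx = sum(t < b for b in self.boundaries)  # later (less noisy) steps use later experts
        # Only one expert is evaluated per step, so per-step cost matches a single model.
        return self.experts[idx](x_t, t, cond)
```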
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.