GlyphDraw: Seamlessly Rendering Text with Intricate Spatial Structures
  in Text-to-Image Generation
- URL: http://arxiv.org/abs/2303.17870v2
- Date: Tue, 23 May 2023 04:07:00 GMT
- Title: GlyphDraw: Seamlessly Rendering Text with Intricate Spatial Structures
  in Text-to-Image Generation
- Authors: Jian Ma, Mingjun Zhao, Chen Chen, Ruichen Wang, Di Niu, Haonan Lu,
  Xiaodong Lin
- Abstract summary: We introduce GlyphDraw, a general learning framework aiming to endow image generation models with the capacity to generate images coherently embedded with text for any specific language.
Our method not only produces accurate language characters as in prompts, but also seamlessly blends the generated text into the background.
- Score: 18.396131717250793
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent breakthroughs in the field of language-guided image generation have
yielded impressive achievements, enabling the creation of high-quality and
diverse images based on user instructions. Although the synthesis performance is
fascinating, one significant limitation of current image generation models is
their insufficient ability to generate text coherently within images,
particularly for complex glyph structures like Chinese characters. To address
this problem, we introduce GlyphDraw, a general learning framework aiming to
endow image generation models with the capacity to generate images coherently
embedded with text for any specific language. We first sophisticatedly design
the image-text dataset's construction strategy, then build our model
specifically on a diffusion-based image generator and carefully modify the
network structure to allow the model to learn drawing language characters with
the help of glyph and position information. Furthermore, we maintain the model's
open-domain image synthesis capability by preventing catastrophic forgetting by
using parameter-efficient fine-tuning techniques. Extensive qualitative and
quantitative experiments demonstrate that our method not only produces accurate
language characters as in prompts, but also seamlessly blends the generated
text into the background. Please refer to our project page
(https://1073521013.github.io/glyph-draw.github.io/).
 
      
Related papers
- Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models [76.68654868991517]
 Long-form text in images, such as paragraphs in slides or documents, remains a major challenge for current generative models.
We introduce a novel text-focused, binary tokenizer optimized for capturing detailed scene text features.
We develop ModelName, a multimodal autoregressive model that excels in generating high-quality long-text images with unprecedented fidelity.
 arXiv  Detail & Related papers  (2025-03-26T03:44:25Z)
- Beyond Flat Text: Dual Self-inherited Guidance for Visual Text Generation [17.552733309504486]
 In real-world images, slanted or curved texts, especially those on cans, banners, or badges, appear as frequently as flat texts due to artistic design or layout constraints.
We introduce a new training-free framework, STGen, which accurately generates visual texts in challenging scenarios.
 arXiv  Detail & Related papers  (2025-01-10T11:44:59Z)
- Conditional Text-to-Image Generation with Reference Guidance [81.99538302576302]
 This paper explores using additional conditions of an image that provides visual guidance of the particular subjects for diffusion models to generate.
We develop several small-scale expert plugins that efficiently endow a Stable Diffusion model with the capability to take different references.
Our expert plugins outperform existing methods on all tasks, with each plugin containing only 28.55M trainable parameters.
 arXiv  Detail & Related papers  (2024-11-22T21:38:51Z)
- Prompt-Consistency Image Generation (PCIG): A Unified Framework Integrating LLMs, Knowledge Graphs, and Controllable Diffusion Models [20.19571676239579]
 We introduce a novel diffusion-based framework to enhance the alignment of generated images with their corresponding descriptions.
Our framework is built upon a comprehensive analysis of inconsistency phenomena, categorizing them based on their manifestation in the image.
We then integrate a state-of-the-art controllable image generation model with a visual text generation module to generate an image that is consistent with the original prompt.
 arXiv  Detail & Related papers  (2024-06-24T06:12:16Z)
- ARTIST: Improving the Generation of Text-rich Images with Disentangled Diffusion Models [52.23899502520261]
 We introduce a new framework named ARTIST to focus on the learning of text structures.
We finetune a visual diffusion model, enabling it to assimilate textual structure information from the pretrained textual model.
 Empirical results on the MARIO-Eval benchmark underscore the effectiveness of the proposed method, showing an improvement of up to 15% in various metrics.
 arXiv  Detail & Related papers  (2024-06-17T19:31:24Z)
- AutoStory: Generating Diverse Storytelling Images with Minimal Human
  Effort [55.83007338095763]
 We propose an automated story visualization system that can effectively generate diverse, high-quality, and consistent sets of story images.
We utilize the comprehension and planning capabilities of large language models for layout planning, and then leverage large-scale text-to-image models to generate sophisticated story images.
 arXiv  Detail & Related papers  (2023-11-19T06:07:37Z)
- LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image
  Generation [121.45667242282721]
 We propose a coarse-to-fine paradigm to achieve layout planning and image generation.
Our proposed method outperforms the state-of-the-art models in terms of photorealistic layout and image generation.
 arXiv  Detail & Related papers  (2023-08-09T17:45:04Z)
- GlyphDiffusion: Text Generation as Image Generation [100.98428068214736]
 We propose GlyphDiffusion, a novel diffusion approach for text generation via text-guided image generation.
Our key idea is to render the target text as a glyph image containing visual language content.
Our model also makes significant improvements compared to the recent diffusion model.
 arXiv  Detail & Related papers  (2023-04-25T02:14:44Z)
- Unified Multi-Modal Latent Diffusion for Joint Subject and Text
  Conditional Image Generation [63.061871048769596]
 We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images with complex semantics from both aspects of input texts and images.
 arXiv  Detail & Related papers  (2023-03-16T13:50:20Z)
- Plug-and-Play Diffusion Features for Text-Driven Image-to-Image
  Translation [10.39028769374367]
 We present a new framework that takes text-to-image synthesis to the realm of image-to-image translation.
Our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text.
 arXiv  Detail & Related papers  (2022-11-22T20:39:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     