TextDiffuser-RL: Efficient and Robust Text Layout Optimization for High-Fidelity Text-to-Image Synthesis
- URL: http://arxiv.org/abs/2505.19291v2
- Date: Thu, 17 Jul 2025 18:39:52 GMT
- Title: TextDiffuser-RL: Efficient and Robust Text Layout Optimization for High-Fidelity Text-to-Image Synthesis
- Authors: Kazi Mahathir Rahman, Showrin Rahman, Sharmin Sultana Srishty
- Abstract summary: We propose a novel two-stage pipeline that integrates reinforcement learning (RL) for rapid and optimized text layout generation with a diffusion-based image synthesis model. Our framework maintains or surpasses TextDiffuser-2's quality in text placement and image synthesis, with markedly faster runtime and increased flexibility. Our approach has been evaluated on the MARIOEval benchmark, achieving OCR and CLIPScore metrics close to state-of-the-art models.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-embedded image generation plays a critical role in industries such as graphic design, advertising, and digital content creation. Text-to-image generation methods leveraging diffusion models, such as TextDiffuser-2, have demonstrated promising results in producing images with embedded text. TextDiffuser-2 effectively generates bounding box layouts that guide the rendering of visual text, achieving high fidelity and coherence. However, existing approaches often rely on resource-intensive processes and are limited in their ability to run efficiently on both CPU and GPU platforms. To address these challenges, we propose a novel two-stage pipeline that integrates reinforcement learning (RL) for rapid and optimized text layout generation with a diffusion-based image synthesis model. Our RL-based approach significantly accelerates the bounding box prediction step while reducing overlaps, allowing the system to run efficiently on both CPUs and GPUs. Extensive evaluations demonstrate that our framework maintains or surpasses TextDiffuser-2's quality in text placement and image synthesis, with markedly faster runtime and increased flexibility. Our approach has been evaluated on the MARIOEval benchmark, achieving OCR and CLIPScore metrics close to state-of-the-art models, while running 97.64% faster and requiring only 2 MB of memory.
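The abstract describes the RL stage only at a high level; as a rough Python sketch of the kind of reward such a layout agent could optimize, penalizing box overlap and off-canvas spill (the box representation, penalty weights, and all names below are illustrative assumptions, not the authors' code):

```python
# A rough sketch of an overlap-penalizing layout reward (illustrative
# assumptions throughout; this is not the authors' implementation).
from dataclasses import dataclass

@dataclass
class Box:
    x: float  # left edge, normalized to [0, 1]
    y: float  # top edge, normalized to [0, 1]
    w: float  # width
    h: float  # height

def overlap_area(a: Box, b: Box) -> float:
    """Area of intersection between two axis-aligned boxes."""
    dx = min(a.x + a.w, b.x + b.w) - max(a.x, b.x)
    dy = min(a.y + a.h, b.y + b.h) - max(a.y, b.y)
    return max(dx, 0.0) * max(dy, 0.0)

def layout_reward(boxes: list[Box], overlap_weight: float = 10.0) -> float:
    """Higher is better: penalize pairwise overlap and off-canvas spill."""
    penalty = 0.0
    for i, a in enumerate(boxes):
        # Penalize any part of a box that spills past the unit canvas.
        penalty += max(0.0, a.x + a.w - 1.0) + max(0.0, a.y + a.h - 1.0)
        for b in boxes[i + 1:]:
            penalty += overlap_weight * overlap_area(a, b)
    return -penalty

# A policy trained with, e.g., PPO would propose one box per text line
# and receive this reward; zero means a clean, non-overlapping layout.
print(layout_reward([Box(0.1, 0.1, 0.4, 0.1), Box(0.1, 0.3, 0.5, 0.1)]))  # -0.0
```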
Related papers
- Lumina-Image 2.0: A Unified and Efficient Image Generative Framework [76.44331001702379]
Lumina-Image 2.0 is a text-to-image generation framework that achieves significant progress compared to previous work. It adopts a unified architecture (Unified Next-DiT) that treats text and image tokens as a joint sequence, and introduces a unified captioning system, Unified Captioner (UniCap), specifically designed for T2I generation tasks.
arXiv Detail & Related papers (2025-03-27T17:57:07Z)
- NFIG: Autoregressive Image Generation with Next-Frequency Prediction [50.69346038028673]
We present Next-Frequency Image Generation (NFIG), a novel framework that decomposes the image generation process into multiple frequency-guided stages. Our approach first generates low-frequency components to establish global structure with fewer tokens, then progressively adds higher-frequency details, following the natural spectral hierarchy of images (a toy frequency-split sketch follows this entry).
arXiv Detail & Related papers (2025-03-10T08:59:10Z)
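The decomposition NFIG describes can be pictured with a plain Fourier low-pass split; a minimal sketch, assuming a grayscale image and an arbitrary cutoff (NFIG's actual tokenizer and staging are not given in this summary):

```python
# A toy illustration (not NFIG's tokenizer) of splitting an image into
# the coarse low-frequency structure and the fine high-frequency detail.
import numpy as np

def frequency_split(image: np.ndarray, cutoff: int = 8):
    """Split a grayscale image into low- and high-frequency parts."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    yy, xx = np.ogrid[:h, :w]
    # Keep only frequencies within `cutoff` of the spectrum center.
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= cutoff ** 2
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * mask)).real
    return low, image - low  # global structure, high-frequency detail

rng = np.random.default_rng(0)
img = rng.random((64, 64))
low, high = frequency_split(img)
print(np.allclose(low + high, img))  # True: the two bands reconstruct the image
```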
- Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think [38.258453761376586]
We propose Dream Engine, an efficient framework designed for arbitrary text-image interleaved control in image generation models. Our approach utilizes a two-stage training paradigm, consisting of joint text-image alignment and multimodal interleaved instruction tuning. Our experiments demonstrate that this training method is effective, achieving a 0.69 overall score on the GenEval benchmark.
arXiv Detail & Related papers (2025-02-27T15:08:39Z)
- Coherent and Multi-modality Image Inpainting via Latent Space Optimization [61.99406669027195]
PILOT (inPainting vIa Latent OpTimization) is an optimization approach grounded on a novel semantic centralization and background preservation loss. Our method searches latent spaces capable of generating inpainted regions that exhibit high fidelity to user-provided prompts while maintaining coherence with the background (a hedged sketch of this kind of objective follows this entry).
arXiv Detail & Related papers (2024-07-10T19:58:04Z)
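As a hedged sketch of latent optimization with a background-preservation term in the spirit of the summary above (the decoder, scoring function, and weights are stand-ins, not PILOT's actual models or losses):

```python
# A hedged sketch of latent optimization with a background-preservation
# term; `decoder`, `prompt_score`, and all weights are invented here.
import torch

def optimize_latent(z: torch.Tensor, decoder, prompt_score,
                    background: torch.Tensor, mask: torch.Tensor,
                    steps: int = 100, bg_weight: float = 1.0) -> torch.Tensor:
    """Search latent space for an image that matches the prompt inside
    `mask` while staying close to `background` everywhere else."""
    z = z.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=1e-2)
    for _ in range(steps):
        image = decoder(z)
        # Fidelity to the user prompt inside the inpainting mask.
        loss = -prompt_score(image * mask)
        # Preserve unmasked pixels of the original background.
        loss = loss + bg_weight * ((image - background) * (1 - mask)).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()

# Smoke test with stand-in models: decoder is the identity and the
# "prompt score" simply rewards bright pixels inside the mask.
bg = torch.zeros(1, 3, 8, 8)
mask = torch.zeros(1, 1, 8, 8)
mask[..., 2:6, 2:6] = 1.0
z_opt = optimize_latent(torch.rand(1, 3, 8, 8), lambda z: z,
                        lambda x: x.mean(), bg, mask)
```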
- OrientDream: Streamlining Text-to-3D Generation with Explicit Orientation Control [66.03885917320189]
OrientDream is a camera orientation conditioned framework for efficient and multi-view consistent 3D generation from textual prompts.
Our strategy emphasizes the implementation of an explicit camera orientation conditioned feature in the pre-training of a 2D text-to-image diffusion module.
Our experiments reveal that our method not only produces high-quality NeRF models with consistent multi-view properties but also optimizes significantly faster than existing methods.
arXiv Detail & Related papers (2024-06-14T13:16:18Z)
- Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT [120.39362661689333]
We present an improved version of Lumina-T2X, showcasing stronger generation performance with increased training and inference efficiency.
Thanks to these improvements, Lumina-Next not only improves the quality and efficiency of basic text-to-image generation but also demonstrates superior resolution extrapolation capabilities.
arXiv Detail & Related papers (2024-06-05T17:53:26Z)
- TextCraftor: Your Text Encoder Can be Image Quality Controller [65.27457900325462]
Diffusion-based text-to-image generative models, e.g., Stable Diffusion, have revolutionized the field of content generation.
We propose a fine-tuning approach, TextCraftor, to enhance the performance of text-to-image diffusion models.
arXiv Detail & Related papers (2024-03-27T19:52:55Z)
- Refining Text-to-Image Generation: Towards Accurate Training-Free Glyph-Enhanced Image Generation [5.55027585813848]
The capability to generate visual text is crucial, offering both academic interest and a wide range of practical applications.
We introduce a benchmark, LenCom-Eval, specifically designed for testing models' capability in generating images with Lengthy and Complex visual text.
We demonstrate notable improvements across a range of evaluation metrics, including CLIPScore, OCR precision, recall, F1 score, accuracy, and edit distance (a reference edit-distance implementation follows this entry).
arXiv Detail & Related papers (2024-03-25T04:54:49Z)
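For reference, the edit distance reported in such OCR evaluations is typically the Levenshtein distance between the recognized text and the target string; a textbook implementation (not the benchmark's own script):

```python
# Standard Levenshtein edit distance, computed row by row in O(len(a)*len(b)).
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(edit_distance("diffusion", "diffuser"))  # 3
```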
- UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves the design and training of a light-weight character-level text encoder, which replaces the original CLIP encoder.
By employing an inference stage refinement process, we achieve a notably high sequence accuracy when synthesizing text in arbitrarily given images (a toy character-level encoder sketch follows this entry).
arXiv Detail & Related papers (2023-12-08T07:47:46Z)
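A minimal sketch of what a light-weight character-level text encoder could look like; UDiffText's real vocabulary, width, and depth are not given in this summary, so every size below is an assumption:

```python
# A toy character-level text encoder: one token per character, a small
# Transformer producing a per-character conditioning sequence.
import torch
import torch.nn as nn

class CharTextEncoder(nn.Module):
    def __init__(self, vocab_size: int = 128, dim: int = 256,
                 max_len: int = 64, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.pos = nn.Parameter(torch.zeros(1, max_len, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, text: str) -> torch.Tensor:
        # ASCII codepoints serve as the character vocabulary here.
        ids = torch.tensor([[min(ord(c), 127) for c in text]])
        x = self.embed(ids) + self.pos[:, : ids.size(1)]
        return self.encoder(x)  # shape (1, len(text), dim)

emb = CharTextEncoder()("HELLO")
print(emb.shape)  # torch.Size([1, 5, 256])
```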
- TextDiff: Mask-Guided Residual Diffusion Models for Scene Text Image Super-Resolution [17.95994419104427]
TextDiff is a diffusion-based framework tailored for scene text image super-resolution. It achieves state-of-the-art (SOTA) performance on public benchmark datasets. Our proposed MRD module is plug-and-play and effectively sharpens the text edges produced by SOTA methods.
arXiv Detail & Related papers (2023-08-13T11:02:16Z)
- Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation [93.18163456287164]
This paper proposes a novel text-guided video-to-video translation framework to adapt image models to videos.
Our framework achieves global style and local texture temporal consistency at a low cost.
arXiv Detail & Related papers (2023-06-13T17:52:23Z)
- Text to Image Generation with Semantic-Spatial Aware GAN [41.73685713621705]
A text-to-image (T2I) generation model aims to generate photo-realistic images that are semantically consistent with the text descriptions.
We propose a novel framework, Semantic-Spatial Aware GAN, which is trained end-to-end so that the text encoder can exploit better text information.
arXiv Detail & Related papers (2021-04-01T15:48:01Z)
- DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis [80.54273334640285]
We propose a novel one-stage text-to-image backbone that directly synthesizes high-resolution images without entanglements between different generators.
We also propose a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty and One-Way Output.
Compared with current state-of-the-art methods, our proposed DF-GAN is simpler yet more efficient at synthesizing realistic, text-matching images (a hedged gradient-penalty sketch follows this entry).
arXiv Detail & Related papers (2020-08-13T12:51:17Z)
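A hedged sketch of a matching-aware gradient penalty in the spirit of DF-GAN's MA-GP: the discriminator's gradient is penalized at real images paired with their matching text, smoothing the loss surface around real-and-matched data points. The k=2, p=6 defaults follow commonly reported values, and the tiny stand-in discriminator is invented for the smoke test; treat the whole block as an assumption rather than the paper's code:

```python
# Matching-aware gradient penalty sketch: penalize the norm of the
# discriminator's gradient w.r.t. a real image and its matching text.
import torch

def ma_gp(disc, image: torch.Tensor, sent: torch.Tensor,
          k: float = 2.0, p: float = 6.0) -> torch.Tensor:
    image = image.requires_grad_(True)
    sent = sent.requires_grad_(True)
    out = disc(image, sent).sum()
    g_img, g_sent = torch.autograd.grad(out, (image, sent),
                                        create_graph=True)
    grad_norm = (g_img.flatten(1).norm(dim=1)
                 + g_sent.flatten(1).norm(dim=1))
    return k * grad_norm.pow(p).mean()

# Smoke test with a stand-in discriminator over (image, sentence) pairs.
disc = lambda x, s: x.mean(dim=(1, 2, 3)) + s.mean(dim=1)
loss = ma_gp(disc, torch.rand(4, 3, 16, 16), torch.rand(4, 128))
loss.backward()  # create_graph=True keeps the penalty differentiable
```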
This list is automatically generated from the titles and abstracts of the papers on this site.