DCText: Scheduled Attention Masking for Visual Text Generation via Divide-and-Conquer Strategy
- URL: http://arxiv.org/abs/2512.01302v2
- Date: Mon, 08 Dec 2025 05:26:07 GMT
- Title: DCText: Scheduled Attention Masking for Visual Text Generation via Divide-and-Conquer Strategy
- Authors: Jaewoo Song, Jooyoung Choi, Kanghyun Baek, Sangyub Lee, Daemin Park, Sungroh Yoon
- Abstract summary: DCText is a training-free visual text generation method that adopts a divide-and-conquer strategy. Our method first decomposes a prompt by extracting and dividing the target text, then assigns each segment to a designated region. Experiments on single- and multi-sentence benchmarks show that DCText achieves the best text accuracy without compromising image quality.
- Score: 41.781258763025896
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite recent text-to-image models achieving high-fidelity text rendering, they still struggle with long or multiple texts due to diluted global attention. We propose DCText, a training-free visual text generation method that adopts a divide-and-conquer strategy, leveraging the reliable short-text generation of Multi-Modal Diffusion Transformers. Our method first decomposes a prompt by extracting and dividing the target text, then assigns each segment to a designated region. To accurately render each segment within its region while preserving overall image coherence, we introduce two attention masks, Text-Focus and Context-Expansion, applied sequentially during denoising. Additionally, Localized Noise Initialization further improves text accuracy and region alignment without increasing computational cost. Extensive experiments on single- and multi-sentence benchmarks show that DCText achieves the best text accuracy without compromising image quality while also delivering the lowest generation latency.
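As a rough, hypothetical illustration of the mechanism described above (not the authors' released implementation), the PyTorch sketch below builds a Text-Focus mask that confines each extracted text segment's prompt tokens to its assigned image region, a Context-Expansion mask that restores full joint attention, and a scheduler that switches between the two partway through denoising. The [prompt | image] token layout, the grid size, and the 0.5 switch point are all assumptions.

```python
# Hypothetical sketch of DCText-style scheduled attention masking.
# Assumes an MM-DiT joint sequence [prompt tokens | image tokens] over a
# square latent grid; the paper's exact mask construction is not shown here.
import torch
import torch.nn.functional as F

def region_token_ids(box, grid):
    """Flat image-token indices covered by a normalized (x0, y0, x1, y1) box."""
    x0, y0, x1, y1 = (int(round(v * grid)) for v in box)
    ys, xs = torch.meshgrid(torch.arange(y0, y1), torch.arange(x0, x1), indexing="ij")
    return (ys * grid + xs).flatten()

def build_masks(n_prompt, n_image, segments, grid):
    """segments: list of (prompt_token_ids, region_box) pairs, one per text
    segment. Returns boolean (n, n) masks where True = attention allowed."""
    n = n_prompt + n_image
    context_expansion = torch.ones(n, n, dtype=torch.bool)  # full joint attention
    text_focus = context_expansion.clone()
    for tok_ids, box in segments:
        tok = torch.as_tensor(tok_ids)
        inside = torch.zeros(n_image, dtype=torch.bool)
        inside[region_token_ids(box, grid)] = True
        outside = (~inside).nonzero().squeeze(1) + n_prompt  # image tokens only
        # Block attention between this segment's prompt tokens and image
        # tokens outside its assigned region, in both directions.
        text_focus[tok.unsqueeze(1), outside] = False
        text_focus[outside.unsqueeze(1), tok] = False
    return text_focus, context_expansion

def scheduled_mask(step, total_steps, masks, switch_frac=0.5):
    """Text-Focus early (glyph formation), Context-Expansion late (coherence).
    The 0.5 switch point is an arbitrary placeholder, not a published value."""
    text_focus, context_expansion = masks
    return text_focus if step < switch_frac * total_steps else context_expansion

# Usage inside a (mock) attention call: True entries attend, False are masked.
n_prompt, grid = 16, 8
masks = build_masks(n_prompt, grid * grid, [([3, 4], (0.0, 0.0, 0.5, 0.25))], grid)
q = k = v = torch.randn(1, 4, n_prompt + grid * grid, 32)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=scheduled_mask(2, 28, masks))
```

The intuition behind the schedule: masked early steps let each region form its glyphs without interference from the rest of the prompt, while the later unmasked steps blend the regions into a coherent image.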
Related papers
- TextGuider: Training-Free Guidance for Text Rendering via Attention Alignment [68.91073792449201]
We propose TextGuider, a training-free method that encourages accurate and complete text appearance. Specifically, we analyze attention patterns in Multi-Modal Diffusion Transformer (MM-DiT) models, particularly for text-related tokens intended to be rendered in the image. Our method achieves state-of-the-art performance in test-time text rendering, with significant gains in recall and strong results in OCR accuracy and CLIP score. (A toy sketch of the attention-alignment idea follows this entry.)
arXiv Detail & Related papers (2025-12-10T06:18:30Z)
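A hedged illustration of what "attention alignment" for to-be-rendered tokens could look like in code: a test-time loss rewarding image tokens for concentrating attention mass on the prompt tokens that should appear as glyphs, whose gradient then nudges the latents. The loss form and the update rule are assumptions, not TextGuider's published procedure.

```python
# Toy attention-alignment guidance (assumed form, not TextGuider's exact loss).
import torch

def alignment_loss(attn_probs, render_token_ids):
    """attn_probs: (heads, n_image, n_prompt) cross-attention probabilities
    from an MM-DiT block; render_token_ids: prompt positions of the text that
    should appear in the image. Rewards a strong attention peak on them."""
    mass = attn_probs[..., render_token_ids].sum(dim=-1)  # (heads, n_image)
    return -mass.amax(dim=-1).mean()

def guide(latents, attn_probs, render_token_ids, scale=0.1):
    """One test-time guidance step: push latents toward higher alignment."""
    loss = alignment_loss(attn_probs, render_token_ids)
    grad, = torch.autograd.grad(loss, latents)
    return latents - scale * grad

# Runnable toy in which the attention maps depend on the latents.
latents = torch.randn(1, 16, requires_grad=True)
proj = torch.randn(16, 3 * 8 * 5)
attn_probs = (latents @ proj).view(3, 8, 5).softmax(dim=-1)  # (heads, image, prompt)
updated = guide(latents, attn_probs, render_token_ids=[1, 2])
```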
- UniGlyph: Unified Segmentation-Conditioned Diffusion for Precise Visual Text Synthesis [38.658170067715965]
We propose a segmentation-guided framework that uses pixel-level visual text masks as unified conditional inputs. Our approach achieves state-of-the-art performance on the AnyText benchmark. We also introduce two new benchmarks: GlyphMM-benchmark for testing layout and glyph consistency in complex typesetting, and MiniText-benchmark for assessing generation quality in small-scale text regions. (A sketch of one generic mask-conditioning pattern follows this entry.)
arXiv Detail & Related papers (2025-07-01T17:42:19Z)
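As a rough idea of what "pixel-level visual text masks as unified conditional inputs" might mean mechanically, here is one common conditioning pattern: resize the mask to latent resolution and concatenate it as an extra input channel. Whether UniGlyph wires the mask in exactly this way is an assumption.

```python
# Generic mask conditioning for a latent diffusion model, not UniGlyph's
# confirmed architecture: resize, then concatenate as an input channel.
import torch
import torch.nn.functional as F

def concat_mask_condition(latents, glyph_mask):
    """latents: (B, C, h, w) noisy latents; glyph_mask: (B, 1, H, W) binary
    mask marking where text pixels should appear. The denoiser's first conv
    must be widened to accept C + 1 input channels."""
    m = F.interpolate(glyph_mask.float(), size=latents.shape[-2:], mode="nearest")
    return torch.cat([latents, m], dim=1)

x = concat_mask_condition(torch.randn(2, 4, 64, 64), torch.zeros(2, 1, 512, 512))
assert x.shape == (2, 5, 64, 64)
```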
- GlyphMastero: A Glyph Encoder for High-Fidelity Scene Text Editing [23.64662356622401]
We present GlyphMastero, a specialized glyph encoder designed to guide the latent diffusion model for generating text with stroke-level precision. Our method achieves an 18.02% improvement in sentence accuracy over the state-of-the-art scene text editing baseline.
arXiv Detail & Related papers (2025-05-08T03:11:58Z)
- SceneVTG++: Controllable Multilingual Visual Text Generation in the Wild [55.619708995575785]
The text in natural scene images needs to meet the following four key criteria. The generated text can facilitate the training of natural scene OCR (Optical Character Recognition) tasks. The generated images have superior utility in OCR tasks like text detection and text recognition.
arXiv Detail & Related papers (2025-01-06T12:09:08Z)
- TextCenGen: Attention-Guided Text-Centric Background Adaptation for Text-to-Image Generation [21.171612603385405]
We present TextCenGen, a training-free method for dynamic background adaptation in blank regions, enabling text-friendly image generation. Our method analyzes cross-attention maps to identify conflicting objects overlapping with text regions and uses a force-directed graph approach to guide their relocation. Our method is plug-and-play, requiring no additional training while balancing both semantic fidelity and visual quality.
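The force-directed idea is concrete enough to sketch. Below, each conflicting object center (e.g., a cross-attention peak) is pushed out of the reserved text region by a simple repulsive force; the force model (repulsion from the box center, fixed step size) is a toy stand-in, not TextCenGen's published formulation.

```python
# Toy force-directed relocation away from a reserved text region.
import numpy as np

def relocate(centers, text_box, step=0.05, iters=100):
    """centers: (N, 2) object centers in [0, 1]^2; text_box: (x0, y0, x1, y1)
    region reserved for text. Pushes overlapping centers out of the box."""
    centers = centers.copy()
    box_center = np.array([(text_box[0] + text_box[2]) / 2,
                           (text_box[1] + text_box[3]) / 2])
    for _ in range(iters):
        inside = ((centers[:, 0] > text_box[0]) & (centers[:, 0] < text_box[2]) &
                  (centers[:, 1] > text_box[1]) & (centers[:, 1] < text_box[3]))
        if not inside.any():
            break  # no object overlaps the text region anymore
        push = centers[inside] - box_center           # repel from box center
        push /= np.linalg.norm(push, axis=1, keepdims=True) + 1e-8
        centers[inside] = np.clip(centers[inside] + step * push, 0.0, 1.0)
    return centers

print(relocate(np.array([[0.3, 0.2], [0.8, 0.8]]), (0.2, 0.1, 0.6, 0.4)))
```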
arXiv Detail & Related papers (2024-04-18T01:10:24Z) - Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using
Diffusion Models [63.99110667987318]
We present DiffText, a pipeline that seamlessly blends foreground text with the background's intrinsic features.
Even with fewer text instances, the text images we produce consistently surpass other synthetic data in aiding text detectors.
arXiv Detail & Related papers (2023-11-28T06:51:28Z)
- Self-supervised Scene Text Segmentation with Object-centric Layered Representations Augmented by Text Regions [22.090074821554754]
We propose a self-supervised scene text segmentation algorithm with layered decoupling of representations, derived in an object-centric manner, to segment images into text and background.
On several public scene text datasets, our method outperforms the state-of-the-art unsupervised segmentation algorithms.
arXiv Detail & Related papers (2023-08-25T05:00:05Z)
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-based. (A toy rendering of the spatio-textual map appears after this entry.)
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
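A toy rendering of what a spatio-textual map could look like in code: each segment's text embedding is painted into its mask region, giving the model a dense spatial condition alongside the global prompt. The embedding source and channel layout here are assumptions, not SpaText's exact design.

```python
# Toy spatio-textual representation (assumed layout, not SpaText's design).
import torch

def spatio_textual(seg_masks, seg_embeds, height, width):
    """seg_masks: list of (H, W) boolean masks; seg_embeds: list of (D,) text
    embeddings (e.g., from a CLIP text encoder) describing each segment.
    Returns a (D, H, W) map that is zero where nothing is specified."""
    D = seg_embeds[0].numel()
    canvas = torch.zeros(D, height, width)
    for mask, emb in zip(seg_masks, seg_embeds):
        canvas[:, mask] = emb.unsqueeze(1)  # broadcast embedding over the mask
    return canvas

m = torch.zeros(64, 64, dtype=torch.bool)
m[8:24, 8:40] = True  # region where, e.g., "a red boat" should appear
x = spatio_textual([m], [torch.randn(512)], 64, 64)
assert x.shape == (512, 64, 64)
```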
- Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors [58.71128866226768]
Recent text-to-image generation methods have incrementally improved the generated image fidelity and text relevancy.
We propose a novel text-to-image method that addresses these gaps by enabling a simple control mechanism complementary to text in the form of a scene.
Our model achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high fidelity images at a resolution of 512x512 pixels.
arXiv Detail & Related papers (2022-03-24T15:44:50Z)