UniGlyph: Unified Segmentation-Conditioned Diffusion for Precise Visual Text Synthesis
- URL: http://arxiv.org/abs/2507.00992v2
- Date: Wed, 02 Jul 2025 04:08:46 GMT
- Title: UniGlyph: Unified Segmentation-Conditioned Diffusion for Precise Visual Text Synthesis
- Authors: Yuanrui Wang, Cong Han, Yafei Li, Zhipeng Jin, Xiawei Li, SiNan Du, Wen Tao, Yi Yang, Shuanglong Li, Chun Yuan, Liu Lin
- Abstract summary: We propose a segmentation-guided framework that uses pixel-level visual text masks as unified conditional inputs. Our approach achieves state-of-the-art performance on the AnyText benchmark. We also introduce two new benchmarks: GlyphMM-benchmark for testing layout and glyph consistency in complex typesetting, and MiniText-benchmark for assessing generation quality in small-scale text regions.
- Score: 38.658170067715965
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image generation has greatly advanced content creation, yet accurately rendering visual text remains a key challenge due to blurred glyphs, semantic drift, and limited style control. Existing methods often rely on pre-rendered glyph images as conditions, but these struggle to retain original font styles and color cues, necessitating complex multi-branch designs that increase model overhead and reduce flexibility. To address these issues, we propose a segmentation-guided framework that uses pixel-level visual text masks -- rich in glyph shape, color, and spatial detail -- as unified conditional inputs. Our method introduces two core components: (1) a fine-tuned bilingual segmentation model for precise text mask extraction, and (2) a streamlined diffusion model augmented with adaptive glyph conditioning and a region-specific loss to preserve textual fidelity in both content and style. Our approach achieves state-of-the-art performance on the AnyText benchmark, significantly surpassing prior methods in both Chinese and English settings. To enable more rigorous evaluation, we also introduce two new benchmarks: GlyphMM-benchmark for testing layout and glyph consistency in complex typesetting, and MiniText-benchmark for assessing generation quality in small-scale text regions. Experimental results show that our model outperforms existing methods by a large margin in both scenarios, particularly excelling at small text rendering and complex layout preservation, validating its strong generalization and deployment readiness.
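The abstract names two ingredients that are easy to make concrete: feeding the pixel-level visual text mask to the diffusion model as a unified condition, and a region-specific loss that preserves textual fidelity. The paper's exact formulation is not given here, so the PyTorch sketch below is only an assumed illustration: the channel-wise concatenation, the function names, and the `lambda_text` weight are hypothetical stand-ins, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def glyph_conditioned_input(noisy_latents, text_mask):
    """Assumed stand-in for the paper's adaptive glyph conditioning:
    the simplest way to condition a diffusion UNet on a pixel-level
    text mask is channel-wise concatenation with the noisy latents."""
    return torch.cat([noisy_latents, text_mask], dim=1)

def region_specific_loss(noise_pred, noise_target, text_mask, lambda_text=2.0):
    """Assumed form of a region-specific loss: the standard noise-prediction
    MSE, up-weighted inside the text mask so glyph pixels dominate training.
    `lambda_text` is an illustrative weight, not a value from the paper."""
    per_pixel = F.mse_loss(noise_pred, noise_target, reduction="none")
    weights = 1.0 + lambda_text * text_mask  # mask in {0, 1}, broadcasts over channels
    return (weights * per_pixel).mean()
```

Weighting the loss inside the mask (rather than computing a separate text-only loss term) keeps a single objective while still prioritizing glyph regions; the real model may differ.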
Related papers
- Qwen-Image Technical Report [86.46471547116158]
We present Qwen-Image, an image generation foundation model that achieves significant advances in complex text rendering and precise image editing. We design a comprehensive data pipeline that includes large-scale data collection, filtering, annotation, synthesis, and balancing. Qwen-Image performs exceptionally well in alphabetic languages such as English, and also achieves remarkable progress on more challenging logographic languages like Chinese.
arXiv Detail & Related papers (2025-08-04T11:49:20Z) - GlyphMastero: A Glyph Encoder for High-Fidelity Scene Text Editing [23.64662356622401]
We present GlyphMastero, a specialized glyph encoder designed to guide the latent diffusion model for generating texts with stroke-level precision. Our method achieves an 18.02% improvement in sentence accuracy over the state-of-the-art scene text editing baseline.
arXiv Detail & Related papers (2025-05-08T03:11:58Z) - RepText: Rendering Visual Text via Replicating [15.476598851383919]
We present RepText, which aims to empower pre-trained monolingual text-to-image generation models with the ability to accurately render visual text in user-specified fonts. Specifically, we adopt the setting from ControlNet and additionally integrate the language-agnostic glyph and position of the rendered text to enable generation of harmonized visual text. Our approach outperforms existing open-source methods and achieves comparable results to native multi-language closed-source models. (See the glyph-hint rendering sketch after this list for a minimal illustration of this conditioning style.)
arXiv Detail & Related papers (2025-04-28T12:19:53Z) - Beyond Flat Text: Dual Self-inherited Guidance for Visual Text Generation [17.552733309504486]
In real-world images, slanted or curved texts, especially those on cans, banners, or badges, appear as frequently as flat texts due to artistic design or layout constraints. We introduce a new training-free framework, STGen, which accurately generates visual texts in challenging scenarios.
arXiv Detail & Related papers (2025-01-10T11:44:59Z) - TextMastero: Mastering High-Quality Scene Text Editing in Diverse Languages and Styles [12.182588762414058]
Scene text editing aims to modify text in images while keeping the style of the newly generated text similar to the original.
Recent works leverage diffusion models, showing improved results, yet still face challenges.
We present TextMastero, a carefully designed multilingual scene text editing architecture based on latent diffusion models (LDMs).
arXiv Detail & Related papers (2024-08-20T08:06:09Z) - Visual Text Generation in the Wild [67.37458807253064]
We propose a visual text generator (termed SceneVTG) which can produce high-quality text images in the wild.
The proposed SceneVTG significantly outperforms traditional rendering-based methods and recent diffusion-based methods in terms of fidelity and reasonability.
The generated images provide superior utility for tasks involving text detection and text recognition.
arXiv Detail & Related papers (2024-07-19T09:08:20Z) - Layout Agnostic Scene Text Image Synthesis with Diffusion Models [42.37340959594495]
SceneTextGen is a novel diffusion-based model specifically designed to circumvent the need for a predefined layout stage.
The novelty of SceneTextGen lies in its integration of three key components: a character-level encoder for capturing detailed typographic properties, a character-level instance segmentation model, and a word-level spotting model; the latter two address unwanted text generation and minor character inaccuracies.
arXiv Detail & Related papers (2024-06-03T07:20:34Z) - GlyphDiffusion: Text Generation as Image Generation [100.98428068214736]
We propose GlyphDiffusion, a novel diffusion approach for text generation via text-guided image generation.
Our key idea is to render the target text as a glyph image containing visual language content.
Our model also makes significant improvements compared to recent diffusion models.
arXiv Detail & Related papers (2023-04-25T02:14:44Z) - SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-conditional-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z) - SceneComposer: Any-Level Semantic Image Synthesis [80.55876413285587]
We propose a new framework for conditional image synthesis from semantic layouts of any precision levels.
The framework naturally reduces to text-to-image (T2I) at the lowest level with no shape information, and it becomes segmentation-to-image (S2I) at the highest level.
We introduce several novel techniques to address the challenges coming with this new setup.
arXiv Detail & Related papers (2022-11-21T18:59:05Z)
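The glyph-hint sketch referenced in the RepText entry above: RepText reports adopting the ControlNet setting with a language-agnostic glyph-and-position condition, which in practice amounts to rasterizing the requested text at its target location on an otherwise blank canvas. The Pillow snippet below is a minimal sketch under assumptions; the helper name, font path, and (x, y, height) box format are hypothetical, not RepText's code.

```python
from PIL import Image, ImageDraw, ImageFont

def render_glyph_hint(text, box, canvas_size=(512, 512), font_path="DejaVuSans.ttf"):
    """Hypothetical helper: rasterize `text` onto a blank canvas at the
    user-specified position, producing a glyph-and-position hint image that a
    ControlNet-style branch could consume. `font_path` and the (x, y, height)
    box format are illustrative assumptions."""
    hint = Image.new("RGB", canvas_size, "black")
    draw = ImageDraw.Draw(hint)
    x, y, height = box
    font = ImageFont.truetype(font_path, size=height)
    draw.text((x, y), text, font=font, fill="white")
    return hint

# Example: a 64-px-tall "SALE" hint near the top-left of a 512x512 canvas.
hint_image = render_glyph_hint("SALE", box=(40, 60, 64))
```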