Related papers: RepText: Rendering Visual Text via Replicating

RepText: Rendering Visual Text via Replicating

URL: http://arxiv.org/abs/2504.19724v1
Date: Mon, 28 Apr 2025 12:19:53 GMT
Title: RepText: Rendering Visual Text via Replicating
Authors: Haofan Wang, Yujia Xu, Yimeng Li, Junchen Li, Chaowei Zhang, Jing Wang, Kejia Yang, Zhibo Chen,
Abstract summary: We present RepText, which aims to empower pre-trained monolingual text-to-image generation models with the ability to accurately render visual text in user-specified fonts.<n>Specifically, we adopt the setting from ControlNet and additionally integrate language agnostic glyph and position of rendered text to enable generating harmonized visual text.<n>Our approach outperforms existing open-source methods and achieves comparable results to native multi-language closed-source models.
Score: 15.476598851383919
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Although contemporary text-to-image generation models have achieved remarkable breakthroughs in producing visually appealing images, their capacity to generate precise and flexible typographic elements, especially non-Latin alphabets, remains constrained. To address these limitations, we start from an naive assumption that text understanding is only a sufficient condition for text rendering, but not a necessary condition. Based on this, we present RepText, which aims to empower pre-trained monolingual text-to-image generation models with the ability to accurately render, or more precisely, replicate, multilingual visual text in user-specified fonts, without the need to really understand them. Specifically, we adopt the setting from ControlNet and additionally integrate language agnostic glyph and position of rendered text to enable generating harmonized visual text, allowing users to customize text content, font and position on their needs. To improve accuracy, a text perceptual loss is employed along with the diffusion loss. Furthermore, to stabilize rendering process, at the inference phase, we directly initialize with noisy glyph latent instead of random initialization, and adopt region masks to restrict the feature injection to only the text region to avoid distortion of the background. We conducted extensive experiments to verify the effectiveness of our RepText relative to existing works, our approach outperforms existing open-source methods and achieves comparable results to native multi-language closed-source models. To be more fair, we also exhaustively discuss its limitations in the end.

Related papers

UniGlyph: Unified Segmentation-Conditioned Diffusion for Precise Visual Text Synthesis [38.658170067715965]
We propose a segmentation-guided framework that uses pixel-level visual text masks as unified conditional inputs.<n>Our approach achieves state-of-the-art performance on the AnyText benchmark.<n>We also introduce two new benchmarks: GlyphMM-benchmark for testing layout and glyph consistency in complex, and MiniText-benchmark for assessing generation quality in small-scale text regions.
arXiv Detail & Related papers (2025-07-01T17:42:19Z)
Beyond Flat Text: Dual Self-inherited Guidance for Visual Text Generation [17.552733309504486]
In real-world images, slanted or curved texts, especially those on cans, banners, or badges, appear as frequently as flat texts due to artistic design or layout constraints.<n>We introduce a new training-free framework, STGen, which accurately generates visual texts in challenging scenarios.
arXiv Detail & Related papers (2025-01-10T11:44:59Z)
First Creating Backgrounds Then Rendering Texts: A New Paradigm for Visual Text Blending [5.3798706094384725]
We propose a new visual text blending paradigm including both creating backgrounds and rendering texts. Specifically, a background generator is developed to produce high-fidelity and text-free natural images. We also explore several downstream applications based on our method, including scene text dataset synthesis for boosting scene text detectors.
arXiv Detail & Related papers (2024-10-14T05:23:43Z)
TextMaster: A Unified Framework for Realistic Text Editing via Glyph-Style Dual-Control [5.645654441900668]
We propose TextMaster, a solution capable of accurately editing text across various scenarios and image regions.<n>Our method enhances the accuracy and fidelity of text rendering by incorporating high-resolution standard glyph information.<n>We also propose a novel style injection technique that enables controllable style transfer for the injected text.
arXiv Detail & Related papers (2024-10-13T15:39:39Z)
Visual Text Generation in the Wild [67.37458807253064]
We propose a visual text generator (termed SceneVTG) which can produce high-quality text images in the wild. The proposed SceneVTG significantly outperforms traditional rendering-based methods and recent diffusion-based methods in terms of fidelity and reasonability. The generated images provide superior utility for tasks involving text detection and text recognition.
arXiv Detail & Related papers (2024-07-19T09:08:20Z)
Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models. We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning. Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model [31.819060415422353]
Diff-Text is a training-free scene text generation framework for any language. Our method outperforms the existing method in both the accuracy of text recognition and the naturalness of foreground-background blending.
arXiv Detail & Related papers (2023-12-19T15:18:40Z)
Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models [63.99110667987318]
We present DiffText, a pipeline that seamlessly blends foreground text with the background's intrinsic features. With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors.
arXiv Detail & Related papers (2023-11-28T06:51:28Z)
GlyphDiffusion: Text Generation as Image Generation [100.98428068214736]
We propose GlyphDiffusion, a novel diffusion approach for text generation via text-guided image generation. Our key idea is to render the target text as a glyph image containing visual language content. Our model also makes significant improvements compared to the recent diffusion model.
arXiv Detail & Related papers (2023-04-25T02:14:44Z)
GlyphDraw: Seamlessly Rendering Text with Intricate Spatial Structures in Text-to-Image Generation [18.396131717250793]
We introduce GlyphDraw, a general learning framework aiming to endow image generation models with the capacity to generate images coherently embedded with text for any specific language. Our method not only produces accurate language characters as in prompts, but also seamlessly blends the generated text into the background.
arXiv Detail & Related papers (2023-03-31T08:06:33Z)
SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control. In addition to a global text prompt that describes the entire scene, the user provides a segmentation map. We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-conditional-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors [58.71128866226768]
Recent text-to-image generation methods have incrementally improved the generated image fidelity and text relevancy. We propose a novel text-to-image method that addresses these gaps by (i) enabling a simple control mechanism complementary to text in the form of a scene. Our model achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high fidelity images in a resolution of 512x512 pixels.
arXiv Detail & Related papers (2022-03-24T15:44:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.