ControlText: Unlocking Controllable Fonts in Multilingual Text Rendering without Font Annotations
- URL: http://arxiv.org/abs/2502.10999v1
- Date: Sun, 16 Feb 2025 05:30:18 GMT
- Title: ControlText: Unlocking Controllable Fonts in Multilingual Text Rendering without Font Annotations
- Authors: Bowen Jiang, Yuan Yuan, Xinyi Bai, Zhuoqun Hao, Alyson Yin, Yaojie Hu, Wenyu Liao, Lyle Ungar, Camillo J. Taylor
- Abstract summary: This work demonstrates that diffusion models can achieve font-controllable multilingual text rendering using just raw images without font label annotations.
The experiments provide a proof of concept of our algorithm in zero-shot text and font editing across diverse fonts and languages.
- Score: 8.588945675550592
- Abstract: This work demonstrates that diffusion models can achieve font-controllable multilingual text rendering using just raw images without font label annotations. Visual text rendering remains a significant challenge. While recent methods condition diffusion models on glyphs, it is impossible to retrieve exact font annotations from large-scale, real-world datasets, which prevents user-specified font control. To address this, we propose a data-driven solution that integrates a conditional diffusion model with a text segmentation model, using segmentation masks to capture and represent fonts in pixel space in a self-supervised manner, thereby eliminating the need for any ground-truth labels and enabling users to customize text rendering with any multilingual font of their choice. The experiments provide a proof of concept of our algorithm in zero-shot text and font editing across diverse fonts and languages, offering valuable insights to the community and industry toward generalized visual text rendering.
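As a rough illustration of the approach described in the abstract, the sketch below pairs a frozen text segmentation model with a glyph-conditioned denoiser, so that the mask extracted from each raw training image serves as its own font supervision. All module names (TextSegmenter, GlyphControlNet) are hypothetical stand-ins rather than the authors' released code, and the diffusion forward process is deliberately simplified.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextSegmenter(nn.Module):
    """Hypothetical frozen text segmentation model: image -> per-pixel text mask."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 1, kernel_size=3, padding=1)  # stand-in backbone

    @torch.no_grad()
    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(image))  # soft mask in [0, 1]

class GlyphControlNet(nn.Module):
    """Hypothetical denoiser conditioned on a glyph/font mask (ControlNet-style)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3 + 1, 3, kernel_size=3, padding=1)

    def forward(self, noisy_image, mask, t):
        # t is ignored in this toy stand-in; a real denoiser embeds the timestep
        return self.net(torch.cat([noisy_image, mask], dim=1))

def training_step(image, segmenter, denoiser, optimizer):
    """One self-supervised step: the mask extracted from the image itself
    serves as the font condition, so no font labels are needed."""
    mask = segmenter(image)                       # pixel-space font representation
    t = torch.rand(image.size(0), device=image.device)
    noise = torch.randn_like(image)
    noisy = image + t.view(-1, 1, 1, 1) * noise   # simplified forward process
    pred = denoiser(noisy, mask, t)
    loss = F.mse_loss(pred, noise)                # standard denoising objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    segmenter, denoiser = TextSegmenter(), GlyphControlNet()
    opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)
    batch = torch.rand(2, 3, 64, 64)  # raw images, no font labels anywhere
    print(training_step(batch, segmenter, denoiser, opt))
```

The point of the sketch is the data flow: because the conditioning mask is computed from the image itself, no font label ever enters the loop, which is what lets the method train on large-scale raw images.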
Related papers
- First Creating Backgrounds Then Rendering Texts: A New Paradigm for Visual Text Blending [5.3798706094384725]
We propose a new visual text blending paradigm that includes both creating backgrounds and rendering texts.
Specifically, a background generator is developed to produce high-fidelity and text-free natural images.
We also explore several downstream applications based on our method, including scene text dataset synthesis for boosting scene text detectors.
arXiv Detail & Related papers (2024-10-14T05:23:43Z)
- JoyType: A Robust Design for Multilingual Visual Text Creation [14.441897362967344]
We introduce a novel approach for multilingual visual text creation, named JoyType.
JoyType is designed to maintain the font style of text during the image generation process.
Our evaluations, based on both visual and accuracy metrics, demonstrate that JoyType significantly outperforms existing state-of-the-art methods.
arXiv Detail & Related papers (2024-09-26T04:23:17Z)
- Typographic Text Generation with Off-the-Shelf Diffusion Model [7.542892664684078]
This paper proposes a typographic text generation system to add and modify text on typographic designs.
The proposed system is a novel combination of two off-the-shelf diffusion methods, ControlNet and Blended Latent Diffusion, as sketched below.
arXiv Detail & Related papers (2024-02-22T06:15:51Z)
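The combination named above can be pictured as a ControlNet-guided denoiser whose output is blended, outside the edit mask, with the original design's latents noised to the current step, which is the core move of Blended Latent Diffusion. The loop below is a hedged sketch with stand-in callables, not the paper's implementation.

```python
import torch

def blended_edit(latents_orig, mask, denoise_step, noise_to_t, num_steps=50):
    """One masked-blending loop in latent space (Blended Latent Diffusion style).

    latents_orig: latents of the source design, shape (B, C, H, W)
    mask:         1 inside the region where new text is generated, else 0
    denoise_step: stand-in for a ControlNet-conditioned denoiser update
    noise_to_t:   stand-in that noises the source latents to step t
    """
    x = torch.randn_like(latents_orig)  # start the edited region from noise
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t)                    # denoise everything
        background = noise_to_t(latents_orig, t)  # source, noised to level t
        x = mask * x + (1 - mask) * background    # keep the background fixed
    return x

# Toy usage with trivial stand-ins:
B, C, H, W = 1, 4, 32, 32
latents = torch.randn(B, C, H, W)
mask = torch.zeros(B, 1, H, W)
mask[..., 8:24, 8:24] = 1.0  # region where text is added or modified
out = blended_edit(latents, mask,
                   denoise_step=lambda x, t: 0.98 * x,
                   noise_to_t=lambda z, t: z + 0.01 * t * torch.randn_like(z))
```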
- TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering [118.30923824681642]
TextDiffuser-2 aims to unleash the power of language models for text rendering.
We utilize the language model within the diffusion model to encode the position and texts at the line level.
We conduct extensive experiments and incorporate user studies involving human participants as well as GPT-4V.
arXiv Detail & Related papers (2023-11-28T04:02:40Z)
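One way to read "encode the position and texts at the line level" is that each text line is serialized together with coordinate tokens before being passed to the language model. The token format below is an assumed illustration, not TextDiffuser-2's actual tokenizer.

```python
def encode_layout(lines):
    """Serialize text lines with their bounding boxes into one prompt string.

    lines: list of (text, (x0, y0, x1, y1)) with coordinates in pixels.
    The coordinate-token format here is a hypothetical stand-in.
    """
    parts = []
    for text, (x0, y0, x1, y1) in lines:
        parts.append(f"<x{x0}><y{y0}><x{x1}><y{y1}> {text}")
    return " [SEP] ".join(parts)

# e.g. two lines of a poster caption:
prompt = encode_layout([("GRAND OPENING", (32, 40, 480, 96)),
                        ("this weekend only", (96, 110, 416, 150))])
print(prompt)
```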
- GlyphDiffusion: Text Generation as Image Generation [100.98428068214736]
We propose GlyphDiffusion, a novel diffusion approach for text generation via text-guided image generation.
Our key idea is to render the target text as a glyph image containing visual language content.
Our model also makes significant improvements compared to recent diffusion models.
arXiv Detail & Related papers (2023-04-25T02:14:44Z)
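Rendering the target text as a glyph image, as the key idea above describes, only requires an ordinary rasterizer. The snippet below uses Pillow's default font as a minimal stand-in for whatever rendering pipeline the paper uses.

```python
from PIL import Image, ImageDraw, ImageFont

def render_glyph_image(text: str, size=(256, 64)) -> Image.Image:
    """Rasterize the target text onto a white canvas as the visual-language input."""
    canvas = Image.new("L", size, color=255)  # grayscale, white background
    draw = ImageDraw.Draw(canvas)
    font = ImageFont.load_default()           # stand-in for a chosen font
    draw.text((8, size[1] // 4), text, fill=0, font=font)
    return canvas

glyphs = render_glyph_image("hello world")
glyphs.save("glyph_condition.png")  # hypothetical output path
```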
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
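The segmentation-map conditioning above can be sketched as a spatial grid of per-region text embeddings, where every pixel of a segment carries the embedding of that segment's local description. The embeddings below are random stand-ins (SpaText derives its own from CLIP).

```python
import torch

def spatio_textual_map(masks, text_embeds, H, W):
    """Build a (D, H, W) conditioning map from region masks and text embeddings.

    masks:       list of (H, W) boolean tensors, one per user-drawn segment
    text_embeds: list of (D,) embeddings of each segment's local description
    """
    D = text_embeds[0].shape[0]
    grid = torch.zeros(D, H, W)
    for mask, emb in zip(masks, text_embeds):
        grid[:, mask] = emb.unsqueeze(1)  # broadcast embedding over the region
    return grid  # concatenated with noisy latents as extra input channels

# Toy usage: two segments with random stand-in embeddings.
H, W, D = 64, 64, 16
m1 = torch.zeros(H, W, dtype=torch.bool); m1[10:30, 10:30] = True
m2 = torch.zeros(H, W, dtype=torch.bool); m2[40:60, 20:50] = True
cond = spatio_textual_map([m1, m2], [torch.randn(D), torch.randn(D)], H, W)
print(cond.shape)  # torch.Size([16, 64, 64])
```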
- Few-Shot Font Generation by Learning Fine-Grained Local Styles [90.39288370855115]
Few-shot font generation (FFG) aims to generate a new font with a few examples.
We propose a new font generation approach by learning 1) the fine-grained local styles from references, and 2) the spatial correspondence between the content and reference glyphs, as sketched below.
arXiv Detail & Related papers (2022-05-20T05:07:05Z)
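A common way to realize point 2), the spatial correspondence between content and reference glyphs, is cross-attention from content-glyph features to reference-glyph features; the toy module below illustrates that pattern and is not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class StyleCorrespondence(nn.Module):
    """Content features attend to reference-glyph features to pick up
    fine-grained local styles at spatially corresponding positions."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, content_feats, ref_feats):
        # content_feats: (B, N, D) flattened content-glyph features (queries)
        # ref_feats:     (B, M, D) flattened reference-glyph features (keys/values)
        styled, _ = self.attn(content_feats, ref_feats, ref_feats)
        return styled

# Toy usage: one content glyph attending to three reference glyphs.
module = StyleCorrespondence()
out = module(torch.randn(2, 256, 64), torch.randn(2, 3 * 256, 64))
print(out.shape)  # torch.Size([2, 256, 64])
```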
- A Multi-Implicit Neural Representation for Fonts [79.6123184198301]
Font-specific discontinuities like edges and corners are difficult to represent using neural networks.
We introduce multi-implicits to represent fonts as a permutation-invariant set of learned implicit functions, without losing features.
arXiv Detail & Related papers (2021-06-12T21:40:11Z)
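The multi-implicit idea can be pictured as a set of small coordinate networks, each a signed-distance-like implicit function, combined by a permutation-invariant reduction (a minimum here) so that sharp corners emerge where the component surfaces meet. The module below is a toy illustration under those assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

class MultiImplicitGlyph(nn.Module):
    """A glyph as a permutation-invariant set of implicit functions:
    each MLP maps (x, y) to a signed-distance-like value, and the set is
    reduced with a min so sharp corners appear where the surfaces meet."""
    def __init__(self, num_implicits=4, hidden=32):
        super().__init__()
        self.implicits = nn.ModuleList(
            nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                          nn.Linear(hidden, 1))
            for _ in range(num_implicits)
        )

    def forward(self, coords):
        # coords: (N, 2) query points; returns (N,) distance-like values
        values = torch.cat([f(coords) for f in self.implicits], dim=-1)
        return values.min(dim=-1).values  # order-independent reduction

# Toy usage: query the field on random points.
glyph = MultiImplicitGlyph()
print(glyph(torch.rand(8, 2)).shape)  # torch.Size([8])
```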
- Exploring Font-independent Features for Scene Text Recognition [22.34023249700896]
Scene text recognition (STR) has been extensively studied in the last few years.
Many recently-proposed methods are specially designed to accommodate the arbitrary shape, layout and orientation of scene texts.
These methods, in which font and content features of characters are entangled, perform poorly on scene images with text in novel font styles.
arXiv Detail & Related papers (2020-09-16T03:36:59Z)
- Let Me Choose: From Verbal Context to Font Selection [50.293897197235296]
We aim to learn associations between visual attributes of fonts and the verbal context of the texts they are typically applied to.
We introduce a new dataset, containing examples of different topics in social media posts and ads, labeled through crowd-sourcing.
arXiv Detail & Related papers (2020-05-03T17:36:17Z)