CharGen: High Accurate Character-Level Visual Text Generation Model with MultiModal Encoder
- URL: http://arxiv.org/abs/2412.17225v1
- Date: Mon, 23 Dec 2024 02:40:07 GMT
- Title: CharGen: High Accurate Character-Level Visual Text Generation Model with MultiModal Encoder
- Authors: Lichen Ma, Tiezhu Yue, Pei Fu, Yujie Zhong, Kai Zhou, Xiaoming Wei, Jie Hu
- Abstract summary: CharGen is a highly accurate character-level visual text generation and editing model. It employs a character-level multimodal encoder that not only extracts character-level text embeddings but also encodes glyph images character by character. CharGen significantly improves text rendering accuracy, outperforming recent methods in public benchmarks such as AnyText-benchmark and MARIO-Eval.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, significant advancements have been made in diffusion-based visual text generation models. Although the effectiveness of these methods in visual text rendering is rapidly improving, they still encounter challenges such as inaccurate characters and strokes when rendering complex visual text. In this paper, we propose CharGen, a highly accurate character-level visual text generation and editing model. Specifically, CharGen employs a character-level multimodal encoder that not only extracts character-level text embeddings but also encodes glyph images character by character. This enables it to capture fine-grained cross-modality features more effectively. Additionally, we introduce a new perceptual loss in CharGen to enhance character shape supervision and address the issue of inaccurate strokes in generated text. It is worth mentioning that CharGen can be integrated into existing diffusion models to generate visual text with high accuracy. CharGen significantly improves text rendering accuracy, outperforming recent methods in public benchmarks such as AnyText-benchmark and MARIO-Eval, with improvements of more than 8% and 6%, respectively. Notably, CharGen achieved a 5.5% increase in accuracy on Chinese test sets.
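As a concreteness aid, here is a minimal PyTorch sketch of a character-level multimodal encoder in the spirit of the abstract: a per-character text embedding and a per-character glyph-image CNN, fused into one feature sequence. All module names, dimensions, and the fusion scheme are assumptions for illustration, not CharGen's actual implementation, and the paper's perceptual loss is not shown.

```python
# Hypothetical sketch of a character-level multimodal encoder (not CharGen's code).
import torch
import torch.nn as nn

class CharLevelMultimodalEncoder(nn.Module):
    def __init__(self, vocab_size=8000, dim=256):
        super().__init__()
        # One embedding per character, rather than per BPE token.
        self.char_embed = nn.Embedding(vocab_size, dim)
        # Small CNN that encodes each rendered glyph image independently.
        self.glyph_cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )
        # Fuse the two per-character modalities into one feature.
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, char_ids, glyph_images):
        # char_ids: (B, L); glyph_images: (B, L, 1, H, W), one glyph per character.
        B, L = char_ids.shape
        text_feat = self.char_embed(char_ids)                    # (B, L, D)
        glyph_feat = self.glyph_cnn(glyph_images.flatten(0, 1))  # (B*L, D)
        glyph_feat = glyph_feat.view(B, L, -1)                   # (B, L, D)
        return self.fuse(torch.cat([text_feat, glyph_feat], dim=-1))

enc = CharLevelMultimodalEncoder()
feats = enc(torch.randint(0, 8000, (2, 6)), torch.randn(2, 6, 1, 32, 32))
print(feats.shape)  # torch.Size([2, 6, 256])
```

The resulting (B, L, D) sequence could then condition a diffusion U-Net via cross-attention, which is how such encoders are typically integrated into existing diffusion models.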
Related papers
- EasyText: Controllable Diffusion Transformer for Multilingual Text Rendering
This paper introduces EasyText, a text rendering framework based on DiT (Diffusion Transformer). We propose character positioning encoding and position encoding techniques to achieve controllable and precise text rendering. We construct a large-scale synthetic text image dataset with 1 million multilingual image-text annotations, as well as a high-quality dataset of 20K annotated images.
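As a rough illustration only, a sinusoidal encoding of 2-D character-box coordinates is one plausible form such a positioning encoding could take; the function below is a generic construction, not EasyText's published scheme.

```python
# Generic sinusoidal encoding of 2-D character positions (illustrative only).
import torch

def sinusoidal_pos_encoding(positions, dim=64):
    # positions: (N, 2) normalized (x, y) coordinates of each character box.
    freqs = torch.exp(torch.arange(dim // 4) * (-4.0 / (dim // 4)))  # (dim/4,)
    angles = positions[..., None] * freqs                   # (N, 2, dim/4)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)   # (N, 2, dim/2)
    return enc.flatten(-2)                                  # (N, dim)

print(sinusoidal_pos_encoding(torch.rand(5, 2)).shape)  # torch.Size([5, 64])
```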
arXiv Detail & Related papers (2025-05-30T09:55:39Z)
- GlyphMastero: A Glyph Encoder for High-Fidelity Scene Text Editing
We present GlyphMastero, a specialized glyph encoder designed to guide the latent diffusion model in generating text with stroke-level precision. Our method achieves an 18.02% improvement in sentence accuracy over the state-of-the-art scene text editing baseline.
arXiv Detail & Related papers (2025-05-08T03:11:58Z)
- PosterMaker: Towards High-Quality Product Poster Generation with Accurate Text Rendering
Product posters, which integrate subject, scene, and text, are crucial promotional tools for attracting customers.
The main challenge lies in accurately rendering text, especially for complex writing systems like Chinese, which contains over 10,000 individual characters.
We develop TextRenderNet, which achieves a high text rendering accuracy of over 90%.
Based on TextRenderNet and SceneGenNet, we present PosterMaker, an end-to-end generation framework.
arXiv Detail & Related papers (2025-04-09T07:13:08Z)
- Type-R: Automatically Retouching Typos for Text-to-Image Generation
We propose to retouch erroneous text renderings in the post-processing pipeline.
Our approach, called Type-R, identifies typographical errors in the generated image, erases the erroneous text, regenerates text boxes for missing words, and finally corrects typos in the rendered words.
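A toy sketch of such a retouching loop, with the OCR, erasing, and rendering stages stubbed out; none of these helpers correspond to Type-R's actual components.

```python
# Schematic Type-R-style retouching loop with stubbed-out stages.
from difflib import get_close_matches

def detect_text(image):
    # Stub OCR: pretend one misspelled word was found with its bounding box.
    return [((10, 10, 120, 40), "Helo")]

def erase_and_render(image, box, word):
    # Stub: a real system would inpaint `box`, then re-render `word` there.
    print(f"re-rendering {word!r} in box {box}")
    return image

def retouch_typos(image, prompt_words):
    rendered = set()
    for box, word in detect_text(image):
        if word in prompt_words:
            rendered.add(word)
            continue
        match = get_close_matches(word, prompt_words, n=1)  # fix the typo
        if match:
            image = erase_and_render(image, box, match[0])
            rendered.add(match[0])
    for word in set(prompt_words) - rendered:  # regenerate missing words
        image = erase_and_render(image, (0, 0, 100, 30), word)
    return image

retouch_typos(image=None, prompt_words=["Hello", "World"])
```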
arXiv Detail & Related papers (2024-11-27T09:11:45Z)
- FastTextSpotter: A High-Efficiency Transformer for Multilingual Scene Text Spotting
This paper presents FastTextSpotter, a framework that integrates a Swin Transformer visual backbone with a Transformer-Decoder architecture.
FastTextSpotter has been validated across multiple datasets, including ICDAR2015 for regular texts and CTW1500 and TotalText for arbitrary-shaped texts.
Our results indicate that FastTextSpotter achieves superior accuracy in detecting and recognizing multilingual scene text.
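To make the architecture shape concrete, here is a minimal pairing of a torchvision Swin backbone with a standard Transformer decoder over learned instance queries. The layer sizes, the number of queries, and the single-scale feature use are assumptions; the paper's detection and recognition heads are omitted.

```python
# Sketch: Swin visual backbone feeding a Transformer decoder (illustrative).
import torch
import torch.nn as nn
from torchvision.models import swin_t

backbone = swin_t(weights=None).features  # channels-last features (B, H', W', 768)
proj = nn.Linear(768, 256)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=6,
)
queries = nn.Parameter(torch.randn(1, 100, 256))  # learned text-instance queries

x = torch.randn(2, 3, 224, 224)
fmap = backbone(x)                                # (2, 7, 7, 768)
memory = proj(fmap.flatten(1, 2))                 # (2, 49, 256)
out = decoder(queries.expand(2, -1, -1), memory)  # (2, 100, 256)
print(out.shape)  # one feature per query, to be decoded into boxes and strings
```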
arXiv Detail & Related papers (2024-08-27T12:28:41Z)
- Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering
Visual text rendering poses a fundamental challenge for text-to-image generation models.
We craft a series of customized text encoders, Glyph-ByT5, by fine-tuning the character-aware ByT5 encoder.
We present an effective method for integrating Glyph-ByT5 with SDXL, resulting in the creation of the Glyph-SDXL model for design image generation.
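For orientation, getting character-aware (byte-level) features out of an off-the-shelf ByT5 encoder takes only a few lines with Hugging Face Transformers; the glyph-alignment fine-tuning that turns it into Glyph-ByT5 is not shown here.

```python
# Byte-level text features from ByT5: every character gets its own position.
from transformers import AutoTokenizer, T5EncoderModel

tok = AutoTokenizer.from_pretrained("google/byt5-small")
enc = T5EncoderModel.from_pretrained("google/byt5-small")

batch = tok(["Render the word GLYPH"], return_tensors="pt")
features = enc(**batch).last_hidden_state  # (1, seq_len, hidden)
print(features.shape)
```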
arXiv Detail & Related papers (2024-03-14T17:55:33Z)
- Enhancing Diffusion Models with Text-Encoder Reinforcement Learning
Text-to-image diffusion models are typically trained to optimize the log-likelihood objective.
Recent research addresses this issue by refining the diffusion U-Net using human rewards through reinforcement learning or direct backpropagation.
We demonstrate that by finetuning the text encoder through reinforcement learning, we can enhance the text-image alignment of the results.
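One schematic way to picture this is a Gaussian-policy REINFORCE step on the text-encoder output, with the frozen sampler and the reward model passed in as black boxes. This is a simplified stand-in, not the paper's exact algorithm.

```python
# Schematic REINFORCE update on a text encoder (simplified illustration).
import torch

def rl_step(text_encoder, generate_image, reward_model, prompts, optimizer, sigma=0.1):
    emb = text_encoder(prompts)  # (B, L, D) differentiable text features
    # Sample a perturbed embedding as the "action"; detach so only the
    # Gaussian log-probability carries gradient back to the encoder.
    action = (emb + sigma * torch.randn_like(emb)).detach()
    log_prob = -0.5 * ((action - emb) / sigma).pow(2).sum(dim=(1, 2))
    with torch.no_grad():
        images = generate_image(action)          # frozen diffusion sampler
        rewards = reward_model(images, prompts)  # (B,) preference scores
    loss = -(rewards * log_prob).mean()          # policy-gradient objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```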
arXiv Detail & Related papers (2023-11-27T09:39:45Z)
- AnyText: Multilingual Visual Text Generation And Editing
We introduce AnyText, a diffusion-based multilingual visual text generation and editing model.
AnyText can write characters in multiple languages; to the best of our knowledge, this is the first work to address multilingual visual text generation.
We contribute the first large-scale multilingual text image dataset, AnyWord-3M, containing 3 million image-text pairs with OCR annotations in multiple languages.
arXiv Detail & Related papers (2023-11-06T12:10:43Z)
- TextDiffuser: Diffusion Models as Text Painters
We introduce TextDiffuser, focusing on generating images with visually appealing text that is coherent with backgrounds.
We contribute the first large-scale text image dataset with OCR annotations, MARIO-10M, containing 10 million image-text pairs.
We show that TextDiffuser is flexible and controllable, creating high-quality text images from text prompts alone or together with text template images, and performing text inpainting to reconstruct incomplete images with text.
arXiv Detail & Related papers (2023-05-18T10:16:19Z)
- GlyphDiffusion: Text Generation as Image Generation
We propose GlyphDiffusion, a novel diffusion approach for text generation via text-guided image generation.
Our key idea is to render the target text as a glyph image containing visual language content.
Our model also achieves significant improvements over a recent diffusion baseline.
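The core idea, rendering the target text as a glyph image, is straightforward to sketch with Pillow; the font and canvas size below are arbitrary illustration choices.

```python
# Render target text as a grayscale glyph image (illustrative).
from PIL import Image, ImageDraw, ImageFont

def render_glyph_image(text, size=(256, 64)):
    img = Image.new("L", size, color=255)       # white canvas
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()             # swap in a TTF font for real use
    draw.text((8, 8), text, fill=0, font=font)  # black glyphs
    return img

render_glyph_image("visual text").save("glyph.png")
```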
arXiv Detail & Related papers (2023-04-25T02:14:44Z)
- Character-Aware Models Improve Visual Text Rendering
Current image generation models struggle to reliably produce well-formed visual text.
Character-aware models provide large gains on a novel spelling task.
Our models set a much higher state-of-the-art on visual spelling, with 30+ point accuracy gains over competitors on rare words.
arXiv Detail & Related papers (2022-12-20T18:59:23Z)
- GENIUS: Sketch-based Language Model Pre-training via Extreme and Selective Masking for Text Generation and Augmentation
We introduce GENIUS: a conditional text generation model using sketches as input.
GENIUS is pre-trained on a large-scale textual corpus with a novel reconstruction-from-sketch objective.
We show that GENIUS can be used as a strong and ready-to-use data augmentation tool for various natural language processing (NLP) tasks.
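As a toy stand-in for sketch construction via extreme masking, the snippet below keeps only a few "informative" words and collapses everything else into mask spans. Word length is used here as a crude informativeness proxy; GENIUS builds its sketches with keyword extraction, so this heuristic is purely illustrative.

```python
# Toy "extreme masking" sketch builder (heuristic stand-in, not GENIUS's method).
def make_sketch(text, keep=2, mask_token="<mask>"):
    words = text.split()
    keywords = set(sorted(words, key=len, reverse=True)[:keep])
    sketch, masked = [], False
    for w in words:
        if w in keywords:
            sketch.append(w)
            masked = False
        elif not masked:  # collapse consecutive masked words into one token
            sketch.append(mask_token)
            masked = True
    return " ".join(sketch)

print(make_sketch("the quick brown fox jumps over the lazy dog"))
# "<mask> quick brown <mask>"
```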
arXiv Detail & Related papers (2022-11-18T16:39:45Z)