Related papers: Visual Text Generation in the Wild

Visual Text Generation in the Wild

URL: http://arxiv.org/abs/2407.14138v2
Date: Sun, 3 Nov 2024 08:12:57 GMT
Title: Visual Text Generation in the Wild
Authors: Yuanzhi Zhu, Jiawei Liu, Feiyu Gao, Wenyu Liu, Xinggang Wang, Peng Wang, Fei Huang, Cong Yao, Zhibo Yang,
Abstract summary: We propose a visual text generator (termed SceneVTG) which can produce high-quality text images in the wild. The proposed SceneVTG significantly outperforms traditional rendering-based methods and recent diffusion-based methods in terms of fidelity and reasonability. The generated images provide superior utility for tasks involving text detection and text recognition.
Score: 67.37458807253064
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, with the rapid advancements of generative models, the field of visual text generation has witnessed significant progress. However, it is still challenging to render high-quality text images in real-world scenarios, as three critical criteria should be satisfied: (1) Fidelity: the generated text images should be photo-realistic and the contents are expected to be the same as specified in the given conditions; (2) Reasonability: the regions and contents of the generated text should cohere with the scene; (3) Utility: the generated text images can facilitate related tasks (e.g., text detection and recognition). Upon investigation, we find that existing methods, either rendering-based or diffusion-based, can hardly meet all these aspects simultaneously, limiting their application range. Therefore, we propose in this paper a visual text generator (termed SceneVTG), which can produce high-quality text images in the wild. Following a two-stage paradigm, SceneVTG leverages a Multimodal Large Language Model to recommend reasonable text regions and contents across multiple scales and levels, which are used by a conditional diffusion model as conditions to generate text images. Extensive experiments demonstrate that the proposed SceneVTG significantly outperforms traditional rendering-based methods and recent diffusion-based methods in terms of fidelity and reasonability. Besides, the generated images provide superior utility for tasks involving text detection and text recognition. Code and datasets are available at AdvancedLiterateMachinery.

Related papers

EDITOR: Effective and Interpretable Prompt Inversion for Text-to-Image Diffusion Models [31.31018600797305]
We propose a prompt inversion technique called sys for text-to-image diffusion models.<n>Our method outperforms existing methods in terms of image similarity, textual alignment, prompt interpretability and generalizability.
arXiv Detail & Related papers (2025-06-03T16:44:15Z)
Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models [76.68654868991517]
Long-form text in images, such as paragraphs in slides or documents, remains a major challenge for current generative models. We introduce a novel text-focused, binary tokenizer optimized for capturing detailed scene text features. We develop ModelName, a multimodal autoregressive model that excels in generating high-quality long-text images with unprecedented fidelity.
arXiv Detail & Related papers (2025-03-26T03:44:25Z)
SceneVTG++: Controllable Multilingual Visual Text Generation in the Wild [55.619708995575785]
The text in natural scene images needs to meet the following four key criteria. The generated text can facilitate to the training of natural scene OCR (Optical Character Recognition) tasks. The generated images have superior utility in OCR tasks like text detection and text recognition.
arXiv Detail & Related papers (2025-01-06T12:09:08Z)
Text Image Generation for Low-Resource Languages with Dual Translation Learning [0.0]
This study proposes a novel approach that generates text images in low-resource languages by emulating the style of real text images from high-resource languages. The training of this model involves dual translation tasks, where it transforms plain text images into either synthetic or real text images. To enhance the accuracy and variety of generated text images, we introduce two guidance techniques.
arXiv Detail & Related papers (2024-09-26T11:23:59Z)
Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models [63.99110667987318]
We present DiffText, a pipeline that seamlessly blends foreground text with the background's intrinsic features. With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors.
arXiv Detail & Related papers (2023-11-28T06:51:28Z)
Conditional Text Image Generation with Diffusion Models [18.017541111064602]
We propose a method called Text Image Generation with Conditional Models (CTIG-DM) Four text image generation modes, namely: synthesis mode, augmentation mode, recovery mode, and imitation mode, can be derived by combining and configuring these three conditions. CTIG-DM is able to produce image samples that simulate real-world complexity and diversity, and thus can boost the performance of existing text recognizers.
arXiv Detail & Related papers (2023-06-19T09:44:43Z)
GlyphDiffusion: Text Generation as Image Generation [100.98428068214736]
We propose GlyphDiffusion, a novel diffusion approach for text generation via text-guided image generation. Our key idea is to render the target text as a glyph image containing visual language content. Our model also makes significant improvements compared to the recent diffusion model.
arXiv Detail & Related papers (2023-04-25T02:14:44Z)
Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences. To be more specific, both input texts and images are encoded into one unified multi-modal latent space. Our method is able to generate high-quality images with complex semantics from both aspects of input texts and images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control. In addition to a global text prompt that describes the entire scene, the user provides a segmentation map. We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-conditional-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors [58.71128866226768]
Recent text-to-image generation methods have incrementally improved the generated image fidelity and text relevancy. We propose a novel text-to-image method that addresses these gaps by (i) enabling a simple control mechanism complementary to text in the form of a scene. Our model achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high fidelity images in a resolution of 512x512 pixels.
arXiv Detail & Related papers (2022-03-24T15:44:50Z)
Text to Image Generation with Semantic-Spatial Aware GAN [41.73685713621705]
A text to image generation (T2I) model aims to generate photo-realistic images which are semantically consistent with the text descriptions. We propose a novel framework Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information.
arXiv Detail & Related papers (2021-04-01T15:48:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.