Glyph-ByT5-v2: A Strong Aesthetic Baseline for Accurate Multilingual Visual Text Rendering
- URL: http://arxiv.org/abs/2406.10208v2
- Date: Fri, 12 Jul 2024 16:26:40 GMT
- Title: Glyph-ByT5-v2: A Strong Aesthetic Baseline for Accurate Multilingual Visual Text Rendering
- Authors: Zeyu Liu, Weicong Liang, Yiming Zhao, Bohan Chen, Lin Liang, Lijuan Wang, Ji Li, Yuhui Yuan
- Abstract summary: Glyph-ByT5 has achieved highly accurate visual text rendering performance in graphic design images.
However, it still focuses solely on English and performs relatively poorly in terms of visual appeal.
We present Glyph-ByT5-v2 and Glyph-SDXL-v2, which support accurate visual text rendering for 10 different languages.
- Score: 46.259028433965796
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, Glyph-ByT5 has achieved highly accurate visual text rendering performance in graphic design images. However, it still focuses solely on English and performs relatively poorly in terms of visual appeal. In this work, we address these two fundamental limitations by presenting Glyph-ByT5-v2 and Glyph-SDXL-v2, which not only support accurate visual text rendering for 10 different languages but also achieve much better aesthetic quality. To achieve this, we make the following contributions: (i) creating a high-quality multilingual glyph-text and graphic design dataset consisting of more than 1 million glyph-text pairs and 10 million graphic design image-text pairs covering nine other languages, (ii) building a multilingual visual paragraph benchmark consisting of 1,000 prompts, with 100 for each language, to assess multilingual visual spelling accuracy, and (iii) leveraging the latest step-aware preference learning approach to enhance the visual aesthetic quality. With the combination of these techniques, we deliver a powerful customized multilingual text encoder, Glyph-ByT5-v2, and a strong aesthetic graphic generation model, Glyph-SDXL-v2, that can support accurate spelling in 10 different languages. We perceive our work as a significant advancement, considering that the latest DALL-E3 and Ideogram 1.0 still struggle with the multilingual visual text rendering task.
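Contribution (ii) above describes a 1,000-prompt benchmark scored for multilingual visual spelling accuracy. The following is a minimal sketch, not the authors' released evaluation code, of how such scoring could be wired up: the prompt-file schema, the `render_fn` hook, the language-code mapping, and the use of pytesseract for OCR are all illustrative assumptions.

```python
import json
from difflib import SequenceMatcher

from PIL import Image
import pytesseract  # assumes the Tesseract binary and its language packs are installed

# Hypothetical mapping from benchmark language codes to Tesseract language models;
# the ten languages match the abstract, but the codes themselves are illustrative.
TESS_LANG = {"en": "eng", "zh": "chi_sim", "ja": "jpn", "ko": "kor", "fr": "fra",
             "de": "deu", "es": "spa", "it": "ita", "pt": "por", "ru": "rus"}


def spelling_accuracy(target_text: str, image: Image.Image, lang: str) -> float:
    """Fraction of the target characters recovered by OCR from the rendered image."""
    ocr_text = pytesseract.image_to_string(image, lang=TESS_LANG[lang])
    matcher = SequenceMatcher(None, target_text, ocr_text)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(target_text), 1)


def evaluate(prompt_file: str, render_fn) -> dict:
    """Average per-language accuracy. `render_fn(prompt) -> PIL.Image` stands in for
    any text-to-image model, e.g. a Glyph-SDXL-v2-style pipeline."""
    with open(prompt_file, encoding="utf-8") as f:
        entries = json.load(f)  # assumed schema: [{"lang": ..., "prompt": ..., "text": ...}, ...]
    per_lang: dict[str, list[float]] = {}
    for entry in entries:
        image = render_fn(entry["prompt"])
        score = spelling_accuracy(entry["text"], image, entry["lang"])
        per_lang.setdefault(entry["lang"], []).append(score)
    return {lang: sum(scores) / len(scores) for lang, scores in per_lang.items()}
```

Character-level matching via SequenceMatcher is a deliberately simple stand-in; an evaluation faithful to the paper would use a stronger multilingual OCR model and word-level spelling accuracy.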
Related papers
- Qwen-Image Technical Report [86.46471547116158]
We present Qwen-Image, an image generation foundation model that achieves significant advances in complex text rendering and precise image editing. We design a comprehensive data pipeline that includes large-scale data collection, filtering, annotation, synthesis, and balancing. Qwen-Image performs exceptionally well in alphabetic languages such as English, and also achieves remarkable progress on more challenging logographic languages like Chinese.
arXiv Detail & Related papers (2025-08-04T11:49:20Z) - EasyText: Controllable Diffusion Transformer for Multilingual Text Rendering [9.087419148444225]
This paper introduces EasyText, a text rendering framework based on DiT (Diffusion Transformer). We propose character positioning encoding and position encoding techniques to achieve controllable and precise text rendering. We construct a large-scale synthetic text image dataset with 1 million multilingual image-text annotations as well as a high-quality dataset of 20K annotated images.
arXiv Detail & Related papers (2025-05-30T09:55:39Z) - HDGlyph: A Hierarchical Disentangled Glyph-Based Framework for Long-Tail Text Rendering in Diffusion Models [20.543157470365315]
HDGlyph is a novel framework that hierarchically decouples text generation from non-text visual synthesis. Our model consistently outperforms others, with 5.08% and 11.7% accuracy gains in English and Chinese text rendering.
arXiv Detail & Related papers (2025-05-10T07:05:43Z) - Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model [69.09404597939744]
Seedream 2.0 is a native Chinese-English bilingual image generation foundation model.
It adeptly manages text prompts in both Chinese and English, supporting bilingual image generation and text rendering.
It is integrated with a self-developed bilingual large language model as a text encoder, allowing it to learn native knowledge directly from massive data.
arXiv Detail & Related papers (2025-03-10T17:58:33Z) - Visual Lexicon: Rich Image Features in Language Space [99.94214846451347]
ViLex simultaneously captures rich semantic content and fine visual details.
ViLex generates tokens optimized for reconstructing input images using a frozen text-to-image (T2I) diffusion model.
As an image embedding in the language space, ViLex tokens leverage the compositionality of natural languages.
arXiv Detail & Related papers (2024-12-09T18:57:24Z) - AnyText2: Visual Text Generation and Editing With Customizable Attributes [10.24874245687826]
This paper introduces AnyText2, a novel method that enables precise control over multilingual text attributes in natural scene image generation and editing.
Compared to its predecessor, AnyText, our new approach not only enhances image realism but also achieves a 19.8% increase in inference speed.
As an extension of AnyText, this method allows for customization of attributes for each line of text, leading to improvements of 3.3% and 9.3% in text accuracy for Chinese and English, respectively.
arXiv Detail & Related papers (2024-11-22T03:31:56Z) - Towards Visual Text Design Transfer Across Languages [49.78504488452978]
We introduce the novel task of multimodal style translation, together with MuST-Bench.
MuST-Bench is a benchmark designed to evaluate the ability of visual text generation models to perform translation across different writing systems.
In response, we introduce SIGIL, a framework for multimodal style translation that eliminates the need for style descriptions.
arXiv Detail & Related papers (2024-10-24T15:15:01Z) - Playground v3: Improving Text-to-Image Alignment with Deep-Fusion Large Language Models [38.52953013858373]
We introduce Playground v3 (PGv3), our latest text-to-image model.
It achieves state-of-the-art (SoTA) performance across multiple testing benchmarks.
It excels in text prompt adherence, complex reasoning, and accurate text rendering.
arXiv Detail & Related papers (2024-09-16T19:52:24Z) - StrucTexTv3: An Efficient Vision-Language Model for Text-rich Image Perception, Comprehension, and Beyond [68.0107158115377]
We have crafted an efficient vision-language model, StrucTexTv3, tailored to tackle various intelligent tasks for text-rich images.
We enhance the perception and comprehension abilities of StrucTexTv3 through instruction learning.
Our method achieves SOTA results in text-rich image perception tasks and significantly improves performance in comprehension tasks.
arXiv Detail & Related papers (2024-05-31T16:55:04Z) - Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering [59.088036977605405]
Visual text rendering poses a fundamental challenge for text-to-image generation models.
We craft a series of customized text encoders, Glyph-ByT5, by fine-tuning the character-aware ByT5 encoder.
We present an effective method for integrating Glyph-ByT5 with SDXL, resulting in the creation of the Glyph-SDXL model for design image generation.
arXiv Detail & Related papers (2024-03-14T17:55:33Z) - Paragraph-to-Image Generation with Information-Enriched Diffusion Model [67.9265336953134]
ParaDiffusion is an information-enriched diffusion model for the paragraph-to-image generation task.
It delves into the transference of the extensive semantic comprehension capabilities of large language models to the task of image generation.
The code and dataset will be released to foster community research on long-text alignment.
arXiv Detail & Related papers (2023-11-24T05:17:01Z) - AnyText: Multilingual Visual Text Generation And Editing [18.811943975513483]
We introduce AnyText, a diffusion-based multilingual visual text generation and editing model.
AnyText can write characters in multiple languages; to the best of our knowledge, this is the first work to address multilingual visual text generation.
We contribute AnyWord-3M, the first large-scale dataset of multilingual text images, containing 3 million image-text pairs with OCR annotations in multiple languages.
arXiv Detail & Related papers (2023-11-06T12:10:43Z) - UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)