AnyText: Multilingual Visual Text Generation And Editing
- URL: http://arxiv.org/abs/2311.03054v5
- Date: Wed, 21 Feb 2024 07:12:45 GMT
- Title: AnyText: Multilingual Visual Text Generation And Editing
- Authors: Yuxiang Tuo, Wangmeng Xiang, Jun-Yan He, Yifeng Geng, Xuansong Xie
- Abstract summary: We introduce AnyText, a diffusion-based multilingual visual text generation and editing model.
AnyText can write characters in multiple languages, to the best of our knowledge, this is the first work to address multilingual visual text generation.
We contribute the first large-scale multilingual text images dataset, AnyWord-3M, containing 3 million image-text pairs with OCR annotations in multiple languages.
- Score: 18.811943975513483
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion model based Text-to-Image has achieved impressive achievements
recently. Although current technology for synthesizing images is highly
advanced and capable of generating images with high fidelity, it is still
possible to give the show away when focusing on the text area in the generated
image. To address this issue, we introduce AnyText, a diffusion-based
multilingual visual text generation and editing model, that focuses on
rendering accurate and coherent text in the image. AnyText comprises a
diffusion pipeline with two primary elements: an auxiliary latent module and a
text embedding module. The former uses inputs like text glyph, position, and
masked image to generate latent features for text generation or editing. The
latter employs an OCR model for encoding stroke data as embeddings, which blend
with image caption embeddings from the tokenizer to generate texts that
seamlessly integrate with the background. We employed text-control diffusion
loss and text perceptual loss for training to further enhance writing accuracy.
AnyText can write characters in multiple languages, to the best of our
knowledge, this is the first work to address multilingual visual text
generation. It is worth mentioning that AnyText can be plugged into existing
diffusion models from the community for rendering or editing text accurately.
After conducting extensive evaluation experiments, our method has outperformed
all other approaches by a significant margin. Additionally, we contribute the
first large-scale multilingual text images dataset, AnyWord-3M, containing 3
million image-text pairs with OCR annotations in multiple languages. Based on
AnyWord-3M dataset, we propose AnyText-benchmark for the evaluation of visual
text generation accuracy and quality. Our project will be open-sourced on
https://github.com/tyxsspa/AnyText to improve and promote the development of
text generation technology.
Related papers
- AnyText2: Visual Text Generation and Editing With Customizable Attributes [10.24874245687826]
This paper introduces AnyText2, a novel method that enables precise control over multilingual text attributes in natural scene image generation and editing.
Compared to its predecessor, AnyText, our new approach not only enhances image realism but also achieves a 19.8% increase in inference speed.
As an extension of AnyText, this method allows for customization of attributes for each line of text, leading to improvements of 3.3% and 9.3% in text accuracy for Chinese and English, respectively.
arXiv Detail & Related papers (2024-11-22T03:31:56Z) - TextMastero: Mastering High-Quality Scene Text Editing in Diverse Languages and Styles [12.182588762414058]
Scene text editing aims to modify texts on images while maintaining the style of newly generated text similar to the original.
Recent works leverage diffusion models, showing improved results, yet still face challenges.
We present emphTextMastero - a carefully designed multilingual scene text editing architecture based on latent diffusion models (LDMs)
arXiv Detail & Related papers (2024-08-20T08:06:09Z) - Visual Text Generation in the Wild [67.37458807253064]
We propose a visual text generator (termed SceneVTG) which can produce high-quality text images in the wild.
The proposed SceneVTG significantly outperforms traditional rendering-based methods and recent diffusion-based methods in terms of fidelity and reasonability.
The generated images provide superior utility for tasks involving text detection and text recognition.
arXiv Detail & Related papers (2024-07-19T09:08:20Z) - AnyTrans: Translate AnyText in the Image with Large Scale Models [88.5887934499388]
This paper introduces AnyTrans, an all-encompassing framework for the task-Translate AnyText in the Image (TATI)
Our framework incorporates contextual cues from both textual and visual elements during translation.
We have meticulously compiled a test dataset called MTIT6, which consists of multilingual text image translation data from six language pairs.
arXiv Detail & Related papers (2024-06-17T11:37:48Z) - Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using
Diffusion Models [63.99110667987318]
We present DiffText, a pipeline that seamlessly blends foreground text with the background's intrinsic features.
With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors.
arXiv Detail & Related papers (2023-11-28T06:51:28Z) - TextDiffuser-2: Unleashing the Power of Language Models for Text
Rendering [118.30923824681642]
TextDiffuser-2 aims to unleash the power of language models for text rendering.
We utilize the language model within the diffusion model to encode the position and texts at the line level.
We conduct extensive experiments and incorporate user studies involving human participants as well as GPT-4V.
arXiv Detail & Related papers (2023-11-28T04:02:40Z) - TextDiffuser: Diffusion Models as Text Painters [118.30923824681642]
We introduce TextDiffuser, focusing on generating images with visually appealing text that is coherent with backgrounds.
We contribute the first large-scale text images dataset with OCR annotations, MARIO-10M, containing 10 million image-text pairs.
We show that TextDiffuser is flexible and controllable to create high-quality text images using text prompts alone or together with text template images, and conduct text inpainting to reconstruct incomplete images with text.
arXiv Detail & Related papers (2023-05-18T10:16:19Z) - GlyphDiffusion: Text Generation as Image Generation [100.98428068214736]
We propose GlyphDiffusion, a novel diffusion approach for text generation via text-guided image generation.
Our key idea is to render the target text as a glyph image containing visual language content.
Our model also makes significant improvements compared to the recent diffusion model.
arXiv Detail & Related papers (2023-04-25T02:14:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.