JoyType: A Robust Design for Multilingual Visual Text Creation
- URL: http://arxiv.org/abs/2409.17524v1
- Date: Thu, 26 Sep 2024 04:23:17 GMT
- Title: JoyType: A Robust Design for Multilingual Visual Text Creation
- Authors: Chao Li, Chen Jiang, Xiaolong Liu, Jun Zhao, Guoxin Wang
- Abstract summary: We introduce a novel approach for multilingual visual text creation, named JoyType.
JoyType is designed to maintain the font style of text during the image generation process.
Our evaluations, based on both visual and accuracy metrics, demonstrate that JoyType significantly outperforms existing state-of-the-art methods.
- Score: 14.441897362967344
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating images with accurately represented text, especially in non-Latin languages, poses a significant challenge for diffusion models. Existing approaches, such as the integration of hint condition diagrams via auxiliary networks (e.g., ControlNet), have made strides towards addressing this issue. However, diffusion models often fall short in tasks requiring controlled text generation, such as specifying particular fonts or producing text in small fonts. In this paper, we introduce a novel approach for multilingual visual text creation, named JoyType, designed to maintain the font style of text during the image generation process. Our methodology begins with assembling a training dataset, JoyType-1M, comprising 1 million pairs of data. Each pair includes an image, its description, and glyph instructions corresponding to the font style within the image. We then developed a text control network, Font ControlNet, tasked with extracting font style information to steer the image generation. To further enhance our model's ability to maintain font style, notably in generating small-font text, we incorporated a multi-layer OCR-aware loss into the diffusion process. This enhancement allows JoyType to direct text rendering using low-level descriptors. Our evaluations, based on both visual and accuracy metrics, demonstrate that JoyType significantly outperforms existing state-of-the-art methods. Additionally, JoyType can function as a plugin, facilitating the creation of varied image styles in conjunction with other stable diffusion models on HuggingFace and CivitAI. Our project is open-sourced on https://jdh-algo.github.io/JoyType/.
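The multi-layer OCR-aware loss mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the loss formula. The sketch assumes the loss compares intermediate features of an OCR encoder, taken at several layers, for the generated image and the target image, and sums weighted per-layer mean squared errors. The names `mse`, `multi_layer_feature_loss`, and `layer_weights` are hypothetical, and features are plain Python lists standing in for tensors.

```python
from typing import Optional, Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two flat feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multi_layer_feature_loss(
    generated: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
    layer_weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted sum of per-layer MSEs (assumed form, not taken from the paper).

    `generated[i]` and `target[i]` are the flattened features that a
    hypothetical OCR encoder produces at layer i for the generated image
    and the ground-truth image, respectively.
    """
    if len(generated) != len(target):
        raise ValueError("both inputs must cover the same number of layers")
    if layer_weights is None:
        # Assumption: equal weight per layer when none is specified.
        layer_weights = [1.0] * len(generated)
    if len(layer_weights) != len(generated):
        raise ValueError("one weight is required per layer")
    return sum(
        w * mse(g, t) for w, g, t in zip(layer_weights, generated, target)
    )


# Toy features for two layers; low-level layers come first.
gen_feats = [[0.0, 1.0], [2.0, 2.0]]
tgt_feats = [[0.0, 0.0], [2.0, 4.0]]
loss_equal = multi_layer_feature_loss(gen_feats, tgt_feats)
loss_weighted = multi_layer_feature_loss(gen_feats, tgt_feats, [1.0, 0.5])
```

In a real training loop such a term would presumably be added to the standard diffusion objective. Giving low-level layers their own weights is one plausible way to emphasize fine strokes in small-font text, but that weighting is an assumption here.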
Related papers
- CustomText: Customized Textual Image Generation using Diffusion Models [13.239661107392324]
Textual image generation spans diverse fields like advertising, education, product packaging, social media, information visualization, and branding.
Despite recent strides in language-guided image synthesis using diffusion models, current models excel in image generation but struggle with accurate text rendering and offer limited control over font attributes.
In this paper, we aim to enhance the synthesis of high-quality images with precise text customization, thereby contributing to the advancement of image generation models.
arXiv Detail & Related papers (2024-05-21T06:43:03Z)
- FontDiffuser: One-Shot Font Generation via Denoising Diffusion with Multi-Scale Content Aggregation and Style Contrastive Learning [45.696909070215476]
FontDiffuser is a diffusion-based image-to-image one-shot font generation method.
It consistently excels on complex characters and large style changes compared to previous methods.
arXiv Detail & Related papers (2023-12-19T13:23:20Z)
- GlyphDiffusion: Text Generation as Image Generation [100.98428068214736]
We propose GlyphDiffusion, a novel diffusion approach for text generation via text-guided image generation.
Our key idea is to render the target text as a glyph image containing visual language content.
Our model also makes significant improvements compared to the recent diffusion model.
arXiv Detail & Related papers (2023-04-25T02:14:44Z)
- CF-Font: Content Fusion for Few-shot Font Generation [63.79915037830131]
We propose a content fusion module (CFM) to project the content feature into a linear space defined by the content features of basis fonts.
Our method also allows optimizing the style representation vector of reference images.
We have evaluated our method on a dataset of 300 fonts with 6.5k characters each.
arXiv Detail & Related papers (2023-03-24T14:18:40Z)
- Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images with complex semantics from both aspects of input texts and images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
- DGFont++: Robust Deformable Generative Networks for Unsupervised Font Generation [19.473023811252116]
We propose a robust deformable generative network for unsupervised font generation (abbreviated as DGFont++).
To distinguish different styles, we train our model with a multi-task discriminator, which ensures that each style can be discriminated independently.
Experiments demonstrate that our model is able to generate character images of higher quality than state-of-the-art methods.
arXiv Detail & Related papers (2022-12-30T14:35:10Z)
- Diff-Font: Diffusion Model for Robust One-Shot Font Generation [110.45944936952309]
We propose a novel one-shot font generation method based on a diffusion model, named Diff-Font.
The proposed model aims to generate the entire font library by giving only one sample as the reference.
The well-trained Diff-Font is not only robust to font gap and font variation, but also achieves promising performance on difficult character generation.
arXiv Detail & Related papers (2022-12-12T13:51:50Z)
- Font Completion and Manipulation by Cycling Between Multi-Modality Representations [113.26243126754704]
We innovate to explore the generation of font glyphs as 2D graphic objects with the graph as an intermediate representation.
We formulate a cross-modality cycled image-to-image structure with a graph between an image encoder and an image.
Our model generates better results than both the image-to-image baseline and previous state-of-the-art methods for glyph completion.
arXiv Detail & Related papers (2021-08-30T02:43:29Z)
- DG-Font: Deformable Generative Networks for Unsupervised Font Generation [14.178381391124036]
We propose novel deformable generative networks for unsupervised font generation (DGFont).
We introduce a feature deformation skip connection (FDSC) which predicts pairs of displacement maps and employs the predicted maps to apply deformable convolution to the low-level feature maps from the content encoder.
Experiments demonstrate that our model generates characters of higher quality than state-of-the-art methods.
arXiv Detail & Related papers (2021-04-07T11:32:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.