One-Shot Multilingual Font Generation Via ViT
- URL: http://arxiv.org/abs/2412.11342v1
- Date: Sun, 15 Dec 2024 23:52:35 GMT
- Title: One-Shot Multilingual Font Generation Via ViT
- Authors: Zhiheng Wang, Jiarui Liu,
- Abstract summary: Font design poses unique challenges for logographic languages like Chinese, Japanese, and Korean.
This paper introduces a novel Vision Transformer (ViT)-based model for multi-language font generation.
- Score: 2.023301270280465
- Abstract: Font design poses unique challenges for logographic languages like Chinese, Japanese, and Korean (CJK), where thousands of unique characters must be individually crafted. This paper introduces a novel Vision Transformer (ViT)-based model for multi-language font generation, effectively addressing the complexities of both logographic and alphabetic scripts. By leveraging ViT and pretraining with a strong visual pretext task (Masked Autoencoding, MAE), our model eliminates the need for complex design components in prior frameworks while achieving comprehensive results with enhanced generalizability. Remarkably, it can generate high-quality fonts across multiple languages for unseen, unknown, and even user-crafted characters. Additionally, we integrate a Retrieval-Augmented Guidance (RAG) module to dynamically retrieve and adapt style references, improving scalability and real-world applicability. We evaluated our approach in various font generation tasks, demonstrating its effectiveness, adaptability, and scalability.
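The abstract outlines the pipeline at a high level; below is a minimal, self-contained sketch of that idea, not the authors' released implementation. Module names, dimensions, and the cross-attention fusion step are illustrative assumptions; only the broad structure (ViT encoders that would be MAE-pretrained, plus a one-shot style reference that a RAG-style lookup could supply) follows the abstract.
```python
# Hypothetical sketch of a ViT-based one-shot font generator, NOT the paper's code.
import torch
import torch.nn as nn


def patchify(img: torch.Tensor, patch: int = 8) -> torch.Tensor:
    """Split a (B, 1, H, W) glyph image into flattened patches (B, N, patch*patch)."""
    b, c, h, w = img.shape
    img = img.unfold(2, patch, patch).unfold(3, patch, patch)
    return img.reshape(b, -1, c * patch * patch)


class GlyphViT(nn.Module):
    """A plain ViT encoder; in practice its weights would come from MAE pretraining."""
    def __init__(self, patch_dim=64, dim=256, depth=4, heads=4):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, patches):
        return self.encoder(self.embed(patches))


class FontGenerator(nn.Module):
    """Fuses content tokens (standard glyph) with style tokens (reference glyph)."""
    def __init__(self, dim=256, patch_dim=64):
        super().__init__()
        self.content_enc = GlyphViT(patch_dim, dim)
        self.style_enc = GlyphViT(patch_dim, dim)
        self.fuse = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.to_pixels = nn.Linear(dim, patch_dim)

    def forward(self, content_img, style_ref):
        c = self.content_enc(patchify(content_img))
        s = self.style_enc(patchify(style_ref))
        fused, _ = self.fuse(c, s, s)      # content queries attend to style tokens
        return self.to_pixels(fused)       # predicted patches of the styled glyph


# One-shot usage: a single retrieved style reference drives generation.
model = FontGenerator()
content = torch.rand(1, 1, 64, 64)   # standard-form glyph of the target character
style = torch.rand(1, 1, 64, 64)     # style reference glyph (e.g. retrieved by RAG)
out_patches = model(content, style)  # (1, 64, 64): 64 patches of 64 pixels each
```
In a real system the MAE pretraining would supply the encoder weights, and the style reference would be retrieved from a font library rather than passed in by hand.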
Related papers
- Bringing Characters to New Stories: Training-Free Theme-Specific Image Generation via Dynamic Visual Prompting [71.29100512700064]
We present T-Prompter, a training-free method for theme-specific image generation.
T-Prompter integrates reference images into generative models, allowing users to seamlessly specify the target theme.
Our approach enables consistent story generation, character design, realistic character generation, and style-guided image generation.
arXiv Detail & Related papers (2025-01-26T19:01:19Z)
- Towards Visual Text Design Transfer Across Languages [49.78504488452978]
We introduce the novel task of Multimodal Style Translation, along with MuST-Bench, a benchmark designed to evaluate the ability of visual text generation models to perform this translation across different writing systems.
In response, we introduce SIGIL, a framework for multimodal style translation that eliminates the need for style descriptions.
arXiv Detail & Related papers (2024-10-24T15:15:01Z)
- StrucTexTv3: An Efficient Vision-Language Model for Text-rich Image Perception, Comprehension, and Beyond [68.0107158115377]
We have crafted an efficient vision-language model, StrucTexTv3, tailored to tackle various intelligent tasks for text-rich images.
We enhance the perception and comprehension abilities of StrucTexTv3 through instruction learning.
Our method achieved SOTA results in text-rich image perception tasks, and significantly improved performance in comprehension tasks.
arXiv Detail & Related papers (2024-05-31T16:55:04Z)
- DiffCJK: Conditional Diffusion Model for High-Quality and Wide-coverage CJK Character Generation [1.0044057719679087]
We propose a novel diffusion method for generating glyphs in a targeted style from a single conditioned, standard glyph form.
Our approach shows remarkable zero-shot generalization capabilities for non-CJK but Chinese-inspired scripts.
In summary, our proposed method opens the door to high-quality, generative model-assisted font creation for CJK characters.
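As a rough illustration of glyph-conditioned diffusion in the spirit described above (not DiffCJK's actual architecture or noise schedule), a toy DDPM-style sampler conditioned on a standard-form glyph might look like this; the tiny denoiser, the linear schedule, and the channel-concatenation conditioning are placeholder assumptions.
```python
# Toy glyph-conditioned denoising diffusion sketch; illustrative only.
import torch
import torch.nn as nn

T = 100
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

class Denoiser(nn.Module):
    """Predicts noise from the noisy styled glyph and the standard glyph
    (timestep embedding omitted in this toy)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, x_t, t, cond):
        # Conditioning by channel concatenation with the standard-form glyph.
        return self.net(torch.cat([x_t, cond], dim=1))

@torch.no_grad()
def sample(model, cond):
    x = torch.randn_like(cond)
    for t in reversed(range(T)):
        eps = model(x, t, cond)
        mean = (x - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x

standard_glyph = torch.rand(1, 1, 64, 64)   # conditioning: standard glyph form
styled_glyph = sample(Denoiser(), standard_glyph)
```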
arXiv Detail & Related papers (2024-04-08T05:58:07Z)
- FontCLIP: A Semantic Typography Visual-Language Model for Multilingual Font Applications [27.609008096617057]
FontCLIP is a model that connects the semantic understanding of a large vision-language model with typographical knowledge.
We integrate typography-specific knowledge into the comprehensive vision-language knowledge of a pretrained CLIP model.
FontCLIP's dual-modality and generalization abilities enable multilingual and cross-lingual font retrieval and letter shape optimization.
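A minimal sketch of cross-modal font retrieval in this spirit, using a stock pretrained CLIP from Hugging Face as a stand-in (FontCLIP's typography-finetuned weights are not assumed); the font sample files and the text query are illustrative.
```python
# Text-to-font retrieval with a generic CLIP model; file paths are hypothetical.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Rendered glyph samples for a small font library (hypothetical files).
font_images = [Image.open(p) for p in ["serif_sample.png", "handwritten_sample.png"]]
query = "a playful handwritten font"

inputs = processor(text=[query], images=font_images, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# logits_per_text: similarity of the text query to each font sample;
# the highest-scoring sample is the retrieved font.
best = out.logits_per_text.argmax(dim=-1).item()
print("retrieved font index:", best)
```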
arXiv Detail & Related papers (2024-03-11T06:08:16Z)
- VLIS: Unimodal Language Models Guide Multimodal Language Generation [23.094728230459125]
We introduce Visual-Language models as Importance Sampling weights (VLIS).
It combines the visual conditioning capability of vision-language models with the language understanding of unimodal text-only language models without further training.
VLIS improves vision-language models on diverse tasks, including commonsense understanding and complex text generation.
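A minimal sketch of the reweighting idea: next-token scores from a text-only language model are adjusted by how much the image shifts the vision-language model's belief in each token. The tensor names and the fixed weight are illustrative assumptions, not the paper's exact formulation.
```python
# Importance-sampling-style reweighting of next-token scores; illustrative only.
import torch

def reweight_next_token(
    lm_logprobs: torch.Tensor,          # text-only LM: log p(token | text)
    vlm_logprobs_img: torch.Tensor,     # VLM: log p(token | image, text)
    vlm_logprobs_noimg: torch.Tensor,   # VLM: log p(token | text), image dropped
    weight: float = 1.0,
) -> torch.Tensor:
    """Return adjusted log-scores over the vocabulary for the next token."""
    visual_evidence = vlm_logprobs_img - vlm_logprobs_noimg  # PMI-style weight
    return lm_logprobs + weight * visual_evidence

# Usage with dummy distributions over a 10-token vocabulary.
vocab = 10
scores = reweight_next_token(
    torch.log_softmax(torch.randn(vocab), dim=-1),
    torch.log_softmax(torch.randn(vocab), dim=-1),
    torch.log_softmax(torch.randn(vocab), dim=-1),
)
next_token = scores.argmax().item()
```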
arXiv Detail & Related papers (2023-10-15T07:58:52Z)
- Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens carry high-level semantics comparable to words and support a dynamic sequence length that varies with the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z)
- GAS-Net: Generative Artistic Style Neural Networks for Fonts [8.569974263629218]
This project aims to develop a few-shot cross-lingual font generator based on AGIS-Net.
Our approaches include redesigning the encoder and the loss function.
We validate our method on multiple languages and datasets.
arXiv Detail & Related papers (2022-12-06T11:23:16Z)
- Scalable Font Reconstruction with Dual Latent Manifolds [55.29525824849242]
We propose a deep generative model that performs typography analysis and font reconstruction.
Our approach enables us to massively scale up the number of character types we can effectively model.
We evaluate on the task of font reconstruction over various datasets representing character types of many languages.
arXiv Detail & Related papers (2021-09-10T20:37:43Z)
- Adaptive Text Recognition through Visual Matching [86.40870804449737]
We introduce a new model that exploits the repetitive nature of characters in languages.
By doing this, we turn text recognition into a shape matching problem.
We show that it can handle challenges that traditional architectures are not able to solve without expensive retraining.
arXiv Detail & Related papers (2020-09-14T17:48:53Z)