WriteViT: Handwritten Text Generation with Vision Transformer
- URL: http://arxiv.org/abs/2505.13235v1
- Date: Mon, 19 May 2025 15:17:53 GMT
- Title: WriteViT: Handwritten Text Generation with Vision Transformer
- Authors: Dang Hoai Nam, Huynh Tong Dang Khoa, Vo Nguyen Le Duy
- Abstract summary: We introduce WriteViT, a one-shot handwritten text synthesis framework that incorporates Vision Transformers (ViT). WriteViT produces high-quality, style-consistent handwriting while maintaining strong recognition performance in low-resource scenarios. These results highlight the promise of transformer-based designs for multilingual handwriting generation and efficient style adaptation.
- Score: 7.10052009802944
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans can quickly generalize handwriting styles from a single example by intuitively separating content from style. Machines, however, struggle with this task, especially in low-data settings, often missing subtle spatial and stylistic cues. Motivated by this gap, we introduce WriteViT, a one-shot handwritten text synthesis framework that incorporates Vision Transformers (ViT), a family of models that have shown strong performance across various computer vision tasks. WriteViT integrates a ViT-based Writer Identifier for extracting style embeddings, a multi-scale generator built with Transformer encoder-decoder blocks enhanced by conditional positional encoding (CPE), and a lightweight ViT-based recognizer. While previous methods typically rely on CNNs or CRNNs, our design leverages transformers in key components to better capture both fine-grained stroke details and higher-level style information. Although handwritten text synthesis has been widely explored, its application to Vietnamese -- a language rich in diacritics and complex typography -- remains limited. Experiments on Vietnamese and English datasets demonstrate that WriteViT produces high-quality, style-consistent handwriting while maintaining strong recognition performance in low-resource scenarios. These results highlight the promise of transformer-based designs for multilingual handwriting generation and efficient style adaptation.
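As a rough, hedged illustration of two components named in the abstract, the sketch below shows a CPVT-style conditional positional encoding (a depthwise convolution over the patch-token grid) and a small ViT-style writer identifier that mean-pools patch tokens into a one-shot style embedding. The module names (ConditionalPosEnc, WriterIdentifier), dimensions, and pooling choice are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (not the authors' code) of two WriteViT-style ingredients:
# (1) conditional positional encoding over a grid of patch tokens,
# (2) a tiny ViT-style writer identifier producing a style embedding.
import torch
import torch.nn as nn


class ConditionalPosEnc(nn.Module):
    """CPVT-style CPE: a depthwise conv over the 2-D token grid, added back to
    the tokens, so position information comes from local context rather than
    a fixed-length table."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (batch, h*w, dim); reshape to an image-like grid for the conv
        b, n, d = tokens.shape
        grid = tokens.transpose(1, 2).reshape(b, d, h, w)
        return tokens + self.proj(grid).flatten(2).transpose(1, 2)


class WriterIdentifier(nn.Module):
    """Tiny ViT-style encoder: patchify one style sample, run Transformer
    encoder layers with CPE, and mean-pool tokens into a style embedding."""

    def __init__(self, dim: int = 256, patch: int = 16, depth: int = 4, heads: int = 4):
        super().__init__()
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        self.cpe = ConditionalPosEnc(dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, style_img: torch.Tensor) -> torch.Tensor:
        # style_img: (batch, 1, H, W) grayscale handwriting sample
        feat = self.patch_embed(style_img)        # (batch, dim, H/patch, W/patch)
        h, w = feat.shape[-2:]
        tokens = feat.flatten(2).transpose(1, 2)  # (batch, h*w, dim)
        tokens = self.encoder(self.cpe(tokens, h, w))
        return tokens.mean(dim=1)                 # (batch, dim) style embedding


if __name__ == "__main__":
    model = WriterIdentifier()
    style = torch.randn(2, 1, 64, 256)            # two one-shot style samples
    print(model(style).shape)                     # torch.Size([2, 256])
```

A convolutional, conditional positional encoding is attractive for handwriting because it adapts to variable-width images instead of relying on a fixed-length positional table; how WriteViT actually wires CPE into its multi-scale generator may differ from this toy.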
Related papers
- Qwen-Image Technical Report [86.46471547116158]
We present Qwen-Image, an image generation foundation model that achieves significant advances in complex text rendering and precise image editing. We design a comprehensive data pipeline that includes large-scale data collection, filtering, annotation, synthesis, and balancing. Qwen-Image performs exceptionally well in alphabetic languages such as English, and also achieves remarkable progress on more challenging logographic languages like Chinese.
arXiv Detail & Related papers (2025-08-04T11:49:20Z)
- One-Shot Multilingual Font Generation Via ViT [2.023301270280465]
Font design poses unique challenges for logographic languages like Chinese, Japanese, and Korean. This paper introduces a novel Vision Transformer (ViT)-based model for multi-language font generation.
arXiv Detail & Related papers (2024-12-15T23:52:35Z)
- Handwriting Recognition in Historical Documents with Multimodal LLM [0.0]
Multimodal Language Models have demonstrated effectiveness in performing OCR and computer vision tasks with few-shot prompting.
I evaluate the accuracy of handwritten document transcriptions generated by Gemini against current state-of-the-art Transformer-based methods.
arXiv Detail & Related papers (2024-10-31T15:32:14Z)
- Towards Visual Text Design Transfer Across Languages [49.78504488452978]
We introduce the novel task of multimodal style translation, together with MuST-Bench, a benchmark designed to evaluate the ability of visual text generation models to perform translation across different writing systems.
In response, we introduce SIGIL, a framework for multimodal style translation that eliminates the need for style descriptions.
arXiv Detail & Related papers (2024-10-24T15:15:01Z)
- Story Visualization by Online Text Augmentation with Context Memory [64.86944645907771]
We propose a novel memory architecture for the Bi-directional Transformer framework with an online text augmentation.
The proposed method significantly outperforms the state of the art in various metrics including FID, character F1, frame accuracy, BLEU-2/3, and R-precision.
arXiv Detail & Related papers (2023-08-15T05:08:12Z)
- ViTEraser: Harnessing the Power of Vision Transformers for Scene Text Removal with SegMIM Pretraining [58.241008246380254]
Scene text removal (STR) aims at replacing text strokes in natural scenes with visually coherent backgrounds.
Recent STR approaches rely on iterative refinements or explicit text masks, resulting in high complexity and sensitivity to the accuracy of text localization.
We propose a simple-yet-effective ViT-based text eraser, dubbed ViTEraser.
arXiv Detail & Related papers (2023-06-21T08:47:20Z)
- Handwritten Text Generation from Visual Archetypes [25.951540903019467]
We devise a Transformer-based model for Few-Shot styled handwritten text generation.
We obtain a robust representation of unseen writers' calligraphy by exploiting specific pre-training on a large synthetic dataset.
arXiv Detail & Related papers (2023-03-27T14:58:20Z)
- Boosting Modern and Historical Handwritten Text Recognition with Deformable Convolutions [52.250269529057014]
Handwritten Text Recognition (HTR) in free-layout pages is a challenging image understanding task.
We propose to adopt deformable convolutions, which can deform depending on the input at hand and better adapt to the geometric variations of the text.
arXiv Detail & Related papers (2022-08-17T06:55:54Z)
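The deformable-convolution idea summarized in the entry above can be sketched generically: a small ordinary convolution predicts per-location sampling offsets from the input, and torchvision's DeformConv2d then samples the feature map at those shifted positions, so the kernel follows the geometry of the handwriting. This is a minimal, assumption-laden illustration (block name and channel sizes are made up), not the paper's architecture.

```python
# Minimal sketch of input-dependent sampling with deformable convolutions.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class DeformableBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        # 2 offsets (dx, dy) per kernel tap, predicted from the input itself
        self.offset_pred = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        self.deform = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_pred(x)      # (B, 2*k*k, H, W), input-dependent
        return self.deform(x, offsets)     # kernel taps adapt to the text's geometry


if __name__ == "__main__":
    block = DeformableBlock(32, 64)
    feats = torch.randn(1, 32, 32, 128)    # a text-line feature map
    print(block(feats).shape)              # torch.Size([1, 64, 32, 128])
```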
- SLOGAN: Handwriting Style Synthesis for Arbitrary-Length and Out-of-Vocabulary Text [35.83345711291558]
We propose a novel method that can synthesize parameterized and controllable handwriting styles for arbitrary-length and out-of-vocabulary text.
We embed the text content by providing an easily obtainable printed style image, so that the diversity of the content can be flexibly achieved.
Our method can synthesize words that are not included in the training vocabulary and with various new styles.
arXiv Detail & Related papers (2022-02-23T12:13:27Z)
- Handwriting Transformers [98.3964093654716]
We propose a transformer-based styled handwritten text image generation approach, HWT, that strives to learn style-content entanglement as well as global and local writing style patterns.
The proposed HWT captures the long and short range relationships within the style examples through a self-attention mechanism.
Our proposed HWT generates realistic styled handwritten text images and significantly outperforms the state of the art.
arXiv Detail & Related papers (2021-04-08T17:59:43Z)
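For the self-attention ingredient highlighted in the HWT summary above, a minimal, hypothetical sketch (not HWT's code; the module name and sizes are assumptions) is a plain self-attention layer over the patch tokens of a style example, letting every stroke-level token attend to every other one and thereby capture both short- and long-range style relationships.

```python
# Minimal sketch: self-attention over the patch tokens of a writer's style sample.
import torch
import torch.nn as nn


class StyleSelfAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, style_tokens: torch.Tensor) -> torch.Tensor:
        # style_tokens: (batch, num_patches, dim) from a writer's example images
        attended, _ = self.attn(style_tokens, style_tokens, style_tokens)
        return self.norm(style_tokens + attended)   # residual + norm, ViT-style


if __name__ == "__main__":
    tokens = torch.randn(1, 120, 256)               # tokens from style sample patches
    print(StyleSelfAttention()(tokens).shape)       # torch.Size([1, 120, 256])
```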