Semi-Supervised Adaptation of Diffusion Models for Handwritten Text Generation
- URL: http://arxiv.org/abs/2412.15853v1
- Date: Fri, 20 Dec 2024 12:48:58 GMT
- Title: Semi-Supervised Adaptation of Diffusion Models for Handwritten Text Generation
- Authors: Kai Brandenbusch,
- Abstract summary: We present an extension of a latent DM for handwritten text generation.
Our proposed content encoder allows for different ways of conditioning the DM on textual and calligraphic features.
For adapting the model to a new unlabeled data set, we propose a semi-supervised training scheme.
- Score: 0.0
- License:
- Abstract: The generation of images of realistic looking, readable handwritten text is a challenging task which is referred to as handwritten text generation (HTG). Given a string and examples from a writer, the goal is to synthesize an image depicting the correctly spelled word in handwriting with the calligraphic style of the desired writer. An important application of HTG is the generation of training images in order to adapt downstream models for new data sets. With their success in natural image generation, diffusion models (DMs) have become the state-of-the-art approach in HTG. In this work, we present an extension of a latent DM for HTG to enable generation of writing styles not seen during training by learning style conditioning with a masked auto encoder. Our proposed content encoder allows for different ways of conditioning the DM on textual and calligraphic features. Additionally, we employ classifier-free guidance and explore the influence on the quality of the generated training images. For adapting the model to a new unlabeled data set, we propose a semi-supervised training scheme. We evaluate our approach on the IAM-database and use the RIMES-database to examine the generation of data not seen during training achieving improvements in this particularly promising application of DMs for HTG.
Related papers
- DiffusionPen: Towards Controlling the Style of Handwritten Text Generation [7.398476020996681]
DiffusionPen (DiffPen) is a 5-shot style handwritten text generation approach based on Latent Diffusion Models.
Our approach captures both textual and stylistic characteristics of seen and unseen words and styles, generating realistic handwritten samples.
Our method outperforms existing methods qualitatively and quantitatively, and its additional generated data can improve the performance of Handwriting Text Recognition (HTR) systems.
arXiv Detail & Related papers (2024-09-09T20:58:25Z) - ARTIST: Improving the Generation of Text-rich Images with Disentangled Diffusion Models and Large Language Models [52.23899502520261]
We introduce a novel framework named, ARTIST, which incorporates a dedicated textual diffusion model to focus on the learning of text structures specifically.
We finetune a visual diffusion model, enabling it to assimilate textual structure information from the pretrained textual model.
This disentangled architecture design and training strategy significantly enhance the text rendering ability of the diffusion models for text-rich image generation.
arXiv Detail & Related papers (2024-06-17T19:31:24Z) - HWD: A Novel Evaluation Score for Styled Handwritten Text Generation [36.416044687373535]
Styled Handwritten Text Generation (Styled HTG) is an important task in document analysis, aiming to generate text images with the handwriting of given reference images.
We devise the Handwriting Distance (HWD), tailored for HTG evaluation.
In particular, it works in the feature space of a network specifically trained to extract handwriting style features from the variable-lenght input images and exploits a perceptual distance to compare the subtle geometric features of handwriting.
arXiv Detail & Related papers (2023-10-31T09:44:27Z) - Self-Supervised Representation Learning for Online Handwriting Text
Classification [0.8594140167290099]
We propose the novel Part of Stroke Masking (POSM) as a pretext task for pretraining models to extract informative representations from the online handwriting of individuals in English and Chinese languages.
To evaluate the quality of the extracted representations, we use both intrinsic and extrinsic evaluation methods.
The pretrained models are fine-tuned to achieve state-of-the-art results in tasks such as writer identification, gender classification, and handedness classification.
arXiv Detail & Related papers (2023-10-10T14:07:49Z) - GlyphDiffusion: Text Generation as Image Generation [100.98428068214736]
We propose GlyphDiffusion, a novel diffusion approach for text generation via text-guided image generation.
Our key idea is to render the target text as a glyph image containing visual language content.
Our model also makes significant improvements compared to the recent diffusion model.
arXiv Detail & Related papers (2023-04-25T02:14:44Z) - WordStylist: Styled Verbatim Handwritten Text Generation with Latent
Diffusion Models [8.334487584550185]
We present a latent diffusion-based method for styled text-to-text-content-image generation on word-level.
Our proposed method is able to generate realistic word image samples from different writer styles.
We show that the proposed model produces samples that are aesthetically pleasing, help boosting text recognition performance, and get similar writer retrieval score as real data.
arXiv Detail & Related papers (2023-03-29T10:19:26Z) - On Advances in Text Generation from Images Beyond Captioning: A Case
Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z) - LAFITE: Towards Language-Free Training for Text-to-Image Generation [83.2935513540494]
We propose the first work to train text-to-image generation models without any text data.
Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model.
We obtain state-of-the-art results in the standard text-to-image generation tasks.
arXiv Detail & Related papers (2021-11-27T01:54:45Z) - Improving Generation and Evaluation of Visual Stories via Semantic
Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform synthesis text-to-image models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z) - ScrabbleGAN: Semi-Supervised Varying Length Handwritten Text Generation [0.9542023122304099]
We present ScrabbleGAN, a semi-supervised approach to synthesize handwritten text images.
ScrabbleGAN relies on a novel generative model which can generate images of words with an arbitrary length.
arXiv Detail & Related papers (2020-03-23T21:41:19Z) - XGPT: Cross-modal Generative Pre-Training for Image Captioning [80.26456233277435]
XGPT is a new method of Cross-modal Generative Pre-Training for Image Captioning.
It is designed to pre-train text-to-image caption generators through three novel generation tasks.
XGPT can be fine-tuned without any task-specific architecture modifications.
arXiv Detail & Related papers (2020-03-03T12:13:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.