Zero-Shot Styled Text Image Generation, but Make It Autoregressive
- URL: http://arxiv.org/abs/2503.17074v2
- Date: Mon, 24 Mar 2025 17:23:51 GMT
- Title: Zero-Shot Styled Text Image Generation, but Make It Autoregressive
- Authors: Vittorio Pippi, Fabio Quattrini, Silvia Cascianelli, Alessio Tonioni, Rita Cucchiara
- Abstract summary: Styled Handwritten Text Generation (HTG) has recently received attention from the computer vision and document analysis communities. We propose a novel framework for text image generation, dubbed Emuru. Our approach leverages a powerful text image representation model (a variational autoencoder) combined with an autoregressive Transformer.
- Score: 34.09957000751439
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Styled Handwritten Text Generation (HTG) has recently received attention from the computer vision and document analysis communities, which have developed several solutions, either GAN- or diffusion-based, that achieved promising results. Nonetheless, these strategies fail to generalize to novel styles and have technical constraints, particularly in terms of maximum output length and training efficiency. To overcome these limitations, in this work, we propose a novel framework for text image generation, dubbed Emuru. Our approach leverages a powerful text image representation model (a variational autoencoder) combined with an autoregressive Transformer. Our approach enables the generation of styled text images conditioned on textual content and style examples, such as specific fonts or handwriting styles. We train our model solely on a diverse, synthetic dataset of English text rendered in over 100,000 typewritten and calligraphy fonts, which gives it the capability to reproduce unseen styles (both fonts and users' handwriting) in zero-shot. To the best of our knowledge, Emuru is the first autoregressive model for HTG, and the first designed specifically for generalization to novel styles. Moreover, our model generates images without background artifacts, which are easier to use for downstream applications. Extensive evaluation on both typewritten and handwritten, any-length text image generation scenarios demonstrates the effectiveness of our approach.
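The pipeline the abstract describes has two stages: a VAE that compresses text-line images into a compact latent sequence, and an autoregressive Transformer that predicts that sequence conditioned on the target text and a style example, after which the VAE decoder renders pixels. The sketch below is one plausible wiring of such a pipeline; the module names, layer sizes, and the fixed-step continuous-regression loop are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of a VAE + autoregressive-Transformer pipeline for
# styled text image generation, in the spirit of the Emuru abstract.
# Every architectural detail here is an assumption for exposition only.
import torch
import torch.nn as nn

class StyledTextGenerator(nn.Module):
    def __init__(self, vocab_size=256, d_model=512, n_layers=8, latent_dim=64):
        super().__init__()
        # Toy VAE: maps RGB text-line images to a compact latent strip and back.
        self.vae_enc = nn.Conv2d(3, latent_dim, kernel_size=4, stride=4)
        self.vae_dec = nn.ConvTranspose2d(latent_dim, 3, kernel_size=4, stride=4)
        # Embeddings for target characters; projections between latent and model dims.
        self.char_emb = nn.Embedding(vocab_size, d_model)
        self.lat_in = nn.Linear(latent_dim, d_model)
        self.lat_out = nn.Linear(d_model, latent_dim)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)

    def encode_style(self, style_img):
        # (B, 3, H, W) -> (B, T, latent_dim): flatten the latent grid to a sequence.
        z = self.vae_enc(style_img)
        return z.flatten(2).transpose(1, 2)

    @torch.no_grad()
    def generate(self, style_img, char_ids, steps):
        # Condition on style latents + character embeddings, then regress one
        # continuous latent "token" per step (causal masking omitted for brevity).
        seq = torch.cat([self.lat_in(self.encode_style(style_img)),
                         self.char_emb(char_ids)], dim=1)
        latents = []
        for _ in range(steps):
            h = self.transformer(seq)
            nxt = self.lat_out(h[:, -1:])          # next latent frame
            latents.append(nxt)
            seq = torch.cat([seq, self.lat_in(nxt)], dim=1)
        z = torch.cat(latents, dim=1)              # (B, steps, latent_dim)
        # Treat the sequence as a 1-row latent strip and decode it to pixels.
        return self.vae_dec(z.transpose(1, 2).unsqueeze(2))

model = StyledTextGenerator()
style = torch.randn(1, 3, 64, 256)             # example style image
chars = torch.randint(0, 256, (1, 20))         # target text as character ids
image = model.generate(style, chars, steps=32) # (1, 3, 4, 128) toy output
```

A real system would pretrain the VAE, apply a causal attention mask, and stop on a learned end-of-sequence signal rather than a fixed step count; the abstract only commits to the VAE-plus-autoregressive-Transformer combination.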
Related papers
- Text-Conditioned Diffusion Model for High-Fidelity Korean Font Generation [7.281838207050202]
Automatic font generation (AFG) is the process of creating a new font from only a few example images of the desired style.
We present a diffusion-based AFG method which generates high-quality, diverse Korean font images.
A key innovation is our text encoder, which processes phonetic representations to generate accurate and contextually correct characters.
arXiv Detail & Related papers (2025-04-30T05:24:49Z)
- Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models [76.68654868991517]
Long-form text in images, such as paragraphs in slides or documents, remains a major challenge for current generative models.
We introduce a novel text-focused, binary tokenizer optimized for capturing detailed scene text features.
We develop ModelName, a multimodal autoregressive model that excels in generating high-quality long-text images with unprecedented fidelity.
arXiv Detail & Related papers (2025-03-26T03:44:25Z)
- One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt [101.17660804110409]
Text-to-image generation models can create high-quality images from input prompts.
However, they struggle to generate images that consistently preserve subject identity, a requirement for storytelling.
We propose a novel training-free method for consistent text-to-image generation.
arXiv Detail & Related papers (2025-01-23T10:57:22Z)
- JoyType: A Robust Design for Multilingual Visual Text Creation [14.441897362967344]
We introduce a novel approach for multilingual visual text creation, named JoyType.
JoyType is designed to maintain the font style of text during the image generation process.
Our evaluations, based on both visual and accuracy metrics, demonstrate that JoyType significantly outperforms existing state-of-the-art methods.
arXiv Detail & Related papers (2024-09-26T04:23:17Z)
- ARTIST: Improving the Generation of Text-rich Images with Disentangled Diffusion Models and Large Language Models [52.23899502520261]
We introduce a novel framework named ARTIST, which incorporates a dedicated textual diffusion model to focus specifically on learning text structures. We then finetune a visual diffusion model, enabling it to assimilate textual structure information from the pretrained textual model. This disentangled architecture design and training strategy significantly enhance the text rendering ability of diffusion models for text-rich image generation; a schematic sketch of this disentangled design follows this entry.
arXiv Detail & Related papers (2024-06-17T19:31:24Z)
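The ARTIST summary describes a disentangled design in which a pretrained, frozen textual diffusion model supplies text-structure features that a trainable visual diffusion model learns to absorb. The schematic below shows one generic way to inject features from a frozen auxiliary denoiser into a trainable one; the toy modules, shapes, and additive injection point are assumptions for illustration, not the ARTIST architecture.

```python
# Schematic of cross-model feature injection: a frozen "textual" denoiser
# provides structure features that a trainable "visual" denoiser consumes.
# All shapes and injection points are illustrative assumptions.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.SiLU(),
                                  nn.Conv2d(ch, ch, 3, padding=1), nn.SiLU())
        self.head = nn.Conv2d(ch, 3, 3, padding=1)

    def features(self, x):
        return self.body(x)          # intermediate features to share

    def forward(self, x):
        return self.head(self.body(x))

textual = TinyDenoiser()             # stand-in for the pretrained text-structure model
for p in textual.parameters():       # frozen, as the summary implies
    p.requires_grad_(False)

visual = TinyDenoiser()              # trainable visual model
inject = nn.Conv2d(64, 64, 1)        # learned projection of the textual features

def denoise_step(noisy):
    # Visual features are augmented additively with projected textual features.
    fused = visual.body(noisy) + inject(textual.features(noisy))
    return visual.head(fused)

pred = denoise_step(torch.randn(2, 3, 32, 32))   # (2, 3, 32, 32)
```

Timestep and text conditioning are omitted here; the point is only the frozen-to-trainable feature pathway the summary implies.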
- CustomText: Customized Textual Image Generation using Diffusion Models [13.239661107392324]
Textual image generation spans diverse fields like advertising, education, product packaging, social media, information visualization, and branding.
Despite recent strides in language-guided image synthesis using diffusion models, current models excel in image generation but struggle with accurate text rendering and offer limited control over font attributes.
In this paper, we aim to enhance the synthesis of high-quality images with precise text customization, thereby contributing to the advancement of image generation models.
arXiv Detail & Related papers (2024-05-21T06:43:03Z)
- ControlStyle: Text-Driven Stylized Image Generation Using Diffusion Priors [105.37795139586075]
We propose a new task for "stylizing" text-to-image models, namely text-driven stylized image generation.
We present a new diffusion model (ControlStyle) by upgrading a pre-trained text-to-image model with a trainable modulation network; a generic sketch of this pattern follows this entry.
Experiments demonstrate the effectiveness of our ControlStyle in producing more visually pleasing and artistic results.
arXiv Detail & Related papers (2023-11-09T15:50:52Z)
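The ControlStyle summary pairs a pre-trained text-to-image model with a trainable modulation network. One common way to realize that pattern is FiLM-style per-channel scale-and-shift modulation, sketched below with a stand-in backbone; the modulator design and the style embedding are generic assumptions, not the paper's network.

```python
# Generic sketch of a trainable modulation network over a frozen backbone,
# in the spirit of the ControlStyle summary. FiLM-style scale/shift is an
# assumption; the paper's actual modulation network is not specified here.
import torch
import torch.nn as nn

class FiLMModulator(nn.Module):
    """Predicts per-channel scale and shift from a style embedding."""
    def __init__(self, style_dim, n_channels):
        super().__init__()
        self.to_scale_shift = nn.Linear(style_dim, 2 * n_channels)

    def forward(self, features, style_emb):
        # features: (B, C, H, W); style_emb: (B, style_dim)
        scale, shift = self.to_scale_shift(style_emb).chunk(2, dim=-1)
        return features * (1 + scale[:, :, None, None]) + shift[:, :, None, None]

backbone = nn.Conv2d(3, 64, 3, padding=1)   # stand-in for a frozen T2I block
for p in backbone.parameters():
    p.requires_grad_(False)

modulator = FiLMModulator(style_dim=128, n_channels=64)  # the only trainable part
feats = backbone(torch.randn(2, 3, 32, 32))
styled = modulator(feats, torch.randn(2, 128))           # (2, 64, 32, 32)
```

Keeping the backbone frozen and training only the modulator is what lets such designs add style control without degrading the pretrained model's generation quality.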
- GlyphDiffusion: Text Generation as Image Generation [100.98428068214736]
We propose GlyphDiffusion, a novel diffusion approach for text generation via text-guided image generation.
Our key idea is to render the target text as a glyph image containing the visual language content; a minimal rendering sketch follows this entry.
Our model also achieves significant improvements over recent diffusion-based text generation models.
arXiv Detail & Related papers (2023-04-25T02:14:44Z)
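The first step GlyphDiffusion's summary names, rendering the target text as a glyph image, can be reproduced with any raster text renderer. The snippet below uses Pillow as one convenient option; the font path, canvas size, and margins are placeholder assumptions.

```python
# Rendering target text as a glyph image, the first step the GlyphDiffusion
# summary describes. Pillow is one convenient renderer; the font path and
# canvas size below are placeholder assumptions.
from PIL import Image, ImageDraw, ImageFont

def render_glyph_image(text, width=256, height=64,
                       font_path="DejaVuSans.ttf", font_size=32):
    """Draw `text` in black on a white canvas and return the image."""
    canvas = Image.new("L", (width, height), color=255)   # white background
    draw = ImageDraw.Draw(canvas)
    font = ImageFont.truetype(font_path, font_size)
    draw.text((8, (height - font_size) // 2), text, font=font, fill=0)
    return canvas

glyph = render_glyph_image("hello world")
glyph.save("glyph.png")   # this rendered image is what the diffusion model targets
```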
- GenText: Unsupervised Artistic Text Generation via Decoupled Font and Texture Manipulation [30.654807125764965]
We propose a novel approach, namely GenText, to achieve general artistic text style transfer.
Specifically, our work incorporates three different stages: stylization, destylization, and font transfer.
Since paired artistic text images are difficult to acquire, our model is designed for the unsupervised setting.
arXiv Detail & Related papers (2022-07-20T04:42:47Z)
- Towards Open-World Text-Guided Face Image Generation and Manipulation [52.83401421019309]
We propose a unified framework for both face image generation and manipulation.
Our method supports open-world scenarios, with both image and text guidance, without any re-training, fine-tuning, or post-processing.
arXiv Detail & Related papers (2021-04-18T16:56:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.