Improving Text Generation on Images with Synthetic Captions
- URL: http://arxiv.org/abs/2406.00505v2
- Date: Wed, 23 Oct 2024 08:28:53 GMT
- Title: Improving Text Generation on Images with Synthetic Captions
- Authors: Jun Young Koh, Sang Hyun Park, Joy Song
- Abstract summary: Latent diffusion models such as SDXL and SD 1.5 have shown significant capability in generating realistic images.
We propose a low-cost approach by leveraging SDXL without any time-consuming training on large-scale datasets.
Our results demonstrate how our small scale fine-tuning approach can improve the accuracy of text generation in different scenarios.
- Abstract: The recent emergence of latent diffusion models such as SDXL and SD 1.5 has shown significant capability in generating highly detailed and realistic images. Despite their remarkable ability to produce images, generating accurate text within images still remains a challenging task. In this paper, we examine the validity of fine-tuning approaches in generating legible text within the image. We propose a low-cost approach by leveraging SDXL without any time-consuming training on large-scale datasets. The proposed strategy employs a fine-tuning technique that examines the effects of data refinement levels and synthetic captions. Moreover, our results demonstrate how our small scale fine-tuning approach can improve the accuracy of text generation in different scenarios without the need of additional multimodal encoders. Our experiments show that with the addition of random letters to our raw dataset, our model's performance improves in producing well-formed visual text.
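The random-letter augmentation mentioned in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the caption template, alphabet, and length range are all assumptions.

```python
import random
import string

def make_synthetic_caption(base_caption, min_len=3, max_len=7, rng=None):
    """Append a random letter string to a raw caption so the model sees
    explicit text-rendering targets during fine-tuning.
    Template and parameters are illustrative assumptions."""
    rng = rng or random.Random()
    n = rng.randint(min_len, max_len)
    word = "".join(rng.choice(string.ascii_uppercase) for _ in range(n))
    # Embed the target glyphs in the caption so the diffusion model is
    # conditioned on the exact text it should render.
    return f'{base_caption}, with the text "{word}" written on it'

rng = random.Random(0)  # seeded for reproducibility
caption = make_synthetic_caption("a photo of a storefront sign", rng=rng)
```

Pairing such captions with images during small-scale fine-tuning is the kind of low-cost data refinement the paper describes; no additional multimodal encoder is involved.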
Related papers
- Coherent and Multi-modality Image Inpainting via Latent Space Optimization [61.99406669027195]
PILOT (inPainting via Latent OpTimization) is an optimization approach grounded on a novel semantic centralization and background preservation loss.
Our method searches latent spaces capable of generating inpainted regions that exhibit high fidelity to user-provided prompts while maintaining coherence with the background.
arXiv Detail & Related papers (2024-07-10T19:58:04Z) - ARTIST: Improving the Generation of Text-rich Images with Disentangled Diffusion Models [52.23899502520261]
We introduce a new framework named ARTIST to focus on the learning of text structures.
We finetune a visual diffusion model, enabling it to assimilate textual structure information from the pretrained textual model.
Empirical results on the MARIO-Eval benchmark underscore the effectiveness of the proposed method, showing an improvement of up to 15% in various metrics.
arXiv Detail & Related papers (2024-06-17T19:31:24Z) - Is Synthetic Image Useful for Transfer Learning? An Investigation into Data Generation, Volume, and Utilization [62.157627519792946]
We introduce a novel framework called bridged transfer, which initially employs synthetic images for fine-tuning a pre-trained model to improve its transferability.
We propose a dataset style inversion strategy to improve the stylistic alignment between synthetic and real images.
Our proposed methods are evaluated across 10 different datasets and 5 distinct models, demonstrating consistent improvements.
arXiv Detail & Related papers (2024-03-28T22:25:05Z) - TextCraftor: Your Text Encoder Can be Image Quality Controller [65.27457900325462]
Diffusion-based text-to-image generative models, e.g., Stable Diffusion, have revolutionized the field of content generation.
We propose a fine-tuning approach, TextCraftor, to enhance the performance of text-to-image diffusion models.
arXiv Detail & Related papers (2024-03-27T19:52:55Z) - The Right Losses for the Right Gains: Improving the Semantic Consistency of Deep Text-to-Image Generation with Distribution-Sensitive Losses [0.35898124827270983]
We propose a contrastive learning approach with a novel combination of two loss functions: fake-to-fake loss and fake-to-real loss.
We test this approach on two baseline models: SSAGAN and AttnGAN.
Results show that our approach improves the qualitative results on AttnGAN with style blocks on the CUB dataset.
arXiv Detail & Related papers (2023-12-18T00:05:28Z) - OT-Attack: Enhancing Adversarial Transferability of Vision-Language Models via Optimal Transport Optimization [65.57380193070574]
Vision-language pre-training models are vulnerable to multi-modal adversarial examples.
Recent works have indicated that leveraging data augmentation and image-text modal interactions can enhance the transferability of adversarial examples.
We propose an Optimal Transport-based Adversarial Attack, dubbed OT-Attack.
arXiv Detail & Related papers (2023-12-07T16:16:50Z) - Recognition-Guided Diffusion Model for Scene Text Image Super-Resolution [15.391125077873745]
Scene Text Image Super-Resolution (STISR) aims to enhance the resolution and legibility of text within low-resolution (LR) images.
Previous methods predominantly employ discriminative Convolutional Neural Networks (CNNs) augmented with diverse forms of text guidance.
We introduce RGDiffSR, a Recognition-Guided Diffusion model for scene text image Super-Resolution, which exhibits great generative diversity and fidelity even in challenging scenarios.
arXiv Detail & Related papers (2023-11-22T11:10:45Z) - Fill-Up: Balancing Long-Tailed Data with Generative Models [11.91669614267993]
This paper proposes a new image synthesis pipeline for long-tailed situations using Textual Inversion.
We show that generated images from textual-inverted text tokens effectively align with the real domain.
We also show that real-world data imbalance scenarios can be successfully mitigated by filling up the imbalanced data with synthetic images.
arXiv Detail & Related papers (2023-06-12T16:01:20Z) - Multimodal Data Augmentation for Image Captioning using Diffusion Models [12.221685807426264]
We propose a data augmentation method, leveraging a text-to-image model called Stable Diffusion, to expand the training set.
Experiments on the MS COCO dataset demonstrate the advantages of our approach over several benchmark methods.
Further improvement regarding the training efficiency and effectiveness can be obtained after intentionally filtering the generated data.
arXiv Detail & Related papers (2023-05-03T01:57:33Z) - Lafite2: Few-shot Text-to-Image Generation [132.14211027057766]
We propose a novel method for pre-training a text-to-image generation model on image-only datasets.
It considers a retrieval-then-optimization procedure to synthesize pseudo text features.
It can benefit a wide range of settings, including few-shot, semi-supervised, and fully-supervised learning.
arXiv Detail & Related papers (2022-10-25T16:22:23Z) - Text-to-Image Generation via Implicit Visual Guidance and Hypernetwork [38.55086153299993]
We develop an approach for text-to-image generation that embraces additional retrieval images, driven by a combination of implicit visual guidance loss and generative objectives.
We propose a novel hypernetwork modulated visual-text encoding scheme to predict the weight update of the encoding layer.
Experimental results show that our model guided with additional retrieval visual data outperforms existing GAN-based models.
arXiv Detail & Related papers (2022-08-17T19:25:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.