Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models
- URL: http://arxiv.org/abs/2311.16555v1
- Date: Tue, 28 Nov 2023 06:51:28 GMT
- Title: Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models
- Authors: Ling Fu, Zijie Wu, Yingying Zhu, Yuliang Liu, Xiang Bai
- Abstract summary: We present DiffText, a pipeline that seamlessly blends foreground text with the background's intrinsic features.
With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors.
- Score: 63.99110667987318
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Scene text detection techniques have garnered significant attention due to
their wide-ranging applications. However, existing methods have a high demand
for training data, and obtaining accurate human annotations is labor-intensive
and time-consuming. As a solution, researchers have widely adopted synthetic
text images as a complementary resource to real text images during
pre-training. Yet there is still room for synthetic datasets to enhance the
performance of scene text detectors. We contend that one main limitation of
existing generation methods is the insufficient integration of foreground text
with the background. To alleviate this problem, we present the Diffusion Model
based Text Generator (DiffText), a pipeline that utilizes the diffusion model
to seamlessly blend foreground text regions with the background's intrinsic
features. Additionally, we propose two strategies to generate visually coherent
text with fewer spelling errors. With fewer text instances, our produced text
images consistently surpass other synthetic data in aiding text detectors.
Extensive experiments on detecting horizontal, rotated, curved, and line-level
texts demonstrate the effectiveness of DiffText in producing realistic text
images.
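The blending step the abstract describes can be illustrated with an off-the-shelf inpainting diffusion model: render a mask over the target text region and let the model regenerate it so the text inherits the background's lighting and texture. This is a minimal sketch of the general idea, not the authors' DiffText code; the checkpoint name, prompt, and mask construction are illustrative assumptions.

```python
# Sketch: diffusion-based blending of foreground text into a background.
# NOT the authors' DiffText implementation; model, prompt, and mask are
# illustrative assumptions.
import torch
from PIL import Image, ImageDraw, ImageFont
from diffusers import StableDiffusionInpaintPipeline

def make_text_mask(size, text, xy):
    """White text on a black canvas marks the region to be regenerated."""
    mask = Image.new("L", size, 0)
    draw = ImageDraw.Draw(mask)
    draw.text(xy, text, fill=255, font=ImageFont.load_default())
    return mask

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",  # assumed checkpoint
    torch_dtype=torch.float16,
).to("cuda")

background = Image.open("scene.jpg").convert("RGB").resize((512, 512))
mask = make_text_mask(background.size, "SALE", xy=(180, 230))

blended = pipe(
    prompt='a storefront sign reading "SALE"',
    image=background,
    mask_image=mask,
).images[0]
blended.save("synthetic_scene_text.png")
```

A full pipeline would also verify the rendered glyphs after generation, since the paper's two proposed strategies target exactly the spelling errors that unconstrained diffusion tends to introduce.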
Related papers
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
arXiv Detail & Related papers (2024-02-27T01:57:09Z) - Seek for Incantations: Towards Accurate Text-to-Image Diffusion
- Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering [118.53208190209517]
We propose a framework to learn the proper textual descriptions for diffusion models through prompt learning.
Our method can effectively learn the prompts to improve the matches between the input text and the generated images.
arXiv Detail & Related papers (2024-01-12T03:46:29Z) - Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model [31.819060415422353]
- Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model [31.819060415422353]
Diff-Text is a training-free scene text generation framework for any language.
Our method outperforms the existing method in both the accuracy of text recognition and the naturalness of foreground-background blending.
arXiv Detail & Related papers (2023-12-19T15:18:40Z) - UDiffText: A Unified Framework for High-quality Text Synthesis in
Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves the design and training of a light-weight character-level text encoder, which replaces the original CLIP encoder.
By employing an inference stage refinement process, we achieve a notably high sequence accuracy when synthesizing text in arbitrarily given images.
arXiv Detail & Related papers (2023-12-08T07:47:46Z) - TextDiffuser: Diffusion Models as Text Painters [118.30923824681642]
- TextDiffuser: Diffusion Models as Text Painters [118.30923824681642]
We introduce TextDiffuser, focusing on generating images with visually appealing text that is coherent with its background.
We contribute MARIO-10M, the first large-scale text image dataset with OCR annotations, containing 10 million image-text pairs.
We show that TextDiffuser is flexible and controllable, creating high-quality text images from text prompts alone or together with text template images, and performing text inpainting to reconstruct incomplete images containing text.
arXiv Detail & Related papers (2023-05-18T10:16:19Z) - Exploring Stroke-Level Modifications for Scene Text Editing [86.33216648792964]
Scene text editing (STE) aims to replace text with the desired one while preserving background and styles of the original text.
Previous methods that edit the whole image must learn different translation rules for background and text regions simultaneously.
We propose MOSTEL, a novel network that edits scene text by MOdifying Scene Text images at strokE Level.
arXiv Detail & Related papers (2022-12-05T02:10:59Z) - SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: one pixel-based and one latent-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z) - DUET: Detection Utilizing Enhancement for Text in Scanned or Captured
- DUET: Detection Utilizing Enhancement for Text in Scanned or Captured Documents [1.4866448722906016]
Our proposed model is designed to perform noise reduction and text region enhancement as well as text detection.
We enrich the training data for the model with synthesized document images that are fully labeled for text detection and enhancement.
Our methods are demonstrated on a real document dataset, with performance exceeding that of other text detection methods.
arXiv Detail & Related papers (2021-06-10T07:08:31Z) - Stroke-Based Scene Text Erasing Using Synthetic Data [0.0]
- Stroke-Based Scene Text Erasing Using Synthetic Data [0.0]
Scene text erasing can replace text regions with reasonable content in natural images.
The lack of a large-scale real-world scene text removal dataset prevents existing methods from reaching their full potential.
We enhance and make full use of synthetic text, training our model solely on data generated by the improved synthetic text engine.
This model can partially erase text instances in a scene image with a bounding box provided or work with an existing scene text detector for automatic scene text erasing.
arXiv Detail & Related papers (2021-04-23T09:29:41Z)