Bridging Synthetic and Real Worlds for Pre-training Scene Text Detectors
- URL: http://arxiv.org/abs/2312.05286v3
- Date: Wed, 10 Jul 2024 15:49:37 GMT
- Title: Bridging Synthetic and Real Worlds for Pre-training Scene Text Detectors
- Authors: Tongkun Guan, Wei Shen, Xue Yang, Xuehui Wang, Xiaokang Yang
- Abstract summary: FreeReal is a real-domain-aligned pre-training paradigm that leverages the complementary strengths of labeled synthetic data (LSD) and unlabeled real data.
GlyphMix embeds synthetic images as graffiti-like units onto real images.
FreeReal consistently outperforms previous pre-training methods by a substantial margin across four public datasets.
- Score: 54.80516786370663
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Existing scene text detection methods typically rely on extensive real data for training. Due to the lack of annotated real images, recent works have attempted to exploit large-scale labeled synthetic data (LSD) for pre-training text detectors. However, a synth-to-real domain gap emerges, further limiting the performance of text detectors. In contrast, in this work, we propose FreeReal, a real-domain-aligned pre-training paradigm that leverages the complementary strengths of both LSD and unlabeled real data (URD). Specifically, to bridge real and synthetic worlds for pre-training, a glyph-based mixing mechanism (GlyphMix) is tailored for text images. GlyphMix delineates the character structures of synthetic images and embeds them as graffiti-like units onto real images. Without introducing real domain drift, GlyphMix freely yields real-world images with annotations derived from synthetic labels. Furthermore, when given free fine-grained synthetic labels, GlyphMix can effectively bridge the linguistic domain gap stemming from English-dominated LSD to URD in various languages. Without bells and whistles, FreeReal achieves average gains of 1.97%, 3.90%, 3.85%, and 4.56% in improving the performance of FCENet, PSENet, PANet, and DBNet methods, respectively, consistently outperforming previous pre-training methods by a substantial margin across four public datasets. Code is available at https://github.com/SJTU-DeepVisionLab/FreeReal.
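The glyph-level compositing at the heart of GlyphMix can be pictured with a minimal sketch, assuming each synthetic image comes with a binary character mask and box annotations (hypothetical inputs for illustration; the actual GlyphMix pipeline derives the character structures itself and handles alignment and domain drift far more carefully):

```python
import numpy as np

def glyphmix(synth_img, glyph_mask, real_img, synth_boxes):
    """Paste only the glyph (character-stroke) pixels of a synthetic
    image onto a real image, keeping the synthetic box labels.

    synth_img, real_img: (H, W, 3) uint8 arrays of the same size.
    glyph_mask: (H, W) binary mask of character structures (assumed given).
    synth_boxes: text annotations of the synthetic image, inherited as-is.
    """
    mixed = real_img.copy()
    m = glyph_mask.astype(bool)
    # Transfer character strokes only, like graffiti: the surrounding
    # real-world background is untouched, so the mixed image stays in
    # the real domain while the synthetic labels remain valid.
    mixed[m] = synth_img[m]
    return mixed, synth_boxes
```

Because only strokes are transferred, the result keeps real-world background statistics, which is what lets annotations carry over from synthetic labels without introducing real-domain drift.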
Related papers
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder (see the sketch after this entry).
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
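A minimal sketch of the DPTR idea above, assuming a frozen CLIP text encoder that returns token-level embeddings and a hypothetical attention-based recognition decoder (neither stands for the paper's actual modules):

```python
import torch
import torch.nn as nn

def dptr_pretrain_step(clip_text_encoder, decoder, tokens, optimizer):
    """One pre-training step that uses text alone: token embeddings from
    the CLIP text encoder act as pseudo visual features for the decoder.

    tokens: (B, L) long tensor of character/token ids.
    clip_text_encoder: frozen, returns (B, L, D) token-level embeddings.
    decoder: hypothetical attention decoder returning (B, L, V) logits.
    """
    with torch.no_grad():
        pseudo_visual = clip_text_encoder(tokens)  # stands in for image features
    logits = decoder(pseudo_visual)                # decode characters from pseudo features
    loss = nn.functional.cross_entropy(logits.flatten(0, 1), tokens.flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```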
- Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models [63.99110667987318]
We present DiffText, a pipeline that seamlessly blends foreground text with the background's intrinsic features.
Despite containing fewer text instances, our synthesized text images consistently surpass other synthetic data in aiding text detectors.
arXiv Detail & Related papers (2023-11-28T06:51:28Z)
- ALIP: Adaptive Language-Image Pre-training with Synthetic Caption [78.93535202851278]
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks.
The presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning.
We propose Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic captions (see the loss sketch after this entry).
arXiv Detail & Related papers (2023-08-16T15:19:52Z)
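One plausible reading of ALIP's bi-path supervision is a CLIP-style InfoNCE loss computed against both text paths; the fixed weights below stand in for the adaptive sample and pair weights the paper actually learns:

```python
import torch
import torch.nn.functional as F

def bi_path_loss(img_emb, raw_txt_emb, syn_txt_emb,
                 w_raw=0.5, w_syn=0.5, tau=0.07):
    """Symmetric InfoNCE against raw-text and synthetic-caption embeddings.

    All inputs are assumed L2-normalized with shape (B, D); matched pairs
    share the same batch index.
    """
    def info_nce(a, b):
        logits = a @ b.t() / tau
        targets = torch.arange(a.size(0), device=a.device)
        # Symmetrize over image-to-text and text-to-image directions.
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

    return w_raw * info_nce(img_emb, raw_txt_emb) + w_syn * info_nce(img_emb, syn_txt_emb)
```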
- Exploring Stroke-Level Modifications for Scene Text Editing [86.33216648792964]
Scene text editing (STE) aims to replace text with desired content while preserving the background and style of the original text.
Previous methods of editing the whole image have to learn different translation rules of background and text regions simultaneously.
We propose a novel network for MOdifying Scene Text images at the strokE Level (MOSTEL).
arXiv Detail & Related papers (2022-12-05T02:10:59Z)
- Self-Supervised Text Erasing with Controllable Image Synthesis [33.60862002159276]
We study an unsupervised scenario by proposing a novel Self-supervised Text Erasing framework.
We first design a style-aware image synthesis function to generate synthetic images with diverse styled texts.
To bridge the text-style gap between synthetic and real-world data, a policy network is constructed to control the synthesis mechanisms (see the sketch after this entry).
The proposed method has been extensively evaluated on both the PosterErase and the widely used SCUT-EnsText datasets.
arXiv Detail & Related papers (2022-04-27T07:21:55Z)
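The policy-network idea can be sketched as a small REINFORCE loop over discrete synthesis choices; the discrete style space and `reward_fn` (e.g. feedback from the erasing model on how useful a synthesized sample is) are illustrative assumptions, not the paper's exact setup:

```python
import torch
import torch.nn as nn

class SynthesisPolicy(nn.Module):
    """Maps a state (e.g. features of the target image) to a distribution
    over discrete text-style choices used by the synthesis function."""
    def __init__(self, state_dim, n_styles):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_styles))

    def forward(self, state):
        return torch.distributions.Categorical(logits=self.net(state))

def policy_step(policy, state, reward_fn, optimizer):
    dist = policy(state)
    action = dist.sample()          # which style to synthesize with
    reward = reward_fn(action)      # hypothetical feedback signal, shape (B,)
    loss = -(dist.log_prob(action) * reward).mean()  # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```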
- Realistic Blur Synthesis for Learning Image Deblurring [20.560205377203957]
We present a novel blur synthesis pipeline that can synthesize more realistic blur.
We also present RSBlur, a novel dataset that contains real blurred images and the corresponding sequences of sharp images.
arXiv Detail & Related papers (2022-02-17T17:14:48Z)
- Synthetic-to-Real Unsupervised Domain Adaptation for Scene Text Detection in the Wild [11.045516338817132]
We propose a synthetic-to-real domain adaptation method for scene text detection.
A text self-training (TST) method and adversarial text instance alignment (ATA) are introduced for domain-adaptive scene text detection (see the sketch after this entry).
Results demonstrate the effectiveness of the proposed method with up to 10% improvement.
arXiv Detail & Related papers (2020-09-03T16:16:34Z)
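Adversarial feature alignment of this kind is commonly implemented with a gradient reversal layer (as in DANN); the sketch below shows that generic mechanism applied to pooled text-instance features, not the paper's exact ATA code:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) gradients in the
    backward pass, so the backbone is trained to fool the discriminator."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def alignment_loss(instance_feats, domain_labels, discriminator, lambd=1.0):
    """instance_feats: (N, D) pooled text-instance features from both domains.
    domain_labels: (N,) with 0 for synthetic and 1 for real instances.
    discriminator: hypothetical small classifier returning (N, 1) logits."""
    reversed_feats = GradReverse.apply(instance_feats, lambd)
    logits = discriminator(reversed_feats)
    return nn.functional.binary_cross_entropy_with_logits(
        logits.squeeze(-1), domain_labels.float())
```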
- Syn2Real Transfer Learning for Image Deraining using Gaussian Processes [92.15895515035795]
CNN-based methods for image deraining have achieved excellent performance in terms of reconstruction error as well as visual quality.
Due to challenges in obtaining fully labeled real-world image deraining datasets, existing methods are trained only on synthetically generated data.
We propose a Gaussian Process-based semi-supervised learning framework that enables the network to learn to derain using a synthetic dataset (see the sketch after this entry).
arXiv Detail & Related papers (2020-06-10T00:33:18Z)
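The Gaussian Process component can be sketched as GP posterior-mean regression from labeled synthetic latent features to pseudo targets for unlabeled real ones; the RBF kernel, shapes, and the name `gp_pseudo_labels` are illustrative assumptions (the paper formulates this on intermediate network features):

```python
import numpy as np

def gp_pseudo_labels(z_synth, y_synth, z_real, length_scale=1.0, noise=1e-3):
    """GP posterior mean K* K^-1 y as pseudo labels for real samples.

    z_synth: (N, D) latent features of labeled synthetic images.
    y_synth: (N,) or (N, K) targets attached to those features.
    z_real:  (M, D) latent features of unlabeled real images.
    """
    def rbf(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / length_scale ** 2)

    K = rbf(z_synth, z_synth) + noise * np.eye(len(z_synth))   # kernel + jitter
    K_star = rbf(z_real, z_synth)
    # Posterior mean, usable as a target in an unsupervised consistency loss.
    return K_star @ np.linalg.solve(K, y_synth)
```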
- UnrealText: Synthesizing Realistic Scene Text Images from the Unreal World [18.608641449975124]
UnrealText is an efficient image synthesis method that renders realistic images via a 3D graphics engine.
Comprehensive experiments verify its effectiveness on both scene text detection and recognition.
We generate a multilingual version for future research into multilingual scene text detection and recognition.
arXiv Detail & Related papers (2020-03-24T01:37:42Z)