A Scene-Text Synthesis Engine Achieved Through Learning from Decomposed
Real-World Data
- URL: http://arxiv.org/abs/2209.02397v2
- Date: Tue, 17 Oct 2023 11:09:43 GMT
- Title: A Scene-Text Synthesis Engine Achieved Through Learning from Decomposed
Real-World Data
- Authors: Zhengmi Tang, Tomo Miyazaki, and Shinichiro Omachi
- Abstract summary: Scene-text image synthesis techniques aim to naturally compose text instances on background scene images.
We propose a Learning-Based Text Synthesis engine (LBTS) that includes a text location proposal network (TLPNet) and a text appearance adaptation network (TAANet).
After training, these networks can be integrated to generate synthetic datasets for scene text analysis tasks.
- Score: 4.096453902709292
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Scene-text image synthesis techniques that aim to naturally compose text
instances on background scene images are very appealing for training deep
neural networks due to their ability to provide accurate and comprehensive
annotation information. Prior studies have explored generating synthetic text
images on two-dimensional and three-dimensional surfaces using rules derived
from real-world observations. Some of these studies have proposed generating
scene-text images through learning; however, owing to the absence of a suitable
training dataset, unsupervised frameworks have been explored to learn from
existing real-world data, which might not yield reliable performance. To ease
this dilemma and facilitate research on learning-based scene text synthesis, we
introduce DecompST, a real-world dataset prepared from public benchmarks,
containing three types of annotations: quadrilateral-level BBoxes, stroke-level
text masks, and text-erased images. Leveraging the DecompST dataset, we propose
a Learning-Based Text Synthesis engine (LBTS) that includes a text location
proposal network (TLPNet) and a text appearance adaptation network (TAANet).
TLPNet first predicts the suitable regions for text embedding, after which
TAANet adaptively adjusts the geometry and color of the text instance to match
the background context. After training, these networks can be integrated and
used to generate synthetic datasets for scene text analysis tasks.
Comprehensive experiments were conducted to validate the effectiveness of the
proposed LBTS against existing methods, and the results indicate that LBTS
generates better pretraining data for scene text detectors.
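The abstract names the two components but carries no code, so below is a minimal runnable sketch of how TLPNet and TAANet could compose at generation time: TLPNet scores the background for text-friendly regions, and TAANet warps and recolours a rendered text patch to fit. The layer shapes, the affine-plus-colour parameterisation, and the naive heatmap blend at the end are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TLPNet(nn.Module):
    """Toy stand-in for the text location proposal network: scores every
    spatial position of the background for text-embedding suitability."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1),
        )

    def forward(self, bg):                  # bg: (B, 3, H, W)
        return torch.sigmoid(self.net(bg))  # (B, 1, H, W) suitability heatmap

class TAANet(nn.Module):
    """Toy stand-in for the text appearance adaptation network: predicts an
    affine warp plus a per-channel colour shift for a rendered text patch."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.geom = nn.Linear(16, 6)    # flattened 2x3 affine matrix
        self.color = nn.Linear(16, 3)   # RGB shift

    def forward(self, text, bg):
        feat = self.encoder(torch.cat([text, bg], dim=1))
        theta = self.geom(feat).view(-1, 2, 3)
        grid = F.affine_grid(theta, list(text.shape), align_corners=False)
        warped = F.grid_sample(text, grid, align_corners=False)
        return warped + self.color(feat).view(-1, 3, 1, 1)

bg = torch.rand(1, 3, 64, 64)     # text-erased background (as in DecompST)
text = torch.rand(1, 3, 64, 64)   # rendered foreground text instance
heatmap = TLPNet()(bg)            # step 1: where is text plausible?
adapted = TAANet()(text, bg)      # step 2: match geometry and colour
composite = bg * (1 - heatmap) + adapted * heatmap  # naive blend, illustration only
print(composite.shape)            # torch.Size([1, 3, 64, 64])
```

In the real engine the heatmap would be thresholded into concrete embedding regions and the blend done per text instance; the point here is only the two-stage handoff the abstract describes.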
Related papers
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
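As a reading aid, here is a minimal sketch of the DPTR idea as the summary above states it: a frozen CLIP text encoder turns strings into "pseudo visual" token sequences, and a recognition decoder is pre-trained against them before any image is seen. The checkpoint name, decoder size, and query handling are assumptions for illustration, not DPTR's actual recipe.

```python
import torch
import torch.nn as nn
from transformers import CLIPTokenizer, CLIPTextModel

# Frozen CLIP text encoder supplies the pseudo visual features
# (assumed checkpoint: openai/clip-vit-base-patch32, hidden size 512).
tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
enc = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32").eval()

decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=2,
)
vocab_head = nn.Linear(512, tok.vocab_size)

words = ["street", "exit"]
batch = tok(words, padding=True, return_tensors="pt")
with torch.no_grad():
    pseudo_visual = enc(**batch).last_hidden_state  # (B, T, 512)

# Pre-train the decoder to recover the characters from the pseudo visual
# embeddings alone -- no images are involved at this stage.
queries = torch.zeros(len(words), 8, 512)  # learnable queries in practice
logits = vocab_head(decoder(queries, pseudo_visual))
print(logits.shape)  # (B, 8, vocab_size), fed to a recognition loss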
- Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models [63.99110667987318]
We present DiffText, a pipeline that seamlessly blends foreground text with the background's intrinsic features.
Even with fewer text instances, the text images we produce consistently surpass other synthetic data in aiding text detectors.
arXiv Detail & Related papers (2023-11-28T06:51:28Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- Stroke-Based Scene Text Erasing Using Synthetic Data [0.0]
Scene text erasing can replace text regions with reasonable content in natural images.
The lack of a large-scale real-world scene-text removal dataset prevents existing methods from performing at full strength.
We enhance and make full use of synthetic text, and consequently train our model only on data generated by the improved synthetic text engine.
This model can partially erase text instances in a scene image with a bounding box provided or work with an existing scene text detector for automatic scene text erasing.
arXiv Detail & Related papers (2021-04-23T09:29:41Z)
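The last entry above describes a model that erases a text instance inside a given bounding box by segmenting its strokes and replacing only those pixels with background content. Here is a minimal numpy sketch of that bbox-in, stroke-mask, fill-strokes-only flow; predict_stroke_mask and the mean-colour fill are placeholders for the learned components, not the paper's networks.

```python
import numpy as np

def erase_text(image, bbox, predict_stroke_mask, fill_strokes):
    """Erase one text instance inside bbox = (x0, y0, x1, y1).
    The two callables stand in for the learned stroke segmentation
    and background-filling stages (assumptions)."""
    x0, y0, x1, y1 = bbox
    crop = image[y0:y1, x0:x1].copy()
    mask = predict_stroke_mask(crop)             # (h, w) bool, True on strokes
    crop[mask] = fill_strokes(crop, mask)[mask]  # touch stroke pixels only
    image[y0:y1, x0:x1] = crop
    return image

def dummy_mask(crop):
    # Fake stroke detector: treat bright pixels as text strokes.
    return crop.mean(axis=-1) > 127

def dummy_fill(crop, mask):
    # Toy inpainting: mean colour of the non-stroke pixels.
    fill = crop[~mask].mean(axis=0).astype(crop.dtype)
    return np.broadcast_to(fill, crop.shape)

img = np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8)
erased = erase_text(img, (8, 8, 40, 24), dummy_mask, dummy_fill)
print(erased.shape)  # (64, 64, 3)
```

Swapping dummy_mask for a detector's output is what lets the method run fully automatically, as the entry notes.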
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.