Weakly Supervised Scene Text Generation for Low-resource Languages
- URL: http://arxiv.org/abs/2306.14269v2
- Date: Tue, 27 Jun 2023 15:34:17 GMT
- Title: Weakly Supervised Scene Text Generation for Low-resource Languages
- Authors: Yangchen Xie, Xinyuan Chen, Hongjian Zhan, Palaiahankote Shivakumara,
Bing Yin, Cong Liu, Yue Lu
- Abstract summary: A large number of annotated training images are crucial for training successful scene text recognition models.
Existing scene text generation methods typically rely on a large amount of paired data, which is difficult to obtain for low-resource languages.
We propose a novel weakly supervised scene text generation method that leverages a few recognition-level labels as weak supervision.
- Score: 19.243705770491577
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: A large number of annotated training images are crucial for training
successful scene text recognition models. However, collecting sufficient
datasets can be a labor-intensive and costly process, particularly for
low-resource languages. To address this challenge, auto-generating text data
has shown promise in alleviating the problem. Unfortunately, existing scene
text generation methods typically rely on a large amount of paired data, which
is difficult to obtain for low-resource languages. In this paper, we propose a
novel weakly supervised scene text generation method that leverages a few
recognition-level labels as weak supervision. The proposed method is able to
generate a large number of scene text images with diverse backgrounds and font
styles through cross-language generation. Our method disentangles the content
and style features of scene text images, with the former representing textual
information and the latter representing characteristics such as font,
alignment, and background. To preserve the complete content structure of
generated images, we introduce an integrated attention module. Furthermore, to
bridge the style gap between different languages, we incorporate a
pre-trained font classifier. We evaluate our method using state-of-the-art
scene text recognition models. Experiments demonstrate that our generated scene
text significantly improves scene text recognition accuracy and helps
achieve higher accuracy when complemented with other generative methods.
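As a concrete illustration of the disentanglement described above, here is a minimal PyTorch sketch, assuming a simple convolutional content encoder, a global style encoder, and a generator that fuses the two. All module names, layer sizes, and the fusion scheme are illustrative assumptions rather than the authors' architecture, and the sketch omits the paper's integrated attention module, pre-trained font classifier, and all training losses.

```python
# Minimal sketch of content/style disentanglement for scene text generation.
# NOTE: all shapes, layers, and the fusion scheme are assumptions for
# illustration; this is not the architecture from the paper.
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Extracts a spatial feature map representing textual content (glyphs)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class StyleEncoder(nn.Module):
    """Pools a whole image into one vector for font, alignment, background."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).flatten(1)  # (batch, dim)

class Generator(nn.Module):
    """Broadcasts the style vector over the content map and decodes an image."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.fuse = nn.Conv2d(dim * 2, dim, 1)
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(dim, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, content: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        style_map = style[:, :, None, None].expand(-1, -1, *content.shape[2:])
        return self.decode(self.fuse(torch.cat([content, style_map], dim=1)))

# Cross-language generation: take content from a rendered word image in the
# low-resource target language, and style from a real scene text image (which
# may come from a high-resource language).
content_enc, style_enc, generator = ContentEncoder(), StyleEncoder(), Generator()
content_img = torch.randn(1, 3, 64, 256)  # stand-in for a rendered word image
style_img = torch.randn(1, 3, 64, 256)    # stand-in for a real scene text crop
fake = generator(content_enc(content_img), style_enc(style_img))
print(fake.shape)  # torch.Size([1, 3, 64, 256])
```

In the weakly supervised setting the abstract describes, only a few recognition-level labels would be available to supervise the content branch, and the pre-trained font classifier would plausibly act as an additional constraint on the style branch to bridge the gap between languages.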
Related papers
- TextMastero: Mastering High-Quality Scene Text Editing in Diverse Languages and Styles [12.182588762414058]
Scene text editing aims to modify text on images while keeping the style of the newly generated text similar to that of the original.
Recent works leverage diffusion models, showing improved results, yet still face challenges.
We present TextMastero, a carefully designed multilingual scene text editing architecture based on latent diffusion models (LDMs).
arXiv Detail & Related papers (2024-08-20T08:06:09Z)
- Layout Agnostic Scene Text Image Synthesis with Diffusion Models [42.37340959594495]
SceneTextGen is a novel diffusion-based model specifically designed to circumvent the need for a predefined layout stage.
The novelty of SceneTextGen lies in its integration of three key components: a character-level encoder that captures detailed typographic properties, plus a character-level instance segmentation model and a word-level spotting model that address unwanted text generation and minor character inaccuracies.
arXiv Detail & Related papers (2024-06-03T07:20:34Z)
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
- Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model [31.819060415422353]
Diff-Text is a training-free scene text generation framework for any language.
Our method outperforms existing methods in both text recognition accuracy and the naturalness of foreground-background blending.
arXiv Detail & Related papers (2023-12-19T15:18:40Z)
- Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models [63.99110667987318]
We present DiffText, a pipeline that seamlessly blends foreground text with the background's intrinsic features.
With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors.
arXiv Detail & Related papers (2023-11-28T06:51:28Z)
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
- Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors [58.71128866226768]
Recent text-to-image generation methods have incrementally improved the generated image fidelity and text relevancy.
We propose a novel text-to-image method that addresses these gaps by enabling a simple control mechanism complementary to text in the form of a scene.
Our model achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high-fidelity images at a resolution of 512x512 pixels.
arXiv Detail & Related papers (2022-03-24T15:44:50Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- Exploring Font-independent Features for Scene Text Recognition [22.34023249700896]
Scene text recognition (STR) has been extensively studied in the last few years.
Many recently-proposed methods are specially designed to accommodate the arbitrary shape, layout and orientation of scene texts.
These methods, in which font features and content features of characters are entangled, perform poorly at recognizing text in novel font styles on scene images.
arXiv Detail & Related papers (2020-09-16T03:36:59Z)
- Improving Disentangled Text Representation Learning with Information-Theoretic Guidance [99.68851329919858]
The discrete nature of natural language makes disentangling textual representations more challenging.
Inspired by information theory, we propose a novel method that effectively manifests disentangled representations of text.
Experiments on both conditional text generation and text-style transfer demonstrate the high quality of our disentangled representation.
arXiv Detail & Related papers (2020-06-01T03:36:01Z)