Reading and Writing: Discriminative and Generative Modeling for
Self-Supervised Text Recognition
- URL: http://arxiv.org/abs/2207.00193v1
- Date: Fri, 1 Jul 2022 03:50:26 GMT
- Title: Reading and Writing: Discriminative and Generative Modeling for
Self-Supervised Text Recognition
- Authors: Mingkun Yang, Minghui Liao, Pu Lu, Jing Wang, Shenggao Zhu, Hualin
Luo, Qi Tian, Xiang Bai
- Abstract summary: We introduce contrastive learning and masked image modeling to learn discrimination and generation of text images.
Our method outperforms previous self-supervised text recognition methods by 10.2%-20.2% on irregular scene text recognition datasets.
Our proposed text recognizer exceeds previous state-of-the-art text recognition methods by an average of 5.3% on 11 benchmarks, with a similar model size.
- Score: 101.60244147302197
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing text recognition methods usually need large-scale training data.
Most of them rely on synthetic training data due to the lack of annotated real
images. However, there is a domain gap between the synthetic data and real
data, which limits the performance of the text recognition models. Recent
self-supervised text recognition methods attempted to utilize unlabeled real
images by introducing contrastive learning, which mainly learns the
discrimination of the text images. Inspired by the observation that humans
learn to recognize text through both reading and writing, we propose to
learn discrimination and generation by integrating contrastive learning and
masked image modeling in our self-supervised method. The contrastive learning
branch is adopted to learn the discrimination of text images, which imitates
the reading behavior of humans. Meanwhile, masked image modeling is introduced
to text recognition for the first time to learn contextual generation of text
images, which resembles the writing behavior. The experimental results show
that our method outperforms previous self-supervised text recognition methods
by 10.2%-20.2% on irregular scene text recognition datasets. Moreover, our
proposed text recognizer exceeds previous state-of-the-art text recognition
methods by an average of 5.3% on 11 benchmarks with a similar model size. We also
demonstrate that our pre-trained model can be easily applied to other
text-related tasks with clear performance gains.
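The abstract describes two complementary self-supervised objectives trained together: a contrastive "reading" branch and a masked "writing" branch. Below is a minimal PyTorch sketch of what such a combined objective could look like; the toy encoder, patch size, masking ratio, second-view augmentation, and equal loss weighting are all illustrative assumptions, not the authors' actual architecture or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

PATCH = 8  # hypothetical patch size, not from the paper

class ToyEncoder(nn.Module):
    """Stand-in backbone mapping patch tokens to features."""
    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Linear(PATCH * PATCH, dim)
        self.mix = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)

    def forward(self, patches):              # (B, N, PATCH*PATCH)
        return self.mix(self.proj(patches))  # (B, N, dim)

def patchify(img):                           # (B, 1, H, W) -> (B, N, PATCH*PATCH)
    return F.unfold(img, PATCH, stride=PATCH).transpose(1, 2)

def info_nce(z1, z2, t=0.1):
    """'Reading' branch: contrastive discrimination between text images.
    Matching views attract; other images in the batch repel."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / t                 # (B, B) similarity matrix
    return F.cross_entropy(logits, torch.arange(z1.size(0)))

def mim_loss(enc, dec, patches, mask):
    """'Writing' branch: regenerate masked patches from visible context."""
    visible = patches * (~mask).unsqueeze(-1).float()  # zero out masked patches
    pred = dec(enc(visible))                 # (B, N, PATCH*PATCH) reconstruction
    return ((pred - patches) ** 2)[mask].mean()

# Usage with random stand-ins for two augmented views of a text-image batch.
enc, dec = ToyEncoder(), nn.Linear(64, PATCH * PATCH)
imgs = torch.rand(4, 1, 32, 128)
view2 = imgs + 0.05 * torch.randn_like(imgs)   # crude second "view"
p = patchify(imgs)
mask = torch.rand(p.shape[:2]) < 0.6           # ~60% masking, an assumption
z1, z2 = enc(p).mean(1), enc(patchify(view2)).mean(1)
loss = info_nce(z1, z2) + mim_loss(enc, dec, p, mask)
loss.backward()
```

In the paper both branches are pre-trained on unlabeled real text images; here a crude noise perturbation stands in for proper view augmentations.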
Related papers
- LEGO: Self-Supervised Representation Learning for Scene Text Images [32.21085469233465]
We propose a Local Explicit and Global Order-aware self-supervised representation learning method for scene text images.
Inspired by the human cognitive process of learning words, we propose three novel pretext tasks for LEGO to model sequential, semantic, and structural features.
The LEGO recognizer achieves superior or comparable performance compared to state-of-the-art scene text recognition methods on six benchmarks.
arXiv Detail & Related papers (2024-08-04T14:07:14Z)
- JSTR: Judgment Improves Scene Text Recognition [0.0]
We present a method for enhancing the accuracy of scene text recognition by judging whether an image and a text string match each other.
This method boosts text recognition accuracy by providing explicit feedback on the data that the model is likely to misrecognize.
arXiv Detail & Related papers (2024-04-09T02:55:12Z)
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words (a toy sketch of this soft-target idea appears after this list).
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
- WordStylist: Styled Verbatim Handwritten Text Generation with Latent Diffusion Models [8.334487584550185]
We present a latent diffusion-based method for word-level generation of styled handwritten text images conditioned on textual content.
Our proposed method is able to generate realistic word image samples in diverse writer styles.
We show that the proposed model produces samples that are aesthetically pleasing, help boost text recognition performance, and achieve writer retrieval scores similar to those of real data.
arXiv Detail & Related papers (2023-03-29T10:19:26Z)
- Self-supervised Character-to-Character Distillation for Text Recognition [54.12490492265583]
We propose a novel self-supervised Character-to-Character Distillation method, CCD, which enables versatile augmentations to facilitate text representation learning.
CCD achieves state-of-the-art results, with average performance gains of 1.38% in text recognition, 1.7% in text segmentation, 0.24 dB (PSNR) and 0.0321 (SSIM) in text super-resolution.
arXiv Detail & Related papers (2022-11-01T05:48:18Z)
- Content and Style Aware Generation of Text-line Images for Handwriting Recognition [4.301658883577544]
We propose a generative method for handwritten text-line images conditioned on both visual appearance and textual content.
Our method is able to produce long text-line samples with diverse handwriting styles.
arXiv Detail & Related papers (2022-04-12T05:52:03Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS leverages vision-language decoding and contrastive learning to achieve text-to-pixel alignment (a rough sketch of this alignment idea appears after this list).
Our proposed framework significantly outperforms the previous state of the art without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- Separating Content from Style Using Adversarial Learning for Recognizing Text in the Wild [103.51604161298512]
We propose an adversarial learning framework for the generation and recognition of multiple characters in an image.
Our framework can be integrated into recent recognition methods to achieve new state-of-the-art recognition accuracy.
arXiv Detail & Related papers (2020-01-13T12:41:42Z)
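As noted in the "Efficiently Leveraging Linguistic Priors" entry above, that paper replaces one-hot training targets with text distributions derived from a large corpus. Here is a toy PyTorch sketch of one way such a soft target could be used; the blending scheme, `alpha` weight, and uniform placeholder prior are guessed stand-ins, not the paper's actual construction.

```python
import torch
import torch.nn.functional as F

def soft_target_loss(logits, target_ids, prior, alpha=0.9):
    """Cross-entropy against a corpus-informed soft distribution.

    logits:     (B, T, V) recognizer outputs over the character vocabulary
    target_ids: (B, T) ground-truth character ids
    prior:      (V,) character distribution estimated from a large text corpus
    """
    one_hot = F.one_hot(target_ids, logits.size(-1)).float()
    soft = alpha * one_hot + (1 - alpha) * prior  # blend in the linguistic prior
    return -(soft * F.log_softmax(logits, dim=-1)).sum(-1).mean()

# Usage with random stand-ins: batch of 2 sequences, length 5, 37-way vocab.
prior = torch.ones(37) / 37                       # placeholder uniform prior
loss = soft_target_loss(torch.randn(2, 5, 37),
                        torch.randint(0, 37, (2, 5)), prior)
```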
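And here is the rough sketch of CRIS-style text-to-pixel contrastive alignment referenced in that entry. The feature shapes, temperature, and the plain binary cross-entropy supervision are illustrative assumptions rather than CRIS's actual design.

```python
import torch
import torch.nn.functional as F

def text_to_pixel_alignment(pixel_feats, text_feat, gt_mask, t=0.07):
    """Score each pixel feature against a pooled sentence embedding and
    supervise the scores with the ground-truth segmentation mask.

    pixel_feats: (B, D, H, W) per-pixel visual features
    text_feat:   (B, D) pooled referring-expression embedding
    gt_mask:     (B, H, W) binary mask of the referred object
    """
    pixel_feats = F.normalize(pixel_feats, dim=1)   # normalize channel dim
    text_feat = F.normalize(text_feat, dim=1)
    # Dot product of every pixel with its sentence embedding -> (B, H, W)
    logits = torch.einsum("bdhw,bd->bhw", pixel_feats, text_feat) / t
    # Pixels inside the mask act as positives, the rest as negatives.
    return F.binary_cross_entropy_with_logits(logits, gt_mask.float())

# Usage with random stand-ins:
loss = text_to_pixel_alignment(torch.randn(2, 64, 28, 28),
                               torch.randn(2, 64),
                               torch.randint(0, 2, (2, 28, 28)))
```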